
BUG: read_csv is failing with an encoding different from UTF-8 and memory_map set to True in version 1.2.4 #40986


Closed
2 of 3 tasks
emjironal opened this issue Apr 16, 2021 · 9 comments · Fixed by #40994
Labels
Bug · IO CSV (read_csv, to_csv) · Regression (functionality that used to work in a prior pandas version)
Milestone
1.2.5

Comments

@emjironal

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({'name': ['Raphael', 'Donatello', 'Miguel Angel', 'Leonardo'],
                   'mask': ['red', 'purple', 'orange', 'blue'],
                   'weapon': ['sai', 'bo staff', 'nunchunk', 'katana']})
df.to_csv("tmnt.csv", index=False, encoding="utf-16")
pd.read_csv(filepath_or_buffer="tmnt.csv", encoding="utf-16", sep=",", header=0, decimal=".", memory_map=True)

Problem description

This works perfectly with version 1.1.1, but it fails in 1.2.4. Since memory_map is a nice feature that removes I/O overhead, I think this is worth looking into; the regression could also break existing code.
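
If it helps with scoping: based on the traceback below (an assumption on my part, not verified elsewhere in this thread), the failure seems limited to the memory-mapped path, so the same call with memory_map=False is expected to work on 1.2.4:

df = pd.read_csv(
    filepath_or_buffer="tmnt.csv",
    encoding="utf-16",
    sep=",",
    header=0,
    decimal=".",
    memory_map=False,  # presumed workaround: avoid the mmap path entirely
)
print(df)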

Expected Output

           name    mask    weapon
0       Raphael     red       sai
1     Donatello  purple  bo staff
2  Miguel Angel  orange  nunchunk
3      Leonardo    blue    katana

Traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 462, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 819, in __init__
    self._engine = self._make_engine(self.engine)
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 1050, in _make_engine
    return mapping[engine](self.f, **self.options)  # type: ignore[call-arg]
  File "./lib/python3.8/site-packages/pandas/io/parsers.py", line 1898, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 518, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 649, in pandas._libs.parsers.TextReader._get_header
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e
python           : 3.8.8.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 20.2.0
Version          : Darwin Kernel Version 20.2.0: Wed Dec  2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64
machine          : x86_64
processor        : i386
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8

pandas           : 1.2.4
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 54.0.0
Cython           : None
pytest           : 6.2.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.2
html5lib         : None
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : 3.0.6
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.2
sqlalchemy       : 1.4.1
tables           : None
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : None
numba            : None
@emjironal added the Bug and Needs Triage labels on Apr 16, 2021
@phofl
Member

phofl commented Apr 16, 2021

cc @twoertwein any idea?

@phofl added the IO CSV (read_csv, to_csv) label and removed the Needs Triage label on Apr 16, 2021
@twoertwein
Member

Does it work with the python engine on 1.1.x and 1.2.x? pd.read_csv(filepath_or_buffer="tmnt.csv", encoding="utf-16", memory_map=True, ..., engine="python").

I will have more time to look into this next week.

@twoertwein added the Regression (functionality that used to work in a prior pandas version) label on Apr 17, 2021
@twoertwein
Member

It was easier to fix than expected :)

@twoertwein
Member

Nope, I was too optimistic :(

@phofl phofl added this to the 1.2.5 milestone Apr 17, 2021
@amznero
Contributor

amznero commented Apr 17, 2021

I looked into pandas/io/common.py and pandas/io/parsers.py and found some points to discuss.

1. In common.py, line 783, the bytes are always decoded as "utf-8" when the user sets the memory_map option; the encoding option has no effect:

newline = newbytes.decode("utf-8")

2. A "utf-16" byte string (e.g. from f.readline()) cannot be decoded directly if "\n" is at the end of the string.

Code Snippet.

# 1. write some txt

data = ["The quick brown", "fox jumps over", "the lazy dog"]
with open("utf16_test.txt", "w", encoding="utf-16") as f:
    for line in data:
        f.write(f"{line}\n")

# 2. read in "r" mode + "utf-16" encoding and readlines
with open("utf16_test.txt", "r", encoding="utf-16") as f:
    lines = f.readlines()
    print(f"lines lens: {len(lines)}")
    for line in lines:
        print(line, end="")

# lines lens: 3
# The quick brown
# fox jumps over
# the lazy dog

# 3. read in "rb" mode and readlines
with open("utf16_test.txt", "rb") as f:
    lines = f.readlines()
    print(f"lines lens: {len(lines)}")
    for each in lines:
        print(each)
    print("-"*50)
    for each in data:
        print(each.encode("utf-16"))

# lines lens: 4
# b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00\n'
# b'\x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00s\x00 \x00o\x00v\x00e\x00r\x00\n' # abnormal bytes
# b'\x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00\n'           # abnormal bytes
# b'\x00'                                                                         # abnormal bytes
# --------------------------------------------------
# b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00'
# b'\xff\xfef\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00s\x00 \x00o\x00v\x00e\x00r\x00'
# b'\xff\xfet\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00'


# lines[0].decode("utf-8") whill raise "UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 32: truncated data"

# works fine when ignore \n
lines[0][:-1].decode("utf-16")
# return "The quick brown"

strs = "The quick brown\n".encode("utf-16")
# \xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00\n\x00
# append \x00 after \n

# 4. read in "rb" mode and decode the whole file at once

with open("utf16_test.txt", "rb") as f:
    bytes_data = f.read()
    print(bytes_data)
    print(bytes_data.decode("utf-16"))

# b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00\n\x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00s\x00 \x00o\x00v\x00e\x00r\x00\n\x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00\n\x00'
# The quick brown
# fox jumps over
# the lazy dog

3. After the file handle is mapped into a mmap object, the behavior of the wrapped handle differs from that of the raw handle. Maybe related to 2?

Code snippet

import mmap
import pandas as pd
from typing import cast
from pandas.io.common import _MMapWrapper

df = pd.DataFrame({'name': ['Raphael', 'Donatello', 'Miguel Angel', 'Leonardo'],
                   'mask': ['red', 'purple', 'orange', 'blue'],
                   'weapon': ['sai', 'bo staff', 'nunchunk', 'katana']})

df.to_csv("tmnt.csv", index=False, encoding="utf-16")

file_handle = open("tmnt.csv", "r", encoding="utf-16", newline="")

for _ in range(3):
    print(next(file_handle))

# Raphael,red,sai
# 
# Donatello,purple,bo staff
# 
# Miguel Angel,orange,nunchunk
# 

wrapped_handle = cast(mmap.mmap, _MMapWrapper(file_handle))
for _ in range(3):
    # readline is delegated to the underlying mmap object and returns raw bytes
    print(wrapped_handle.readline())

# b'\xff\xfen\x00a\x00m\x00e\x00,\x00m\x00a\x00s\x00k\x00,\x00w\x00e\x00a\x00p\x00o\x00n\x00\n'
# b'\x00R\x00a\x00p\x00h\x00a\x00e\x00l\x00,\x00r\x00e\x00d\x00,\x00s\x00a\x00i\x00\n'
# b'\x00D\x00o\x00n\x00a\x00t\x00e\x00l\x00l\x00o\x00,\x00p\x00u\x00r\x00p\x00l\x00e\x00,\x00b\x00o\x00 \x00s\x00t\x00a\x00f\x00f\x00\n'

@twoertwein
Member

@amznero if you find a solution, please feel free to open a PR!

About 1): I'm not sure how mmap interacts with file handles. In my quick PR (that fails because of 2)+3)) I assumed that mmap bypasses the en/decoding of the original file handle, meaning https://github.com/pandas-dev/pandas/blob/v1.2.4/pandas/io/common.py#L783 should use the user-provided encoding.
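
Roughly that idea as a sketch (the class and attribute names below are made up for illustration; this is not the actual pandas code or PR diff): pass the caller's encoding into the mmap wrapper and use it in __next__. As noted, this alone still fails for UTF-16/UTF-32 because each raw line is decoded on its own (points 2 and 3 above).

import mmap

# Sketch only -- names and structure are assumptions, not the pandas implementation.
class EncodedMMapWrapper:
    def __init__(self, f, encoding: str = "utf-8"):
        self.encoding = encoding
        self.mmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def __iter__(self):
        return self

    def __next__(self) -> str:
        newbytes = self.mmap.readline()
        # Decoding each raw line independently is exactly what breaks for
        # UTF-16/UTF-32: readline() splits at b"\n", which can cut a
        # multi-byte character in half, so this decode still needs to be
        # incremental (see the suggestion further down in the thread).
        newline = newbytes.decode(self.encoding)
        if newline == "":
            raise StopIteration
        return newline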

@amznero
Contributor

amznero commented Apr 26, 2021

@twoertwein Sorry for the late reply

I think this exception is caused by decoding in binary mode: the file handle truncates lines at the raw b'\n', before the full line terminator has been read.

newbytes = self.mmap.readline()

Line 779 in common.py works fine with UTF-8, which adds no extra bytes around the newline and is ASCII-compatible.

But UTF-16 and UTF-32 have big-endian and little-endian byte orders (the default is little-endian), so a "\x00" is appended after each ASCII byte: \n becomes \n\x00. readline therefore truncates the line prematurely, which results in a decoding failure (see the example in 2.3 above).

In [1]: "\n".encode("utf-16")
Out[1]: b'\xff\xfe\n\x00' #\xff\xfe is BOM of UTF-16

The same error will occur in UTF-32 encoding.
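
For illustration, the corresponding check for UTF-32 shows the same pattern with a wider BOM and padding (expected output on a little-endian build):

In [2]: "\n".encode("utf-32")
Out[2]: b'\xff\xfe\x00\x00\n\x00\x00\x00' # \xff\xfe\x00\x00 is the UTF-32-LE BOM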

related stackoverflow: Python3 reading mixed text/binary data line-by-line

@twoertwein
Member

@amznero In the __next__ function we are getting bytes line-by-line, so we need to decode them incrementally (codecs.IncrementalDecoder instead of .decode()).
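
A minimal sketch of that idea (my own illustration with a hypothetical helper decode_lines_incrementally, not the eventual pandas fix): an incremental decoder keeps the BOM and any half-consumed code unit as internal state, so byte "lines" split at the raw b"\n" still decode cleanly.

import codecs

def decode_lines_incrementally(byte_chunks, encoding):
    # Hypothetical helper: decode mmap-style byte "lines" (split at the raw
    # b"\n") with a stateful decoder so the BOM and split multi-byte
    # characters carry over between chunks.
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    buffer = ""
    for chunk in byte_chunks:
        buffer += decoder.decode(chunk)
        # Emit complete text lines; keep any partial line in the buffer.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line
    buffer += decoder.decode(b"", final=True)
    if buffer:
        yield buffer

# The "abnormal" rb-mode lines from the earlier snippet decode cleanly this way:
with open("utf16_test.txt", "rb") as f:
    for text_line in decode_lines_incrementally(f.readlines(), "utf-16"):
        print(text_line)
# The quick brown
# fox jumps over
# the lazy dog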

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue May 24, 2021
@simonjayhawkins
Member

first bad commit: [a648fb2] REF/BUG/TYP: read_csv shouldn't close user-provided file handles (#36997)
