BUG: read_csv is failing with an encoding different from UTF-8 and memory_map set to True in version 1.2.4 #40986
Comments
cc @twoertwein, any idea?

Does it work with the python engine on 1.1.x and 1.2.x? I will have more time to look into this next week.

It was easier to fix than expected :)

Nope, I was too optimistic :(
I looked into pandas/io/common.py and pandas/io/parsers.py and found some points to discuss.

1. In common.py (line 783), the bytes are always decoded with "utf-8" when the memory_map option is set; the user's encoding option has no effect:

```python
newline = newbytes.decode("utf-8")
```

2. UTF-16 bytes returned by f.readline() cannot be decoded directly when "\n" is at the end of the string. Code snippet:

```python
# 1. write some text
data = ["The quick brown", "fox jumps over", "the lazy dog"]
with open("utf16_test.txt", "w", encoding="utf-16") as f:
    for line in data:
        f.write(f"{line}\n")

# 2. read in "r" mode with "utf-16" encoding and readlines()
with open("utf16_test.txt", "r", encoding="utf-16") as f:
    lines = f.readlines()
print(f"lines lens: {len(lines)}")
for line in lines:
    print(line, end="")
# lines lens: 3
# The quick brown
# fox jumps over
# the lazy dog

# 3. read in "rb" mode and readlines()
with open("utf16_test.txt", "rb") as f:
    lines = f.readlines()
print(f"lines lens: {len(lines)}")
for each in lines:
    print(each)
print("-" * 50)
for each in data:
    print(each.encode("utf-16"))
# lines lens: 4
# b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00\n'
# b'\x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00s\x00 \x00o\x00v\x00e\x00r\x00\n'  # abnormal bytes
# b'\x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00\n'  # abnormal bytes
# b'\x00'  # abnormal bytes
# --------------------------------------------------
# b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00'
# b'\xff\xfef\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00s\x00 \x00o\x00v\x00e\x00r\x00'
# b'\xff\xfet\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00'

# lines[0].decode("utf-16") raises
# UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 32: truncated data
# but it works fine when the trailing \n is ignored:
lines[0][:-1].decode("utf-16")
# returns "The quick brown"

strs = "The quick brown\n".encode("utf-16")
# b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00\n\x00'
# a \x00 is appended after \n

# 4. read the whole file in "rb" mode
with open("utf16_test.txt", "rb") as f:
    bytes_data = f.read()
print(bytes_data)
print(bytes_data.decode("utf-16"))
# b'\xff\xfeT\x00h\x00e\x00 \x00q\x00u\x00i\x00c\x00k\x00 \x00b\x00r\x00o\x00w\x00n\x00\n\x00f\x00o\x00x\x00 \x00j\x00u\x00m\x00p\x00s\x00 \x00o\x00v\x00e\x00r\x00\n\x00t\x00h\x00e\x00 \x00l\x00a\x00z\x00y\x00 \x00d\x00o\x00g\x00\n\x00'
# The quick brown
# fox jumps over
# the lazy dog
```

3. After the file handle is mapped into a mmap object, the behavior of the wrapped handle differs from that of the raw handle. Maybe related to 2? Code snippet:

```python
import mmap
from typing import cast

import pandas as pd
from pandas.io.common import _MMapWrapper

df = pd.DataFrame({'name': ['Raphael', 'Donatello', 'Miguel Angel', 'Leonardo'],
                   'mask': ['red', 'purple', 'orange', 'blue'],
                   'weapon': ['sai', 'bo staff', 'nunchunk', 'katana']})
df.to_csv("tmnt.csv", index=False, encoding="utf-16")

file_handle = open("tmnt.csv", "r", encoding="utf-16", newline="")
for _ in range(3):
    print(next(file_handle))
# Raphael,red,sai
#
# Donatello,purple,bo staff
#
# Miguel Angel,orange,nunchunk
#

wrapped_handle = cast(mmap.mmap, _MMapWrapper(file_handle))
for _ in range(3):
    print(next(wrapped_handle))
# b'\xff\xfen\x00a\x00m\x00e\x00,\x00m\x00a\x00s\x00k\x00,\x00w\x00e\x00a\x00p\x00o\x00n\x00\n'
# b'\x00R\x00a\x00p\x00h\x00a\x00e\x00l\x00,\x00r\x00e\x00d\x00,\x00s\x00a\x00i\x00\n'
# b'\x00D\x00o\x00n\x00a\x00t\x00e\x00l\x00l\x00o\x00,\x00p\x00u\x00r\x00p\x00l\x00e\x00,\x00b\x00o\x00 \x00s\x00t\x00a\x00f\x00f\x00\n'
```
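The byte dumps above suggest one possible direction for a fix (a sketch only, not pandas' actual patch; `mmap_readlines` is a hypothetical helper): decode the memory-mapped buffer with the caller's encoding instead of splitting on `b"\n"` and decoding each chunk as UTF-8.

```python
import mmap

def mmap_readlines(path: str, encoding: str) -> list[str]:
    """Read all lines from a memory-mapped file, decoding with the
    caller-supplied encoding so multi-byte newlines (b'\n\x00' in
    UTF-16-LE) are handled by the codec, not by byte-level splitting."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        text = mm.read().decode(encoding)  # BOM is consumed here
    return text.splitlines(keepends=True)

# demo: a UTF-16 file whose newlines are b'\n\x00' at the byte level
with open("utf16_test.txt", "w", encoding="utf-16") as f:
    f.write("The quick brown\nfox jumps over\nthe lazy dog\n")

print(mmap_readlines("utf16_test.txt", "utf-16"))
```

Decoding the whole buffer once sidesteps the "abnormal bytes" problem entirely, at the cost of materializing the decoded text in memory.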
@amznero if you find a solution, please feel free to open a PR! About 1): I'm not sure how
@twoertwein Sorry for the late reply. I thought this exception was caused by the decoding at line 779 in common.py. It works fine with UTF-8, which adds no extra characters at the tail and is compatible with ASCII. But UTF-16 and UTF-32 have big-endian and little-endian byte orders (the default is little-endian), so a "\x00" is appended after each ASCII character: \n becomes \n\x00. Therefore:

```python
In [1]: "\n".encode("utf-16")
Out[1]: b'\xff\xfe\n\x00'  # \xff\xfe is the BOM of UTF-16
```

The same error occurs with UTF-32 encoding.
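To make the BOM point concrete (a quick illustration, not part of the original thread): Python's generic "utf-16"/"utf-32" codecs prepend a BOM and use the platform's native byte order, while the explicit `-le`/`-be` variants emit no BOM and fix the order.

```python
# The -le / -be variants emit no BOM, so the raw code-unit layout is visible.
print("\n".encode("utf-16-le"))  # b'\n\x00'
print("\n".encode("utf-16-be"))  # b'\x00\n'
print("\n".encode("utf-32-le"))  # b'\n\x00\x00\x00'

# Encoding an empty string with the generic codec yields just the BOM
# (b'\xff\xfe' on little-endian machines, b'\xfe\xff' on big-endian ones).
print("".encode("utf-16"))
```

This is why searching a UTF-16 buffer for the single byte `b"\n"` splits lines mid-code-unit: the real line terminator is two bytes wide.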
@amznero In the
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
This works perfectly with version 1.1.1, but it no longer does in 1.2.4. Since memory mapping is a nice feature that removes I/O overhead, I think it is worth looking into this, also because the regression could break many things.
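The original copy-pastable sample was not preserved in this capture; a minimal reproduction along the lines of the report (the file name is illustrative) would be:

```python
import pandas as pd

# write a small CSV in a non-UTF-8 encoding
df = pd.DataFrame({"name": ["Raphael", "Donatello"], "mask": ["red", "purple"]})
df.to_csv("repro.csv", index=False, encoding="utf-16")

# reading without memory_map works on every version
plain = pd.read_csv("repro.csv", encoding="utf-16")

# on pandas 1.2.4 the memory-mapped read raised a decode error;
# on 1.1.x and on fixed versions both calls return the same frame
try:
    mapped = pd.read_csv("repro.csv", encoding="utf-16", memory_map=True)
    print("memory_map read OK:", plain.equals(mapped))
except UnicodeDecodeError as exc:
    print("bug reproduced:", exc)
```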
Expected Output
Traceback
Output of pd.show_versions()