read_csv can't roundtrip with UTF16/32 encodings #24130

Closed
WillAyd opened this issue Dec 6, 2018 · 1 comment · Fixed by #30771
Labels
Bug IO CSV read_csv, to_csv
Milestone

Comments

@WillAyd
Member

WillAyd commented Dec 6, 2018

This works fine:

In [1]: import tempfile

In [2]: import pandas as pd

In [3]: with tempfile.TemporaryFile(mode='w+', encoding='utf8') as outfile: 
   ...:     outfile.write('foo') 
   ...:     outfile.seek(0) 
   ...:     pd.read_csv(outfile, encoding='utf8') 

Not quite so lucky on these:

In [4]: with tempfile.TemporaryFile(mode='w+', encoding='utf16') as outfile: 
   ...:     outfile.write('foo') 
   ...:     outfile.seek(0) 
   ...:     pd.read_csv(outfile, encoding='utf16') 
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x6f in position 2: truncated data

In [4]: with tempfile.TemporaryFile(mode='w+', encoding='utf32') as outfile: 
   ...:     outfile.write('foo') 
   ...:     outfile.seek(0) 
   ...:     pd.read_csv(outfile, encoding='utf32') 
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-2: truncated data

I believe this is strictly a problem with the C parser.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b78aa8d
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+1223.gb78aa8d85
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.2
lxml.etree: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.2.0

@WillAyd WillAyd added the IO CSV read_csv, to_csv label Dec 6, 2018
@gfyoung
Member

gfyoung commented Jan 5, 2020

This difference still persists on master for the C engine.

I should point out that both examples work (with either engine) if you don't pass the encoding parameter to read_csv: passing the encoding argument to TemporaryFile causes the file to be read as plain text, since the decoding is done automatically.
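To illustrate the point about automatic decoding, here is a minimal standard-library sketch. The helper names `roundtrip_text` and `decode_twice` are hypothetical and not part of pandas. The second helper models one plausible reading of the tracebacks above (already-decoded text being decoded a second time); it is an assumption, not a description of what the C parser actually does internally.

```python
import tempfile


def roundtrip_text(encoding):
    """Write 'foo' through a text-mode temporary file and read it back.

    Because the file is opened in text mode with an encoding, the handle
    decodes the bytes itself and read() returns a str.
    """
    with tempfile.TemporaryFile(mode="w+", encoding=encoding) as handle:
        handle.write("foo")
        handle.seek(0)
        return handle.read()


def decode_twice(text, codec):
    """Hypothetical model: take already-decoded text, turn it into UTF-8
    bytes, and decode those bytes again with `codec`."""
    return text.encode("utf-8").decode(codec)


# The handle yields the same decoded text for every encoding.
for enc in ("utf8", "utf16", "utf32"):
    assert roundtrip_text(enc) == "foo"

# Decoding a second time is harmless for UTF-8 ...
assert decode_twice("foo", "utf-8") == "foo"

# ... but fails for the wider encodings, since three bytes are not a
# whole number of 2-byte or 4-byte code units.
for codec in ("utf-16-le", "utf-32-le"):
    try:
        decode_twice("foo", codec)
    except UnicodeDecodeError as exc:
        print(f"{codec}: {exc.reason}")
```

If this reading is right, it would explain why dropping the encoding argument from read_csv avoids the error: the handle has already produced decoded text, so nothing is left to decode.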

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 7, 2020
And by utf-16, we mean the string "utf-16"

Closes pandas-dev#24130
@gfyoung gfyoung added the Bug label Jan 7, 2020
@jreback jreback added this to the 1.0 milestone Jan 7, 2020
jreback pushed a commit that referenced this issue Jan 7, 2020
And by utf-16, we mean the string "utf-16"

Closes #24130
gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 8, 2020
And by utf-16, we mean the string "utf-16"

Closes pandas-dev#24130
3 participants