Codec utf-16 aliases do not work in read_csv with c engine #13549

Closed
Anaphory opened this issue Jul 2, 2016 · 4 comments · Fixed by #14060
Labels
IO CSV (read_csv, to_csv), Unicode (Unicode strings)
Comments

@Anaphory

Anaphory commented Jul 2, 2016

Code Sample, a copy-pastable example if possible

import pandas

path = "test.csv"
pandas.DataFrame({"A": [0,1], "B": [2,3]}).to_csv(
    path, encoding="utf-16")

for encoding in ["utf-16","utf_16","UTF_16","UTF-16"]:
    try:
        pandas.io.parsers.read_csv(
            path,
            engine='c',
            encoding=encoding)
        print(encoding, "succeeded in c-pandas")
    except UnicodeDecodeError:
        print(encoding, "failed in c-pandas")

    try:
        pandas.io.parsers.read_csv(
            path,
            engine='python',
            encoding=encoding)
        print(encoding, "succeeded in pandas")
    except UnicodeDecodeError:
        print(encoding, "failed in pandas")

    try:
        with open(path, encoding=encoding) as file:
            file.read()
        print(encoding, "succeeded in open")
    except UnicodeDecodeError:
        print(encoding, "failed in open")

Expected Output

utf-16 succeeded in c-pandas
utf-16 succeeded in pandas
utf-16 succeeded in open
utf_16 succeeded in c-pandas
utf_16 succeeded in pandas
utf_16 succeeded in open
UTF_16 succeeded in c-pandas
UTF_16 succeeded in pandas
UTF_16 succeeded in open
UTF-16 succeeded in c-pandas
UTF-16 succeeded in pandas
UTF-16 succeeded in open

Actual Output

utf-16 succeeded in c-pandas
utf-16 succeeded in pandas
utf-16 succeeded in open
utf_16 failed in c-pandas
utf_16 succeeded in pandas
utf_16 succeeded in open
UTF_16 failed in c-pandas
UTF_16 succeeded in pandas
UTF_16 succeeded in open
UTF-16 failed in c-pandas
UTF-16 succeeded in pandas
UTF-16 succeeded in open

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.6.2-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.1
setuptools: 23.0.0
Cython: None
numpy: 1.11.0
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@Anaphory
Author

Anaphory commented Jul 2, 2016

Without the try/except blocks that let the script run through all test cases, the traceback is:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gereon/python/bugreport/test.py", line 11, in <module>
    engine='c',
  File "/usr/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/usr/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 904, in pandas.parser.TextReader._read_rows (pandas/parser.c:10022)
  File "pandas/parser.pyx", line 1011, in pandas.parser.TextReader._convert_column_data (pandas/parser.c:11397)
  File "pandas/parser.pyx", line 1071, in pandas.parser.TextReader._convert_tokens (pandas/parser.c:12302)
  File "pandas/parser.pyx", line 1157, in pandas.parser.TextReader._convert_with_dtype (pandas/parser.c:13740)
  File "pandas/parser.pyx", line 1173, in pandas.parser.TextReader._string_convert (pandas/parser.c:13950)
  File "pandas/parser.pyx", line 1460, in pandas.parser._string_box_decode (pandas/parser.c:19767)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x41 in position 0: truncated data
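
The error message itself can be reproduced without pandas. As a minimal sketch, assuming the C parser ends up handing a single ASCII byte such as 0x41 ("A", the first column name) to the UTF-16 decoder, the standard library fails for the same reason. The helper name `decode_reason` is hypothetical and only used for illustration; this does not show what the parser does internally.

```python
def decode_reason(raw: bytes, encoding: str) -> str:
    """Return the decoder's failure reason, or "ok" if decoding succeeds."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.reason
    return "ok"


# A lone ASCII byte is only half of a UTF-16 code unit, so decoding fails.
lone_byte = decode_reason(b"A", "utf-16")

# The same character encoded as UTF-16 (BOM plus two bytes) decodes fine.
full_unit = decode_reason("A".encode("utf-16"), "utf-16")

print(lone_byte, "|", full_unit)
```

If that reading is right, the failing spellings reach the decoder with data that was never treated as UTF-16 in the first place, which would explain why only the exact string "utf-16" works.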

@Anaphory
Author

Anaphory commented Jul 2, 2016

In case it is relevant:

7.2.3. Standard Encodings

[…]Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.

CPython implementation detail: Some common encodings can bypass the codecs lookup machinery […]: […] utf-16 […]
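
The aliasing described in that passage can be checked with the standard library. A small sketch: `codecs.lookup` accepts each spelling from the code sample above and reports one canonical codec name. The variable names are illustrative.

```python
import codecs

spellings = ["utf-16", "utf_16", "UTF_16", "UTF-16"]

# The lookup machinery treats case and hyphen/underscore differences
# as equivalent, so every spelling resolves to the same codec name.
canonical = {s: codecs.lookup(s).name for s in spellings}

for spelling, name in canonical.items():
    print(spelling, "->", name)
```

Since Python itself considers these spellings interchangeable, a reader that accepts one of them but rejects the others is behaving inconsistently with the codec registry.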

Anaphory changed the title from "Codec utf-16 synonyms do not work in read_csv with c engine" to "Codec utf-16 aliases do not work in read_csv with c engine" (Jul 2, 2016)
@Anaphory
Author

Anaphory commented Jul 2, 2016

The same failure occurs when the file is written with encoding="utf_16":

>>> pandas.DataFrame({"A": [0,1], "B": [2,3]}).to_csv(
        path, encoding="utf_16")
[…]
utf_16 failed in c-pandas
utf_16 succeeded in pandas
utf_16 succeeded in open

@jreback
Contributor

jreback commented Jul 3, 2016

Yeah, this has a hard-coded string definition of utf-16. It should be easy to make more general.

Want to take a crack at it?
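
One way to make the check more general, sketched here under assumptions: normalise the user-supplied encoding name before comparing it with the literal "utf-16". The function `normalize_encoding_name` is hypothetical and is not the pandas implementation; it only mirrors the approach named in the commit messages later in this thread (lowercase the name, substitute "-" for "_").

```python
from typing import Optional


def normalize_encoding_name(encoding: Optional[str]) -> Optional[str]:
    """Lowercase an encoding name and replace underscores with hyphens.

    Hypothetical helper for illustration; None is passed through unchanged.
    """
    if encoding is None:
        return None
    return encoding.replace("_", "-").lower()


names = ["utf-16", "utf_16", "UTF_16", "UTF-16", None]
normalized = [normalize_encoding_name(n) for n in names]
print(normalized)
```

A plain string substitution like this does not cover aliases that differ by more than case or separator, such as "utf16". Resolving the name with `codecs.lookup(encoding).name` would handle those as well, at the cost of raising `LookupError` for unknown names.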

jreback added the Difficulty Novice, Unicode (Unicode strings), and IO CSV (read_csv, to_csv) labels (Jul 3, 2016)
jreback modified the milestones: 0.19.0, Next Major Release (Jul 3, 2016)
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Jul 5, 2016
see issue pandas-dev#13549
read_csv with engine=c throws error when encoding=UTF_16
or when encoding has _ or caps
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Jul 6, 2016
read_csv with engine=c throws error when encoding=UTF_16
or when encoding has _ or uppercase
improved testing loops and added multibyte testing
see issue pandas-dev#13549
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Jul 6, 2016
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Jul 6, 2016
jreback modified the milestones: 0.18.2, Next Major Release (Jul 6, 2016)
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Jul 12, 2016
jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 (Aug 15, 2016)
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 15, 2016
see issue pandas-dev#13549
read_csv with engine=c throws error when encoding=UTF_16
or when encoding has _ or caps
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 15, 2016
read_csv with engine=c throws error when encoding=UTF_16
or when encoding has _ or uppercase
improved testing loops and added multibyte testing
see issue pandas-dev#13549
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 15, 2016
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 15, 2016
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 15, 2016
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 15, 2016
jorisvandenbossche modified the milestones: 0.19.0, Next Major Release (Aug 19, 2016)
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 19, 2016
change encoding to lowercase
sub - for _

see pandas-dev#13549
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 19, 2016
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 19, 2016
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 21, 2016
change encoding to lowercase
sub - for _
see pandas-dev#13549
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 21, 2016
test utf-16 and utf-8 caps/
-_ variants
see pandas-dev#13549
nateGeorge added a commit to nateGeorge/pandas that referenced this issue Aug 21, 2016