BUG: read_csv does not raise UnicodeDecodeError on non utf-8 characters #39450

Closed
2 of 3 tasks
DrGFreeman opened this issue Jan 28, 2021 · 11 comments · Fixed by #39777
Labels
Bug, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label), Needs Discussion (Requires discussion from core team before further action)

Comments

@DrGFreeman
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.
    Bug exists in 1.2.1, not in <=1.2.0

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from pathlib import Path

import pandas as pd

file = Path("non-utf8.csv")
file.write_bytes(b"\xe4\na\n1")  # non utf-8 character

df = pd.read_csv(file)

Output:

   �
0  a
1  1

Problem description

In pandas version 1.2.1, reading a csv file containing non utf-8 characters does not raise a UnicodeDecodeError. In version 1.2.0 and earlier, a UnicodeDecodeError is raised, allowing proper exception handling in application code.

Expected Output

Expect a UnicodeDecodeError to be raised.
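
For example, application code like the following relies on the exception being raised (an illustrative sketch, not the actual application code):

from pathlib import Path

import pandas as pd

file = Path("non-utf8.csv")
file.write_bytes(b"\xe4\na\n1")  # non utf-8 character

try:
    df = pd.read_csv(file)
except UnicodeDecodeError:
    # Reached on pandas <=1.2.0 (default c engine); on 1.2.1 the invalid
    # byte is silently replaced with U+FFFD and this branch never runs.
    print("input file is not valid UTF-8")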

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 9d598a5e1eee26df95b3910e3f2934890d062caa
python           : 3.8.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : French_Canada.1252

pandas           : 1.2.1
numpy            : 1.19.2
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 52.0.0.post20210125
Cython           : None
pytest           : 6.2.2
hypothesis       : None
sphinx           : 3.4.3
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.8.3
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.1
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.51.2
@DrGFreeman DrGFreeman added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Jan 28, 2021
@twoertwein
Member

I will have time to check this later: I would have expected that you didn't get an error in <1.2 and >=1.2.1 (#38989), but you get an error only for 1.2.0.

@phofl
Member

phofl commented Jan 28, 2021

Yeah would have expected the same, but this is raising for me too on 1.1.5

@DrGFreeman
Contributor Author

DrGFreeman commented Jan 28, 2021

It is my understanding that an exception should be raised for non utf-8 characters (except for characters in skipped rows, ref. #38989).

I used many versions between 0.23 and 1.2.0 (inclusive) and the exception was always raised.

When I updated to 1.2.1, the exception stopped being raised which I noticed from failed unit tests in my application (testing that the exception is raised and handled properly).
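
A hypothetical test along those lines (illustrative only, not the actual test from the application; uses pytest's tmp_path fixture for the temporary file):

import pandas as pd
import pytest

def test_read_csv_raises_on_non_utf8(tmp_path):
    # Passed on pandas <=1.2.0 with the default (c) engine; fails on 1.2.1
    # because the invalid byte is replaced instead of raising.
    file = tmp_path / "non-utf8.csv"
    file.write_bytes(b"\xe4\na\n1")
    with pytest.raises(UnicodeDecodeError):
        pd.read_csv(file)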

@twoertwein
Member

I will try to understand the pre 1.2 behavior. I assume that #38997 overshot: restored some old behavior (#38989) but also introduced some new behavior (this issue).

@twoertwein
Member

@DrGFreeman can you confirm that on 1.1.5 you get the error only when using the "c" engine? When I run your example on 1.1.5 with the python engine, I do not get an error. On 1.2.1 your example should fail with both engines, if you explicitly specify encoding when calling read_csv.

I made changes in 1.2.0 to share more file opening code between the c and python engine. The good news is that they raise errors in a more consistent way :)

Is there a way to fix this issue and to keep #38989 fixed?

In case there is no good solution for 1.2.x: What would be the best solution for >=1.3? Default to errors='strict' but expose errors in read_csv and read_json (and any other text-reading function)?
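
For illustration, the difference boils down to the errors argument the file is opened with (plain Python, independent of pandas internals):

from pathlib import Path

Path("non-utf8.csv").write_bytes(b"\xe4\na\n1")  # invalid as UTF-8

# errors="strict" (open()'s default): raises on the invalid byte.
try:
    print(open("non-utf8.csv", encoding="utf-8", errors="strict").read())
except UnicodeDecodeError as exc:
    print("strict:", exc)

# errors="replace": the invalid byte becomes U+FFFD and no exception is raised.
print("replace:", open("non-utf8.csv", encoding="utf-8", errors="replace").read())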

@DrGFreeman
Contributor Author

@DrGFreeman can you confirm that on 1.1.5 you get the error only when using the "c" engine? When I run your example on 1.1.5 with the python engine, I do not get an error. On 1.2.1 your example should fail with both engines, if you explicitly specify encoding when calling read_csv.

@twoertwein, below are the results I get with versions 1.1.5 and 1.2.1 and different parameters:

pandas==1.1.5

pd.read_csv(file) : UnicodeDecodeError raised
pd.read_csv(file, engine="c") : UnicodeDecodeError raised
pd.read_csv(file, engine="c", encoding="utf-8") : UnicodeDecodeError raised
pd.read_csv(file, engine="python") : No exception raised
pd.read_csv(file, engine="python", encoding="utf-8") : UnicodeDecodeErrorraised

pandas==1.2.1

pd.read_csv(file) : No exception raised
pd.read_csv(file, engine="c") : No exception raised
pd.read_csv(file, engine="c", encoding="utf-8") : UnicodeDecodeError raised
pd.read_csv(file, engine="python") : No exception raised
pd.read_csv(file, engine="python", encoding="utf-8") : UnicodeDecodeErrorraised

@twoertwein
Member

To summarize:

  1. In all versions/engines: When an encoding is specified, errors="strict"
  2. There is one inconsistency between the c and python engine when no encoding is specified in versions <1.2.0:
    a. c engine raised an error for encoding errors, python engine did not
  3. In <1.2.0, skipping lines ignored errors only when no encoding is specified
  4. In 1.2.0: the c and python engines use the file-opening code from the python engine and mistakenly apply errors="strict" even when no encoding is specified
  5. In 1.2.1: errors="replace" is used when no encoding is specified

If we assume the python behavior is "correct", everything works as expected in 1.2.1. If we assume that the c behavior (default engine!) is "correct", then we need to implement code to 1) always set errors="strict" but 2) use errors="replace" for skipped rows.

I think the only way to implement the c behavior is to follow this stackoverflow answer:

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:
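
A rough sketch of that line-by-line approach in plain Python (based on the quoted answer, not on existing pandas code):

def line_has_decode_error(line: str) -> bool:
    # surrogateescape maps undecodable bytes into U+DC80-U+DCFF, so their
    # presence means the original bytes were not valid UTF-8.
    return any("\udc80" <= ch <= "\udcff" for ch in line)

with open("non-utf8.csv", encoding="utf-8", errors="surrogateescape") as f:
    for lineno, line in enumerate(f, start=1):
        if line_has_decode_error(line):
            print(f"line {lineno} contains bytes that are not valid UTF-8")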

@phofl
Member

phofl commented Jan 29, 2021

I think exposing the errors keyword and setting the default to strict would make sense? But I think it would also be reasonable to raise only if an encoding is specified

@twoertwein twoertwein added the IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label), and Needs Discussion (Requires discussion from core team before further action) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Jan 29, 2021
@twoertwein
Member

Do the following points sound like a good plan? I assume we cannot introduce a new argument in 1.2.x.

  1. for 1.2.2: update "encoding" documentation for read_csv to make clear that files will be opened with errors="replace" unless an encoding is specified
  2. for 1.3.0: all read_* that read text (are there more than read_csv and read_json?) will have a new errors argument that defaults to "strict" (it is probably a bad idea to push errors into storage_options)

@jreback
Contributor

jreback commented Feb 2, 2021

Do the following points sound like a good plan? I assume we cannot introduce a new argument in 1.2.x.

  1. for 1.2.2: update "encoding" documentation for read_csv to make clear that files will be opened with errors="replace" unless an encoding is specified
  2. for 1.3.0: all read_* that read text (are there more than read_csv and read_json?) will have a new errors argument that defaults to "strict" (it is probably a bad idea to push errors into storage_options)

+1 to the plan

@phofl
Member

phofl commented Feb 2, 2021

Sounds good to me too
