BUG: read_csv does not raise UnicodeDecodeError on non utf-8 characters #39450

Closed
2 of 3 tasks
DrGFreeman opened this issue Jan 28, 2021 · 11 comments · Fixed by #39777
Labels
Bug, IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label), Needs Discussion (Requires discussion from core team before further action)

Comments

@DrGFreeman
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.
    Bug exists in 1.2.1, not in <=1.2.0

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

from pathlib import Path

import pandas as pd

file = Path("non-utf8.csv")
file.write_bytes(b"\xe4\na\n1")  # non utf-8 character

df = pd.read_csv(file)

Output:

   �
0  a
1  1

Problem description

In pandas version 1.2.1, reading a csv file containing non utf-8 characters does not raise a UnicodeDecodeError. In version 1.2.0 and earlier, a UnicodeDecodeError is raised, allowing proper exception handling in application code.

Expected Output

Expect a UnicodeDecodeError to be raised.
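
For example, application code like the following relies on the exception being raised (an illustrative sketch, not the actual application code):

from pathlib import Path

import pandas as pd

file = Path("non-utf8.csv")
file.write_bytes(b"\xe4\na\n1")  # non utf-8 character

try:
    df = pd.read_csv(file)
except UnicodeDecodeError:
    # Reached on pandas <=1.2.0 (default c engine); on 1.2.1 the invalid
    # byte is silently replaced with U+FFFD and this branch never runs.
    print("input file is not valid UTF-8")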

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 9d598a5e1eee26df95b3910e3f2934890d062caa
python           : 3.8.5.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.17763
machine          : AMD64
processor        : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : French_Canada.1252

pandas           : 1.2.1
numpy            : 1.19.2
pytz             : 2020.5
dateutil         : 2.8.1
pip              : 20.3.3
setuptools       : 52.0.0.post20210125
Cython           : None
pytest           : 6.2.2
hypothesis       : None
sphinx           : 3.4.3
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : 0.8.3
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.2
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 0.15.1
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : 0.51.2
@DrGFreeman DrGFreeman added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Jan 28, 2021
@twoertwein
Member

I will have time to check this later: I would have expected that you didn't get an error in <1.2 and >=1.2.1 (#38989), but you get an error only for 1.2.0.

@phofl
Member

phofl commented Jan 28, 2021

Yeah would have expected the same, but this is raising for me too on 1.1.5

@DrGFreeman
Contributor Author

DrGFreeman commented Jan 28, 2021

It is my understanding that an exception should be raised for non utf-8 characters (except for characters in skipped rows, ref. #38989).

I used many versions between 0.23 and 1.2.0 (inclusive) and the exception was always raised.

When I updated to 1.2.1, the exception stopped being raised which I noticed from failed unit tests in my application (testing that the exception is raised and handled properly).
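
A hypothetical test along those lines (illustrative only, not the actual test from the application; uses pytest's tmp_path fixture for the temporary file):

import pandas as pd
import pytest

def test_read_csv_raises_on_non_utf8(tmp_path):
    # Passed on pandas <=1.2.0 with the default (c) engine; fails on 1.2.1
    # because the invalid byte is replaced instead of raising.
    file = tmp_path / "non-utf8.csv"
    file.write_bytes(b"\xe4\na\n1")
    with pytest.raises(UnicodeDecodeError):
        pd.read_csv(file)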

@twoertwein
Member

I will try to understand the pre 1.2 behavior. I assume that #38997 overshot: restored some old behavior (#38989) but also introduced some new behavior (this issue).

@twoertwein
Member

@DrGFreeman can you confirm that on 1.1.5 you get the error only when using the "c" engine? When I run your example on 1.1.5 with the python engine, I do not get an error. On 1.2.1 your example should fail with both engines, if you explicitly specify encoding when calling read_csv.

I made changes in 1.2.0 to share more file opening code between the c and python engine. The good news is that they raise errors in a more consistent way :)

Is there a way to fix this issue and to keep #38989 fixed?

In case there is no good solution for 1.2.x: What would be the best solution for >=1.3? Default to errors='strict' but expose errors in read_csv and read_json (and any other text-reading function)?
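
For illustration, the difference boils down to the errors argument the file is opened with (plain Python, independent of pandas internals):

from pathlib import Path

Path("non-utf8.csv").write_bytes(b"\xe4\na\n1")  # invalid as UTF-8

# errors="strict" (open()'s default): raises on the invalid byte.
try:
    print(open("non-utf8.csv", encoding="utf-8", errors="strict").read())
except UnicodeDecodeError as exc:
    print("strict:", exc)

# errors="replace": the invalid byte becomes U+FFFD and no exception is raised.
print("replace:", open("non-utf8.csv", encoding="utf-8", errors="replace").read())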

@DrGFreeman
Contributor Author

@DrGFreeman can you confirm that on 1.1.5 you get the error only when using the "c" engine? When I run your example on 1.1.5 with the python engine, I do not get an error. On 1.2.1 your example should fail with both engines, if you explicitly specify encoding when calling read_csv.

@twoertwein, below are the results I get with versions 1.1.5 and 1.2.1 and different parameters:

pandas==1.1.5

pd.read_csv(file) : UnicodeDecodeError raised
pd.read_csv(file, engine="c") : UnicodeDecodeError raised
pd.read_csv(file, engine="c", encoding="utf-8") : UnicodeDecodeError raised
pd.read_csv(file, engine="python") : No exception raised
pd.read_csv(file, engine="python", encoding="utf-8") : UnicodeDecodeErrorraised

pandas==1.2.1

pd.read_csv(file) : No exception raised
pd.read_csv(file, engine="c") : No exception raised
pd.read_csv(file, engine="c", encoding="utf-8") : UnicodeDecodeError raised
pd.read_csv(file, engine="python") : No exception raised
pd.read_csv(file, engine="python", encoding="utf-8") : UnicodeDecodeErrorraised

@twoertwein
Member

To summarize:

  1. In all versions/engines: When an encoding is specified, errors="strict"
  2. There is one inconsistency between the c and python engine when no encoding is specified in versions <1.2.0:
    a. c engine raised an error for encoding errors, python engine did not
  3. In <1.2.0, skipping lines ignored errors only when no encoding is specified
  4. In 1.2.0: the c and python engines use the file-opening code from the python engine and mistakenly apply errors="strict" even when no encoding is specified
  5. In 1.2.1: errors="replace" is used when no encoding is specified

If we assume the python behavior is "correct", everything works as expected in 1.2.1. If we assume that the c behavior (default engine!) is "correct", then we need to implement code to 1) always set errors="strict" but 2) use errors="replace" for skipped rows.

I think the only way to implement the c behavior is to follow this stackoverflow answer:

Note that decoding takes place per buffered block of data, not per textual line. If you must detect errors on a line-by-line basis, use the surrogateescape handler and test each line read for codepoints in the surrogate range:
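
A rough sketch of that line-by-line approach in plain Python (based on the quoted answer, not on existing pandas code):

def line_has_decode_error(line: str) -> bool:
    # surrogateescape maps undecodable bytes into U+DC80-U+DCFF, so their
    # presence means the original bytes were not valid UTF-8.
    return any("\udc80" <= ch <= "\udcff" for ch in line)

with open("non-utf8.csv", encoding="utf-8", errors="surrogateescape") as f:
    for lineno, line in enumerate(f, start=1):
        if line_has_decode_error(line):
            print(f"line {lineno} contains bytes that are not valid UTF-8")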

@phofl
Member

phofl commented Jan 29, 2021

I think exposing the errors keyword and setting the default to strict would make sense? But I think it would also be reasonable to raise only if an encoding is specified

@twoertwein twoertwein added the IO CSV (read_csv, to_csv), IO Data (IO issues that don't fit into a more specific label), and Needs Discussion (Requires discussion from core team before further action) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Jan 29, 2021
@twoertwein
Member

Do the following points sound like a good plan? I assume we cannot introduce a new argument in 1.2.x.

  1. for 1.2.2: update "encoding" documentation for read_csv to make clear that files will be opened with errors="replace" unless an encoding is specified
  2. for 1.3.0: all read_* that read text (are there more than read_csv and read_json?) will have a new errors argument that defaults to "strict" (it is probably a bad idea to push errors into storage_options)

@jreback
Contributor

jreback commented Feb 2, 2021

Do the following points sound like a good plan? I assume we cannot introduce a new argument in 1.2.x.

  1. for 1.2.2: update "encoding" documentation for read_csv to make clear that files will be opened with errors="replace" unless an encoding is specified
  2. for 1.3.0: all read_* that read text (are there more than read_csv and read_json?) will have a new errors argument that defaults to "strict" (it is probably a bad idea to push errors into storage_options)

+1 to the plan

@phofl
Member

phofl commented Feb 2, 2021

Sounds good to me too
