BUG: read_csv does not raise UnicodeDecodeError on non utf-8 characters #39450
Comments
I will have time to check this later: I would have expected that you didn't get an error in <1.2 and >=1.2.1 (#38989), but you get an error only for 1.2.0.
Yeah, I would have expected the same, but this is raising for me too on 1.1.5.
It is my understanding that there should be an exception raised for non-unicode characters (except if in I used many versions between 0.23 and 1.2.0 (inclusive) and the exception was always raised. When I updated to 1.2.1, the exception stopped being raised, which I noticed from failed unit tests in my application (testing that the exception is raised and handled properly).
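For readers who want to see the kind of application-level handling described above, here is a minimal sketch using only the standard library. The helper name decode_csv_bytes, the fallback encoding list, and the sample bytes are all hypothetical and not taken from the reporter's application; the same try/except pattern would wrap a read_csv call on versions where the exception is raised.

```python
def decode_csv_bytes(raw, encodings=("utf-8", "latin-1")):
    """Try each encoding in order and return (text, encoding_used).

    Hypothetical helper: relies on UnicodeDecodeError being raised
    for bytes that are not valid in the attempted encoding.
    """
    last_error = None
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError as exc:
            # Remember the failure and fall through to the next candidate.
            last_error = exc
    if last_error is None:
        raise ValueError("no encodings were supplied")
    raise last_error


# Hypothetical sample: "café" encoded as Latin-1, so byte 0xE9 is not valid UTF-8.
text, used = decode_csv_bytes(b"name\ncaf\xe9\n")
```

If the decode step silently succeeded on invalid bytes, the fallback branch would never run, which is why the change in behavior shows up as a failing test.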
@DrGFreeman can you confirm that on 1.1.5 you get the error only when using the "c" engine? When I run your example on 1.1.5 with the python engine, I do not get an error. On 1.2.1 your example should fail with both engines, if you explicitly specify I made changes in 1.2.0 to share more file opening code between the c and python engine. The good news is that they raise errors in a more consistent way :) Is there a way to fix this issue and to keep #38989 fixed? In case there is no good solution for 1.2.x: What would be the best solution for >=1.3? Default to
@twoertwein, below are the results I get with versions 1.1.5 and 1.2.1 and different parameters: pandas==1.1.5
pandas==1.2.1
To summarize:
If we assume the python behavior is "correct", everything works as expected in 1.2.1. If we assume that the c behavior (default engine!) is "correct", then we need to implement code to 1) always set I think the only way to implement the c behavior is to follow this stackoverflow answer:
I think exposing the errors keyword would make sense and set the default to strict? But I think it would also be reasonable to raise only if encoding is specified.
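To make the "strict" default discussed here concrete, the sketch below shows how Python's built-in open() behaves with errors="strict" versus errors="replace" on the same file. It uses only the standard library and does not show pandas internals; the file name sample.csv and the sample bytes are assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical sample: "café" encoded as Latin-1, so byte 0xE9 is not valid UTF-8.
raw = b"name\ncaf\xe9\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as fh:
        fh.write(raw)

    # errors="strict": the invalid byte makes decoding fail loudly.
    try:
        with open(path, encoding="utf-8", errors="strict") as fh:
            fh.read()
        strict_raised = False
    except UnicodeDecodeError:
        strict_raised = True

    # errors="replace": the invalid byte becomes U+FFFD and nothing is raised.
    with open(path, encoding="utf-8", errors="replace") as fh:
        replaced = fh.read()
```

With "strict" the caller gets a UnicodeDecodeError to handle; with "replace" the data loads but the offending byte is silently swapped for the replacement character.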
Do the following points sound like a good plan? I assume we cannot introduce a new argument in 1.2.x.
+1 to the plan
Sounds good to me too
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
Bug exists in 1.2.1, not in <=1.2.0
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Output:
Problem description
In pandas version 1.2.1, reading a csv file containing non utf-8 characters does not raise a UnicodeDecodeError. In version 1.2.0 and earlier, a UnicodeDecodeError is raised, allowing proper exception handling in application code.
Expected Output
Expect a UnicodeDecodeError to be raised.
Output of pd.show_versions()