Skip to content

REGR: errors='replace' when encoding/errors are not specified #38997

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 7, 2021
Merged

REGR: errors='replace' when encoding/errors are not specified #38997

merged 1 commit into from
Jan 7, 2021

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Jan 6, 2021

Should 1.3 use errors='replace' when no encoding/errors are specified or use errors=None (strict)?

@phofl
Copy link
Member

phofl commented Jan 6, 2021

I would say replace only when encoding is Not set and inferred to utf-8.

This should go back to 1.2.x

@simonjayhawkins simonjayhawkins added the IO CSV read_csv, to_csv label Jan 6, 2021
@simonjayhawkins simonjayhawkins added this to the 1.2.1 milestone Jan 6, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests!

@twoertwein
Copy link
Member Author

Should a follow-up PR intentionally revert this PR for 1.3? Personally, I would want to know when read_csv fails to read certain characters (errors = None by default).

@phofl
Copy link
Member

phofl commented Jan 7, 2021

I don't think that content from skipped rows should matter and it should raise no error if the rest of the file is ok. Maybe you could add a test withouth skiprows, this raises on 1.1.5 which is good and should keep raising

@@ -553,8 +553,10 @@ def get_handle(
Returns the dataclass IOHandles
"""
# Windows does not default to utf-8. Set to utf-8 for a consistent behavior
encoding_not_specified = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead why don't you

encoding_passed, encoding = encoding, encoding or 'utf-i'

and then you can test encoding_passed is not None its the same as you have but i think a bit more natural

@twoertwein
Copy link
Member Author

I thought errors is exposed in read_csv. It isn't. In that case it makes sense to keep 'replace' as the default.

I don't think it is feasible to ignore encoding errors only for skipped rows. (We could read everything in binary mode but we would need to decode it to determine line endings.)

@phofl
Copy link
Member

phofl commented Jan 7, 2021

This errors parameter is Not exposed

@twoertwein
Copy link
Member Author

yes, it isn't exposed. Sorry, that is what I meant to say :)

@jreback jreback merged commit 89ddd8a into pandas-dev:master Jan 7, 2021
@jreback
Copy link
Contributor

jreback commented Jan 7, 2021

thanks @twoertwein

@jreback
Copy link
Contributor

jreback commented Jan 7, 2021

@meeseeksdev backport 1.2.x

@lumberbot-app
Copy link

lumberbot-app bot commented Jan 7, 2021

Something went wrong ... Please have a look at my logs.

@phofl
Copy link
Member

phofl commented Jan 7, 2021

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_csv raising when null bytes are in skipped rows
4 participants