
ENH: Add encoding errors option in pandas.read_csv #39017


Closed
davidleejy opened this issue Jan 7, 2021 · 3 comments · Fixed by #39777
Labels
Enhancement IO CSV read_csv, to_csv

@davidleejy

Related to problem:

df.to_csv('abc.csv', errors='surrogatepass')   # saving works fine.

# Try to load:

# Attempt 1:
pd.read_csv('abc.csv') 
# Fails.    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30682-30685: surrogates not allowed

# Attempt 2:
pd.read_csv('abc.csv', errors='surrogatepass') 
# Fails. No `errors` parameter.

# Attempt 3:
with open('abc.csv', errors='surrogatepass') as _file:
    df = pd.read_csv(_file)
# Fails.    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30682-30685: surrogates not allowed

Describe the solution you'd like

Recently, we added errors as a function parameter to to_csv in this merged PR. Can we do the same for read_csv? This solution would make Attempt 2 work.

(Not sure why Attempt 3 doesn't work, since read_csv accepts a file handle object.)
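
One possible explanation for the Attempt 3 failure, offered as an assumption rather than something confirmed in this thread: a handle opened with errors='surrogatepass' yields a str that still contains lone surrogates, and any later strict re-encoding of that text to UTF-8 fails with the same "surrogates not allowed" message. The sketch below uses only the standard library (no pandas), and the file contents are made up for illustration.

```python
import os
import tempfile

# Illustrative file contents: a header plus a lone surrogate (U+D83D)
# stored as the three bytes that errors="surrogatepass" produces for UTF-8.
raw = b"col\n\xed\xa0\xbd\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abc.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # Decoding succeeds because the handle was opened with surrogatepass.
    with open(path, encoding="utf-8", errors="surrogatepass", newline="") as f:
        text = f.read()

has_surrogate = "\ud83d" in text

# A strict re-encode of the decoded text fails, as in Attempts 1 and 3.
try:
    text.encode("utf-8")
    reencode_error = None
except UnicodeEncodeError as exc:
    reencode_error = exc.reason

print(has_surrogate, reencode_error)
```

If this is what happens inside the parser, the errors setting on the handle alone cannot help, which is consistent with asking for a dedicated option on read_csv.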

API breaking implications

Should not break.

Describe alternatives you've considered

See (futile) Attempt 3 above.

Additional context

Section "Error handlers" in https://docs.python.org/3/library/codecs.html says:

[Screenshot of the "Error handlers" section, taken 2021-01-07]

Example of encoding & decoding:

x="\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
print(x)
# prints 🙏

davidleejy added the Enhancement and Needs Triage labels on Jan 7, 2021
simonjayhawkins added the IO CSV (read_csv, to_csv) label and removed the Needs Triage label on Jan 26, 2021

@phofl
Member

phofl commented Feb 19, 2021

@twoertwein I think you are currently addressing this?

@twoertwein
Member

Yes, #39777 will add support for that. I will rebase the PR and also add the example from this issue as a test case.

@twoertwein
Member

twoertwein commented Feb 19, 2021

@davidleejy do you have an example that needs errors='surrogatepass' for decode?

Your example x="\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16') needs errors='surrogatepass' during encode, which is not what read_csv does internally: read_csv decodes.

edit: I found a different example
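
The example referred to in the edit is not shown in this thread. As an illustration only (not necessarily the case that was found), a lone surrogate written as UTF-8 with surrogatepass does need the handler on decode. In the UTF-16 example above, by contrast, the two surrogates form a valid pair once encoded, so a strict decode succeeds.

```python
# Illustrative data: a single unpaired surrogate, encoded as UTF-8.
raw = "\ud83d".encode("utf-8", "surrogatepass")

# A strict decode rejects the surrogate bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Decoding with the same handler restores the original code point.
decoded = raw.decode("utf-8", "surrogatepass")

print(raw, strict_ok, decoded == "\ud83d")
```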

5 participants