Skip to content

EHN: Add encoding_errors option in pandas.DataFrame.to_csv (#27750) #27899

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

shigemk2
Copy link

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Can you add a release note to doc/source/whatsnew/v1.0.0? Under enhancements.

@pep8speaks
Copy link

pep8speaks commented Aug 15, 2019

Hello @shigemk2! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-02 08:56:00 UTC

@shigemk2 shigemk2 changed the title Add errors option in pandas.DataFrame.to_csv EHN: Add errors option in pandas.DataFrame.to_csv (#27750) Aug 15, 2019
@jreback jreback added Enhancement IO CSV read_csv, to_csv labels Aug 15, 2019
@shigemk2 shigemk2 changed the title EHN: Add errors option in pandas.DataFrame.to_csv (#27750) EHN: Add encoding_errors option in pandas.DataFrame.to_csv (#27750) Aug 15, 2019
@shigemk2 shigemk2 force-pushed the to_csv_error branch 2 times, most recently from ff42832 to ea27b7f Compare August 15, 2019 13:55
@shigemk2
Copy link
Author

@TomAugspurger @jreback
Please review my PR.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very thorough on the tests :)

Do you have time to add the new parameter to user_guide/io.rst? We repeat most of the parameters there, since there are so many. You can place it near encoding (~line 331). You likely just copy-paste the docstring description.

Can you also make it clear that this is the errors argument passed to :func:\open\?

@shigemk2
Copy link
Author

@TomAugspurger

Can you also make it clear that this is the errors argument passed to :func:\open?

What do you mean?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 19, 2019 via email

@shigemk2
Copy link
Author

@TomAugspurger
Copy link
Contributor

@shigemk2
Copy link
Author

@shigemk2
Copy link
Author

@TomAugspurger @jreback
Please review my PR again.

@WillAyd
Copy link
Member

WillAyd commented Aug 23, 2019

Late to review but would we rather just document how to do this with the underlying file handle? Personally against adding new keywords to the read_* functions if they don't add functionality that can't already be done with the underlying fp object

@TomAugspurger
Copy link
Contributor

@WillAyd that was my initial reaction too. The thing that changed my mind was that we already accept the encoding and mode arguments.

Though I didn't realize open had so many arguments. With this change, we would (I think) support 4/8 of them.

@WillAyd
Copy link
Member

WillAyd commented Aug 23, 2019

Yea certainly a grey area. If we are only at 4/8 I think we would ideally either commit to all of them or start pushing people to passing through directly to the file constructor.

The latter of the two seems more reasonable though hence why I think documentation would be better (but not a hang up if you and other core devs feel strongly otherwise)

@shigemk2
Copy link
Author

According to https://docs.python.org/3/library/functions.html#open, current(0.25.x) pandas already supports 5/8 open's aguments in _get_handle function.

  • file
  • mode
  • buffering
  • encoding
  • errors
  • newline
  • closefd
  • opener

https://github.com/pandas-dev/pandas/blob/0.25.x/pandas/io/common.py#L396-L405

pandas/pandas/io/common.py

Lines 418 to 429 in 3bbb6f3

elif is_path:
if encoding:
# Encoding
f = open(
path_or_buf, mode, errors=encoding_errors, encoding=encoding, newline=""
)
elif is_text:
# No explicit encoding
f = open(path_or_buf, mode, errors=encoding_errors, newline="")
else:
# Binary mode
f = open(path_or_buf, mode)

So, I think we should support errors parameter because _get_handle already uses errors argument.

@shigemk2
Copy link
Author

shigemk2 commented Sep 2, 2019

@WillAyd @TomAugspurger
How do you think about my pull-request?

…v#27750)

encoding_errors : str, default 'strict'
Behavior when the input string can’t be converted according to
the encoding’s rules (strict, ignore, replace, etc.)
See: https://docs.python.org/3/library/codecs.html#codec-base-classes
@TomAugspurger
Copy link
Contributor

I'm leaning towards accepting the PR. @WillAyd does that sound OK?

@WillAyd
Copy link
Member

WillAyd commented Sep 3, 2019

While I think this PR in a nutshell is well done I think I'm still against the principal of adding this. For consistency wouldn't we then want to add to read_csv and other text based methods? Also do we see any concern with the fact that errors is the keyword argument to open but we now have encoding_errors in our API?

I just think to comprehensively support this would take a lot more effort than its worth so would rather just document how to do it instead of semi-supporting it via this method

@TomAugspurger
Copy link
Contributor

Thanks @WillAyd. All fair points.

In the past, we've come up against issues where we want to pass additional arguments through to the underlying filesystem calls. I wonder
if we can come up with a generic API here.

df.to_csv("foo.csv", storage_options={"mode", "w", "errors": "replace"})

Thoughts on that kind of API? It's a bit fuzzy on where exactly storage_options would go, but thoughts on the general approach?

@WillAyd
Copy link
Member

WillAyd commented Sep 3, 2019

Maybe we just add kwargs and dictate that kwargs get passed through to the filepath_or_buffer argument? Not sure if that has been discussed in the past but seems natural

@TomAugspurger
Copy link
Contributor

Not sure if it's been discussed before. It's certainly possible. Which do you think is easier to document, storage_options or **kwargs? IMO, they're about the same...

Do any of our readers / writers already accept **kwargs?

@shigemk2
Copy link
Author

shigemk2 commented Sep 5, 2019

@WillAyd @TomAugspurger
IMO, we should talk about only to_csv in order not to go off the rails.
Of course, we have to talk about raed_csv (and read_excel) error handling.
But my PR does not have any impact on read_csv.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Sep 5, 2019 via email

@jreback
Copy link
Contributor

jreback commented Sep 5, 2019

I like @TomAugspurger suggestion of a dict of storage_options, composing keywords for encoding, errors, newline I think. would require some deprecation / translation period though. We are already doing this for compression options: #26024.

We could accept a dict (simpler), or make a named tuple StorageOptions (and change the compression above to be the same).

@WillAyd
Copy link
Member

WillAyd commented Sep 5, 2019

So as far as this PR goes I would suggest going to the documentation route, showing how you can pass this parameter to a file handle prior to a read_csv call. Then we can open up a follow up issue for the larger effort of storage_options generically in our IO.

Does that sound good @shigemk2 ?

@shigemk2
Copy link
Author

shigemk2 commented Sep 8, 2019

@WillAyd
I do not know about how to handle read_csv's error, so I want to close this PR.

@TomAugspurger
Copy link
Contributor

OK. I opened #28377 as a followup.

Sorry this didn't end up making it in @shigemk2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suppress UnicodeEncodeError when executing to_csv method
5 participants