EHN: Add encoding_errors option in pandas.DataFrame.to_csv (#27750) #27899

shigemk2 · 2019-08-13T11:39:49Z

closes Suppress UnicodeEncodeError when executing to_csv method #27750
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

TomAugspurger

Thanks. Can you add a release note to doc/source/whatsnew/v1.0.0? Under enhancements.

pandas/core/generic.py

pandas/tests/io/formats/test_to_csv.py

pep8speaks · 2019-08-15T10:18:58Z

Hello @shigemk2! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-09-02 08:56:00 UTC

doc/source/whatsnew/v1.0.0.rst

pandas/core/generic.py

shigemk2 · 2019-08-19T04:52:46Z

@TomAugspurger @jreback
Please review my PR.

TomAugspurger

Very thorough on the tests :)

Do you have time to add the new parameter to user_guide/io.rst? We repeat most of the parameters there, since there are so many. You can place it near encoding (~line 331). You can likely just copy-paste the docstring description.

Can you also make it clear that this is the errors argument passed to :func:`open`?

pandas/core/generic.py

pandas/io/common.py

pandas/tests/io/formats/test_to_csv.py

shigemk2 · 2019-08-19T12:00:44Z

@TomAugspurger

Can you also make it clear that this is the errors argument passed to :func:`open`?

What do you mean?

TomAugspurger · 2019-08-19T12:39:19Z

Include that explicit reference in the docstring.


shigemk2 · 2019-08-19T13:07:52Z

@TomAugspurger

Do you mean to add to_csv's example in https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#writing-out-data ?

TomAugspurger · 2019-08-19T13:12:45Z

I mean under https://dev.pandas.io/user_guide/io.html#quoting-compression-and-file-format

shigemk2 · 2019-08-19T13:23:24Z

I added documentation for the errors option in both https://dev.pandas.io/user_guide/io.html#quoting-compression-and-file-format and https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#writing-out-data .

doc/source/user_guide/io.rst

shigemk2 · 2019-08-22T03:55:11Z

@TomAugspurger @jreback
Please review my PR again.

WillAyd · 2019-08-23T13:22:02Z

Late to review but would we rather just document how to do this with the underlying file handle? Personally against adding new keywords to the read_* functions if they don't add functionality that can't already be done with the underlying fp object

TomAugspurger · 2019-08-23T13:52:52Z

@WillAyd that was my initial reaction too. The thing that changed my mind was that we already accept the encoding and mode arguments.

Though I didn't realize open had so many arguments. With this change, we would (I think) support 4/8 of them.
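For reference, the parameters of the built-in open can be listed with inspect. This is only a quick check against the standard library and says nothing about which of them pandas forwards.

```python
import inspect

# Collect the parameter names of the built-in open() in declaration order.
open_params = list(inspect.signature(open).parameters)

print(len(open_params))
print(open_params)
```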

WillAyd · 2019-08-23T13:55:42Z

Yea certainly a grey area. If we are only at 4/8 I think we would ideally either commit to all of them or start pushing people to passing through directly to the file constructor.

The latter of the two seems more reasonable though hence why I think documentation would be better (but not a hang up if you and other core devs feel strongly otherwise)
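A minimal sketch of the file-handle route discussed here, assuming the caller opens the file and chooses errors themselves. It uses the standard-library csv writer as a stand-in so that it runs without pandas; the sample rows and file name are made up for illustration.

```python
import csv
import os
import tempfile

rows = [["name"], ["café"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.csv")

    # errors="replace" substitutes "?" for characters ASCII cannot encode.
    # newline="" leaves line endings to the CSV writer.
    with open(path, "w", encoding="ascii", errors="replace", newline="") as handle:
        csv.writer(handle).writerows(rows)
        # With pandas, the equivalent call would be df.to_csv(handle).

    with open(path, encoding="ascii", newline="") as handle:
        written = handle.read()

print(repr(written))
```

Because the handle is opened by the caller, no new keyword is needed on the writer side.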

shigemk2 · 2019-08-23T14:35:35Z

According to https://docs.python.org/3/library/functions.html#open, current (0.25.x) pandas already supports 5 of the 8 arguments of open in the _get_handle function.

https://github.com/pandas-dev/pandas/blob/0.25.x/pandas/io/common.py#L396-L405

pandas/pandas/io/common.py

Lines 418 to 429 in 3bbb6f3

    elif is_path:
        if encoding:
            # Encoding
            f = open(
                path_or_buf, mode, errors=encoding_errors, encoding=encoding, newline=""
            )
        elif is_text:
            # No explicit encoding
            f = open(path_or_buf, mode, errors=encoding_errors, newline="")
        else:
            # Binary mode
            f = open(path_or_buf, mode)

So I think we should support the errors parameter, because _get_handle already uses the errors argument.

shigemk2 · 2019-09-02T08:46:07Z

@WillAyd @TomAugspurger
What do you think about my pull request?

…v#27750) encoding_errors : str, default 'strict' Behavior when the input string can’t be converted according to the encoding’s rules (strict, ignore, replace, etc.) See: https://docs.python.org/3/library/codecs.html#codec-base-classes
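To make the handler names concrete, here is a small sketch of how the standard codec error handlers behave when encoding to ASCII. It uses str.encode directly rather than to_csv, and the sample string is an assumption chosen for illustration.

```python
text = "café"

# "strict" is the default handler and raises on unencodable characters.
try:
    text.encode("ascii", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# The other handlers trade fidelity for a result instead of an exception.
ignored = text.encode("ascii", errors="ignore")
replaced = text.encode("ascii", errors="replace")
escaped = text.encode("ascii", errors="backslashreplace")
as_xml = text.encode("ascii", errors="xmlcharrefreplace")

print(strict_failed, ignored, replaced, escaped, as_xml)
```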

TomAugspurger · 2019-09-03T18:03:52Z

I'm leaning towards accepting the PR. @WillAyd does that sound OK?

WillAyd · 2019-09-03T21:10:00Z

While I think this PR in a nutshell is well done I think I'm still against the principle of adding this. For consistency wouldn't we then want to add to read_csv and other text based methods? Also do we see any concern with the fact that errors is the keyword argument to open but we now have encoding_errors in our API?

I just think to comprehensively support this would take a lot more effort than it's worth so would rather just document how to do it instead of semi-supporting it via this method

TomAugspurger · 2019-09-03T21:24:39Z

Thanks @WillAyd. All fair points.

In the past, we've come up against issues where we want to pass additional arguments through to the underlying filesystem calls. I wonder if we can come up with a generic API here.

df.to_csv("foo.csv", storage_options={"mode": "w", "errors": "replace"})

Thoughts on that kind of API? It's a bit fuzzy on where exactly storage_options would go, but thoughts on the general approach?
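One way such a dict could be forwarded is sketched below. The helper open_with_options is hypothetical and not part of pandas; it only shows the dict being unpacked into open, with defaults that a CSV writer would likely want.

```python
import os
import tempfile


def open_with_options(path, storage_options=None):
    """Open ``path`` with caller-supplied keyword arguments for open().

    Hypothetical helper for illustration; not part of pandas.
    """
    # Defaults suited to CSV output; caller-supplied keys take precedence.
    options = {"mode": "w", "newline": ""}
    options.update(storage_options or {})
    return open(path, **options)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "foo.csv")
    opts = {"mode": "w", "encoding": "ascii", "errors": "replace"}
    with open_with_options(target, opts) as handle:
        handle.write("name\ncafé\n")
    with open(target, "rb") as handle:
        raw = handle.read()

print(raw)
```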

WillAyd · 2019-09-03T21:26:50Z

Maybe we just add kwargs and dictate that kwargs get passed through to the filepath_or_buffer argument? Not sure if that has been discussed in the past but seems natural

TomAugspurger · 2019-09-04T15:17:50Z

Not sure if it's been discussed before. It's certainly possible. Which do you think is easier to document, storage_options or **kwargs? IMO, they're about the same...

Do any of our readers / writers already accept **kwargs?

shigemk2 · 2019-09-05T05:34:36Z

@WillAyd @TomAugspurger
IMO, we should talk only about to_csv in order not to go off the rails.
Of course, we have to talk about read_csv (and read_excel) error handling.
But my PR does not have any impact on read_csv.

TomAugspurger · 2019-09-05T11:12:20Z

We want to provide a consistent API for all our readers, though. We need to design something that would work for more than just to_csv.


jreback · 2019-09-05T11:46:27Z

I like @TomAugspurger's suggestion of a dict of storage_options, composing keywords for encoding, errors, newline I think. It would require some deprecation / translation period though. We are already doing this for compression options: #26024.

We could accept a dict (simpler), or make a named tuple StorageOptions (and change the compression above to be the same).
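A rough sketch of the named-tuple variant, assuming it would carry only the text-mode arguments mentioned above (encoding, errors, newline). The field set and defaults are guesses for illustration, not a design that was agreed on.

```python
import os
import tempfile
from typing import NamedTuple, Optional


class StorageOptions(NamedTuple):
    """Hypothetical container for text-mode arguments to open()."""

    encoding: Optional[str] = None
    errors: Optional[str] = None
    newline: Optional[str] = ""


opts = StorageOptions(encoding="ascii", errors="ignore")

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "foo.csv")
    # _asdict() turns the tuple into keyword arguments for open().
    with open(target, "w", **opts._asdict()) as handle:
        handle.write("café\n")
    with open(target, "rb") as handle:
        raw = handle.read()

print(raw)
```

Compared with a plain dict, the tuple fixes the set of accepted keys up front, at the cost of one more public type.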

WillAyd · 2019-09-05T15:10:30Z

So as far as this PR goes I would suggest going the documentation route, showing how you can pass this parameter to a file handle prior to a read_csv call. Then we can open up a follow up issue for the larger effort of storage_options generically in our IO.
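On the reading side, the documented approach might look like the sketch below. It uses the standard-library csv reader as a stand-in so that it runs without pandas; the file contents are made up for illustration.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "latin1.csv")

    # Write bytes that are valid Latin-1 but not valid UTF-8.
    with open(source, "wb") as handle:
        handle.write(b"name\ncaf\xe9\n")

    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    with open(source, encoding="utf-8", errors="replace", newline="") as handle:
        rows = list(csv.reader(handle))
        # With pandas, the handle could be passed as pd.read_csv(handle) instead.

print(rows)
```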

Does that sound good @shigemk2 ?

shigemk2 · 2019-09-08T07:02:04Z

@WillAyd
I do not know how to handle errors in read_csv, so I want to close this PR.

TomAugspurger · 2019-09-10T21:05:20Z

OK. I opened #28377 as a followup.

Sorry this didn't end up making it in @shigemk2.

shigemk2 force-pushed the to_csv_error branch from 3aea10c to fc19062 Compare August 14, 2019 04:00

TomAugspurger reviewed Aug 14, 2019

shigemk2 force-pushed the to_csv_error branch from fc19062 to 9f6d207 Compare August 15, 2019 10:18

shigemk2 changed the title ~~Add errors option in pandas.DataFrame.to_csv~~ EHN: Add errors option in pandas.DataFrame.to_csv (#27750) Aug 15, 2019

shigemk2 force-pushed the to_csv_error branch from 9f6d207 to b92b243 Compare August 15, 2019 13:03

jreback requested changes Aug 15, 2019

doc/source/whatsnew/v1.0.0.rst (outdated)

pandas/core/generic.py (outdated)

jreback added Enhancement IO CSV read_csv, to_csv labels Aug 15, 2019

shigemk2 force-pushed the to_csv_error branch from b92b243 to cbfb1ec Compare August 15, 2019 13:49

shigemk2 changed the title ~~EHN: Add errors option in pandas.DataFrame.to_csv (#27750)~~ EHN: Add encoding_errors option in pandas.DataFrame.to_csv (#27750) Aug 15, 2019

shigemk2 force-pushed the to_csv_error branch 2 times, most recently from ff42832 to ea27b7f Compare August 15, 2019 13:55

TomAugspurger reviewed Aug 19, 2019

TomAugspurger approved these changes Aug 19, 2019

jreback requested changes Aug 19, 2019

pandas/core/generic.py

pandas/io/common.py

pandas/tests/io/formats/test_to_csv.py

shigemk2 force-pushed the to_csv_error branch from ea27b7f to 55e572d Compare August 19, 2019 11:59

shigemk2 force-pushed the to_csv_error branch from 55e572d to 1e01245 Compare August 19, 2019 12:50

shigemk2 force-pushed the to_csv_error branch from 1e01245 to 959eee3 Compare August 19, 2019 13:16

TomAugspurger reviewed Aug 19, 2019

doc/source/user_guide/io.rst (outdated)

shigemk2 force-pushed the to_csv_error branch from 959eee3 to 3bbb6f3 Compare August 20, 2019 04:34

shigemk2 force-pushed the to_csv_error branch from 3bbb6f3 to b4f6929 Compare September 2, 2019 08:55

TomAugspurger mentioned this pull request Sep 10, 2019

File options for IO methods #28377

Open

TomAugspurger closed this Sep 10, 2019

TomAugspurger mentioned this pull request Mar 16, 2020

BUG: Add errors argument to to_csv() call to enable error handling for encoders #32702

Merged

5 tasks

davidleejy mentioned this pull request Jan 7, 2021

ENH: Add encoding errors option in pandas.read_csv #39017

Closed
