
BUG: encoding error in to_csv compression #21300


Merged
merged 19 commits into pandas-dev:master from minggli:bugfix/to_csv_encoding on Jun 5, 2018

Conversation

@minggli (Contributor) commented Jun 3, 2018

Fix a problem where encoding wasn't handled properly by to_csv compression in Python 3. It was caused by dumping the uncompressed CSV to disk and reading it back into memory without passing the specified encoding; the platform then tried to decode it using the default locale encoding, which may or may not succeed.

This PR adds tests for non-ASCII data with CSV compression. By using a string buffer, it also removes the repeated disk round trip and the redundant encoding/decoding that caused the UnicodeDecodeError. There is a performance improvement compared to 0.22 and 0.23, and file-like objects are now supported as path_or_buf.
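The failure mode described above can be reproduced without pandas. The following is an illustrative sketch of the round-trip problem, not the actual pandas code path:

```python
import os
import tempfile

# The CSV is first written to disk in the requested encoding, then
# (pre-fix) read back WITHOUT that encoding, so Python falls back to
# a default codec that may not understand the bytes.
text = "123\tabc\t样本1\t样本2\n"
path = os.path.join(tempfile.mkdtemp(), "tmp.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back with an explicit single-byte codec stands in for a
# non-UTF-8 locale default; the UTF-8 bytes fail to decode.
try:
    with open(path, encoding="ascii") as f:
        f.read()
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Passing the original encoding back, as the fix does, succeeds.
with open(path, encoding="utf-8") as f:
    recovered = f.read()
```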

Before this PR:

>>> df = DataFrame(100 * [[123, "abc", u"样本1", u"样本2"]], columns=['A', 'B', 'C', 'D'])
>>>
>>> def test_to_csv(df):
...     df.to_csv(
...         path_or_buf='test',
...         encoding='utf8',
...         compression='zip',
...         quoting=1,
...         sep='\t',
...         index=False)
...
>>> timeit(lambda: test_to_csv(df), number=5000)
11.856349980007508

After this PR:

>>> df = DataFrame(100 * [[123, "abc", u"样本1", u"样本2"]], columns=['A', 'B', 'C', 'D'])
>>>
>>> def test_to_csv(df):
...     df.to_csv(
...         path_or_buf='test',
...         encoding='utf8',
...         compression='zip',
...         quoting=1,
...         sep='\t',
...         index=False)
...
>>> timeit(lambda: test_to_csv(df), number=5000)
5.459916951993364
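The buffer-based strategy can be sketched independently of pandas; write_csv_zip and read_csv_zip below are hypothetical helpers, not pandas API:

```python
import io
import os
import tempfile
import zipfile

def write_csv_zip(rows, path, encoding="utf-8"):
    # Accumulate the rendered CSV in memory instead of on disk.
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(v) for v in row) + "\n")
    # Encode exactly once with the caller's encoding, then compress;
    # no uncompressed temp file, no locale-dependent decode.
    data = buf.getvalue().encode(encoding)
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.csv", data)

def read_csv_zip(path, encoding="utf-8"):
    with zipfile.ZipFile(path) as zf:
        return zf.read(zf.namelist()[0]).decode(encoding)

path = os.path.join(tempfile.mkdtemp(), "test.zip")
write_csv_zip([[123, "abc", "样本1", "样本2"]], path)
roundtrip = read_csv_zip(path)
```

Keeping the text in a StringIO until the final encode is also where the benchmark improvement above comes from: the pre-fix path wrote an uncompressed file and re-read it before compressing.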

@minggli minggli force-pushed the bugfix/to_csv_encoding branch from 47fc509 to 9d5c25b Compare June 3, 2018 10:11
@codecov bot commented Jun 3, 2018

Codecov Report

Merging #21300 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #21300      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49549       +3     
==========================================
+ Hits        45509    45512       +3     
  Misses       4037     4037
Flag Coverage Δ
#multiple 90.25% <100%> (ø) ⬆️
#single 41.87% <57.14%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/csvs.py 98.14% <100%> (+0.01%) ⬆️
pandas/core/indexes/interval.py 93.16% <0%> (+0.02%) ⬆️


@WillAyd (Member) left a comment:

Couple minor edits but otherwise lgtm. Can you see if this also fixes #21118?

df = DataFrame([[0.123456, 0.234567, 0.567567],
                [12.32112, 123123.2, 321321.2]],
               index=['A', 'B'], columns=['X', 'Y', 'Z'])
@pytest.mark.parametrize('frame, encoding', [
Member: Use df instead of frame and remove the whitespace between the parameter names here

Contributor Author: sure.


s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
           name='X')
@pytest.mark.parametrize('s, encoding', [
Member: Remove whitespace between parameter names

Contributor Author: done.

@@ -92,6 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :meth:`DataFrame.to_csv` using compression causes encoding error (:issue:`21241`)
Member: Could also mention Series.to_csv here

Contributor Author: added.


with ensure_clean() as filename:
-    df.to_csv(filename, compression=compression)
+    frame.to_csv(filename, compression=compression, encoding=encoding)

    # test the round trip - to_csv -> read_csv
    rs = read_csv(filename, compression=compression,
Member: Can you change rs to result?

Contributor Author: done.

@jreback jreback added the IO CSV read_csv, to_csv label Jun 3, 2018
@jreback jreback added this to the 0.23.1 milestone Jun 3, 2018
@@ -92,6 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` using compression causes encoding error (:issue:`21241`, :issue:`21118`)
Contributor: this is when encoding & compression are specified yes? (pls say that)

Contributor Author: added.

if hasattr(self.path_or_buf, 'write'):
# PR 21300 uses string buffer to receive csv writing and dump into
# file-like output with compression as option.
if not is_file_like(self.path_or_buf):
Contributor: is it possible to combine these first 2 cases? (just or the condition)

Contributor Author: not exactly the same, but made it more compact.

(Series(["123", u"你好", u"世界"], name=u"中文"), 'gb2312'),
(Series(["123", u"Γειά σου", u"Κόσμε"], name=u"Ελληνικά"), 'cp737')
])
def test_to_csv_compression(self, s, encoding, compression):

Contributor: can you add the relevant issues here as comments (e.g. the issues you are closing)

(DataFrame(5 * [[123, u"Γειά σου", u"Κόσμε"]],
           columns=['X', 'Y', 'Z']), 'cp737')
])
def test_to_csv_compression(self, df, encoding, compression):

Contributor: same comment as below, add the issue numbers as comments

@jreback (Contributor) left a comment:

ok this looks good
just make sure that we don't have any unclosed resources (we have a couple now in the 3.6 tests)

@minggli (Contributor Author) commented Jun 4, 2018

@WillAyd there have been changes since you last reviewed; comments welcome.

# file-like output with compression as option. GH 21241, 21118
f = StringIO()
close = True
if not is_file_like(self.path_or_buf):
Member: What types of objects would you expect to go through the particular branches here? Comments would be helpful

Contributor Author: sure

Member: Thanks, this helps. So while it's not explicitly required to call close on IO objects, I suppose there isn't much harm in doing so either.

Is it possible to simplify the code if we got rid of the close variable altogether and just lived with close being called for IO objects?

@minggli (Contributor Author) commented Jun 4, 2018

very good point. though I feel we probably need to keep it to decide whether or not to write to an output file object.
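The ownership rule being discussed here (only close handles you opened yourself, leave caller-supplied handles open) can be sketched as follows; write_text and its close flag are illustrative, not the actual pandas implementation:

```python
import io
import os
import tempfile

def write_text(path_or_buf, text, encoding="utf-8"):
    # A caller-supplied handle stays open so the caller keeps control
    # of its lifetime; a handle we open ourselves gets closed here.
    if hasattr(path_or_buf, "write"):
        f, close = path_or_buf, False
    else:
        f, close = open(path_or_buf, "w", encoding=encoding), True
    try:
        f.write(text)
    finally:
        if close:
            f.close()

buf = io.StringIO()
write_text(buf, "hello")      # caller-owned: left open

path = os.path.join(tempfile.mkdtemp(), "t.txt")
write_text(path, "hi")        # opened by us: closed for the caller
with open(path, encoding="utf-8") as f:
    content = f.read()
```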

f, handles = _get_handle(self.path_or_buf, self.mode,
                         encoding=encoding,
                         compression=None)
close = True if self.compression is None else False

try:
Member: Not related to this change, but ideally there should be an explicit except here (can do in a separate PR)

Contributor Author: It would be great if we can leave that to a separate PR. Don't want to overcomplicate things.

if not is_file_like(self.path_or_buf):
    # path_or_buf is path
    path_or_buf = self.path_or_buf
elif hasattr(self.path_or_buf, 'name'):
Member: Is this actually required? If we already have a handle, aren't we just round-tripping here anyway by converting the handle to a name and then back to a handle?

@minggli (Contributor Author) commented Jun 4, 2018

Hi, I think it's required because we need to generate a new handle from _get_handle (potentially with compression and encoding requirements) when an external handle is passed. Currently, when a file handle is passed, no compression is done even when specified (#21227).

@WillAyd (Member) commented Jun 4, 2018

Aren't we leaving a file handle open then? To illustrate:

>>> f1 = open('foo.txt')
>>> name = f1.name
>>> f2, handles = _get_handle(name, 'r')
>>> f2.close()
>>> f1.closed
False
>>> f2.closed
True

AFAICT with your code you would eventually close the second object, but the first would stay open, no?

Contributor Author: closing the first is up to the user via f.close(), no? hope below is a good example.

https://github.com/pandas-dev/pandas/pull/21300/files#diff-a29a0a64b2800c1c30a53a178d871645

Member: Hmm, OK, I see your point, but do we have test coverage for all of the branches? It just seems a little strange to me that we are opening two handles to the same location when a handle gets passed to this function. I feel like that's asking for some strange behavior to pop up, but I would feel more comfortable if the test coverage was there to hit all three branches. lmk

Contributor Author: see your point, though the two file handles have separate buffers and states. added tests using a file handle as input and asserting equivalence. we also have a compressed-size test with a file handle showing the fix for #21227.

we already have tests for path and buffer IO as they are existing use cases.

Contributor: According to issue #21561 we have a bug if the file is sys.stdout: the library creates a file named <stdout> and writes stuff into it.
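The sys.stdout pitfall comes from re-opening a handle by its name attribute: sys.stdout.name is the pseudo-path "<stdout>", which open() happily treats as a relative filename. A hedged sketch of a guard against this (usable_path is hypothetical, not pandas code):

```python
import io

def usable_path(path_or_buf):
    # Decide whether the target can safely be re-opened by filesystem
    # path. sys.stdout carries the pseudo-path "<stdout>"; open()-ing
    # that string would create a literal file named "<stdout>".
    if not hasattr(path_or_buf, "write"):
        return path_or_buf                    # already a plain path
    name = getattr(path_or_buf, "name", None)
    if isinstance(name, str) and not (name.startswith("<")
                                      and name.endswith(">")):
        return name                           # real file on disk
    return None                               # unnamed / pseudo stream

class _FakeStdout:
    # Stand-in for sys.stdout so the sketch needs no real console.
    name = "<stdout>"
    def write(self, s):
        return len(s)

plain = usable_path("out.csv")
buffered = usable_path(io.StringIO())
console = usable_path(_FakeStdout())
```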

@jreback (Contributor) commented Jun 4, 2018

looks ok to me. @WillAyd pls merge when satisfied.

@WillAyd WillAyd merged commit b32fdc4 into pandas-dev:master Jun 5, 2018
@WillAyd (Member) commented Jun 5, 2018

Thanks @minggli !

Successfully merging this pull request may close these issues: "Issue with compression in to_csv method" and "to_csv failing with encoding='utf-16'".