
BUG: encoding error in to_csv compression #21300


Merged
merged 19 commits into pandas-dev:master from minggli:bugfix/to_csv_encoding on Jun 5, 2018

Conversation

@minggli (Contributor) commented Jun 3, 2018

Fix a problem where encoding wasn't handled properly by to_csv compression in Python 3. It was caused by dumping the uncompressed CSV to disk and reading it back into memory without passing the specified encoding; the platform then tried to decode it using the default locale encoding, which may or may not succeed.

This PR adds tests for non-ASCII data with CSV compression. By using a string buffer, it also removes the repeated disk round trip and the redundant encoding/decoding that caused the UnicodeDecodeError. There is a performance improvement compared to 0.22 and 0.23, and file-like objects are now supported as path_or_buf.
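The failure mode described above can be reproduced without pandas. The following is an illustrative sketch of the round-trip problem, not the actual pandas code path:

```python
import os
import tempfile

# The CSV is first written to disk in the requested encoding, then
# (pre-fix) read back WITHOUT that encoding, so Python falls back to
# a default codec that may not understand the bytes.
text = "123\tabc\t样本1\t样本2\n"
path = os.path.join(tempfile.mkdtemp(), "tmp.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading back with an explicit single-byte codec stands in for a
# non-UTF-8 locale default; the UTF-8 bytes fail to decode.
try:
    with open(path, encoding="ascii") as f:
        f.read()
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Passing the original encoding back, as the fix does, succeeds.
with open(path, encoding="utf-8") as f:
    recovered = f.read()
```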

Before this PR:

>>> df = DataFrame(100 * [[123, "abc", u"样本1", u"样本2"]], columns=['A', 'B', 'C', 'D'])
>>>
>>> def test_to_csv(df):
...     df.to_csv(
...         path_or_buf='test',
...         encoding='utf8',
...         compression='zip',
...         quoting=1,
...         sep='\t',
...         index=False)
...
>>> timeit(lambda: test_to_csv(df), number=5000)
11.856349980007508

After this PR:

>>> df = DataFrame(100 * [[123, "abc", u"样本1", u"样本2"]], columns=['A', 'B', 'C', 'D'])
>>>
>>> def test_to_csv(df):
...     df.to_csv(
...         path_or_buf='test',
...         encoding='utf8',
...         compression='zip',
...         quoting=1,
...         sep='\t',
...         index=False)
...
>>> timeit(lambda: test_to_csv(df), number=5000)
5.459916951993364
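The buffer-based strategy can be sketched independently of pandas; write_csv_zip and read_csv_zip below are hypothetical helpers, not pandas API:

```python
import io
import os
import tempfile
import zipfile

def write_csv_zip(rows, path, encoding="utf-8"):
    # Accumulate the rendered CSV in memory instead of on disk.
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(str(v) for v in row) + "\n")
    # Encode exactly once with the caller's encoding, then compress;
    # no uncompressed temp file, no locale-dependent decode.
    data = buf.getvalue().encode(encoding)
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("data.csv", data)

def read_csv_zip(path, encoding="utf-8"):
    with zipfile.ZipFile(path) as zf:
        return zf.read(zf.namelist()[0]).decode(encoding)

path = os.path.join(tempfile.mkdtemp(), "test.zip")
write_csv_zip([[123, "abc", "样本1", "样本2"]], path)
roundtrip = read_csv_zip(path)
```

Keeping the text in a StringIO until the final encode is also where the benchmark improvement above comes from: the pre-fix path wrote an uncompressed file and re-read it before compressing.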

@minggli minggli force-pushed the bugfix/to_csv_encoding branch from 47fc509 to 9d5c25b Compare June 3, 2018 10:11
@codecov bot commented Jun 3, 2018

Codecov Report

Merging #21300 into master will increase coverage by <.01%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #21300      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49546    49549       +3     
==========================================
+ Hits        45509    45512       +3     
  Misses       4037     4037
Flag Coverage Δ
#multiple 90.25% <100%> (ø) ⬆️
#single 41.87% <57.14%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/io/formats/csvs.py 98.14% <100%> (+0.01%) ⬆️
pandas/core/indexes/interval.py 93.16% <0%> (+0.02%) ⬆️


@WillAyd (Member) left a comment:

Couple minor edits but otherwise lgtm. Can you see if this also fixes #21118?

df = DataFrame([[0.123456, 0.234567, 0.567567],
                [12.32112, 123123.2, 321321.2]],
               index=['A', 'B'], columns=['X', 'Y', 'Z'])
@pytest.mark.parametrize('frame, encoding', [
Member: Use df instead of frame and remove the whitespace between the parameter names here

Contributor Author: sure.


s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
           name='X')
@pytest.mark.parametrize('s, encoding', [
Member: Remove whitespace between parameter names

Contributor Author: done.

@@ -92,6 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :meth:`DataFrame.to_csv` using compression causes encoding error (:issue:`21241`)
Member: Could also mention Series.to_csv here

Contributor Author: added.


with ensure_clean() as filename:
-    df.to_csv(filename, compression=compression)
+    frame.to_csv(filename, compression=compression, encoding=encoding)

    # test the round trip - to_csv -> read_csv
    rs = read_csv(filename, compression=compression,
Member: Can you change rs to result?

Contributor Author: done.

@jreback jreback added the IO CSV read_csv, to_csv label Jun 3, 2018
@jreback jreback added this to the 0.23.1 milestone Jun 3, 2018
@@ -92,6 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` using compression causes encoding error (:issue:`21241`, :issue:`21118`)
Contributor: this is when encoding & compression are specified yes? (pls say that)

Contributor Author: added.

if hasattr(self.path_or_buf, 'write'):
# PR 21300 uses string buffer to receive csv writing and dump into
# file-like output with compression as option.
if not is_file_like(self.path_or_buf):
Contributor: is it possible to combine these first 2 cases? (just or the condition)

Contributor Author: not exactly the same, but made it more compact.

(Series(["123", u"你好", u"世界"], name=u"中文"), 'gb2312'),
(Series(["123", u"Γειά σου", u"Κόσμε"], name=u"Ελληνικά"), 'cp737')
])
def test_to_csv_compression(self, s, encoding, compression):

Contributor: can you add the relevant issues here as comments (e.g. the issues you are closing)

(DataFrame(5 * [[123, u"Γειά σου", u"Κόσμε"]],
           columns=['X', 'Y', 'Z']), 'cp737')
])
def test_to_csv_compression(self, df, encoding, compression):

Contributor: same comment as below, add the issue numbers as comments

@jreback (Contributor) left a comment:

ok this looks good
just make sure that we don't have any unclosed resources (we have a couple now in the 3.6 tests)

@minggli (Contributor Author) commented Jun 4, 2018

@WillAyd there have been changes since you last reviewed; comments welcome.

# file-like output with compression as option. GH 21241, 21118
f = StringIO()
close = True
if not is_file_like(self.path_or_buf):
Member: What types of objects would you expect to go through the particular branches here? Comments would be helpful

Contributor Author: sure

Member: Thanks, this helps. So while it's not explicitly required to call close on IO objects, I suppose there isn't much harm in doing so either.

Is it possible to simplify the code if we got rid of the close variable altogether and just lived with close being called for IO objects?

@minggli (Contributor Author) commented Jun 4, 2018

very good point. though I feel we probably need to keep it to decide whether or not to write to an output file object.
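The ownership rule being discussed here (only close handles you opened yourself, leave caller-supplied handles open) can be sketched as follows; write_text and its close flag are illustrative, not the actual pandas implementation:

```python
import io
import os
import tempfile

def write_text(path_or_buf, text, encoding="utf-8"):
    # A caller-supplied handle stays open so the caller keeps control
    # of its lifetime; a handle we open ourselves gets closed here.
    if hasattr(path_or_buf, "write"):
        f, close = path_or_buf, False
    else:
        f, close = open(path_or_buf, "w", encoding=encoding), True
    try:
        f.write(text)
    finally:
        if close:
            f.close()

buf = io.StringIO()
write_text(buf, "hello")      # caller-owned: left open

path = os.path.join(tempfile.mkdtemp(), "t.txt")
write_text(path, "hi")        # opened by us: closed for the caller
with open(path, encoding="utf-8") as f:
    content = f.read()
```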

f, handles = _get_handle(self.path_or_buf, self.mode,
                         encoding=encoding,
                         compression=None)
close = True if self.compression is None else False

try:
Member: Not related to this change, but ideally there should be an explicit except here (can do in a separate PR)

Contributor Author: It would be great if we can leave that to a separate PR. Don't want to overcomplicate things.

if not is_file_like(self.path_or_buf):
    # path_or_buf is path
    path_or_buf = self.path_or_buf
elif hasattr(self.path_or_buf, 'name'):
Member: Is this actually required? If we already have a handle, aren't we just round-tripping here anyway by converting the handle to a name and then back to a handle?

@minggli (Contributor Author) commented Jun 4, 2018

Hi, I think it's required because we need to generate a new handle from _get_handle (potentially with compression and encoding requirements) when an external handle is passed. Currently, when a file handle is passed, no compression is done even when specified (#21227).

@WillAyd (Member) commented Jun 4, 2018

Aren't we leaving a file handle open then? To illustrate:

>>> f1 = open('foo.txt')
>>> name = f1.name
>>> f2, handles = _get_handle(name, 'r')
>>> f2.close()
>>> f1.closed
False
>>> f2.closed
True

AFAICT with your code you would eventually close the second object, but the first would stay open, no?

Contributor Author: closing the first is up to the user via f.close(), no? hope below is a good example.

https://github.com/pandas-dev/pandas/pull/21300/files#diff-a29a0a64b2800c1c30a53a178d871645

Member: Hmm, OK, I see your point, but do we have test coverage for all of the branches? It just seems a little strange to me that we are opening two handles to the same location when a handle gets passed to this function. I feel like that's asking for some strange behavior to pop up, but I would feel more comfortable if the test coverage was there to hit all three branches. lmk

Contributor Author: see your point, though the two file handles have separate buffers and states. added tests using a file handle as input and asserting equivalence. we also have a compressed-size test with a file handle showing the fix for #21227.

we already have tests for path and buffer IO as they are existing use cases.

Contributor: According to issue #21561 we have a bug if the file is sys.stdout: the library creates a file named <stdout> and writes stuff into it.
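The sys.stdout pitfall comes from re-opening a handle by its name attribute: sys.stdout.name is the pseudo-path "<stdout>", which open() happily treats as a relative filename. A hedged sketch of a guard against this (usable_path is hypothetical, not pandas code):

```python
import io

def usable_path(path_or_buf):
    # Decide whether the target can safely be re-opened by filesystem
    # path. sys.stdout carries the pseudo-path "<stdout>"; open()-ing
    # that string would create a literal file named "<stdout>".
    if not hasattr(path_or_buf, "write"):
        return path_or_buf                    # already a plain path
    name = getattr(path_or_buf, "name", None)
    if isinstance(name, str) and not (name.startswith("<")
                                      and name.endswith(">")):
        return name                           # real file on disk
    return None                               # unnamed / pseudo stream

class _FakeStdout:
    # Stand-in for sys.stdout so the sketch needs no real console.
    name = "<stdout>"
    def write(self, s):
        return len(s)

plain = usable_path("out.csv")
buffered = usable_path(io.StringIO())
console = usable_path(_FakeStdout())
```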

@jreback (Contributor) commented Jun 4, 2018

looks ok to me. @WillAyd pls merge when satisfied.

@WillAyd WillAyd merged commit b32fdc4 into pandas-dev:master Jun 5, 2018
@WillAyd (Member) commented Jun 5, 2018

Thanks @minggli !

Successfully merging this pull request may close these issues: "Issue with compression in to_csv method" and "to_csv failing with encoding='utf-16'".