BUG: encoding error in to_csv compression #21300
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master   #21300      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%
==========================================
  Files         153      153
  Lines       49546    49549       +3
==========================================
+ Hits        45509    45512       +3
  Misses       4037     4037
Continue to review full report at Codecov.
Couple minor edits but otherwise lgtm. Can you see if this also fixes #21118?
pandas/tests/frame/test_to_csv.py
Outdated
df = DataFrame([[0.123456, 0.234567, 0.567567],
                [12.32112, 123123.2, 321321.2]],
               index=['A', 'B'], columns=['X', 'Y', 'Z'])
@pytest.mark.parametrize('frame, encoding', [
Use `df` instead of `frame` and remove the whitespace between the parameter names here
sure.
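A minimal sketch of the shape being asked for (the renamed `df` parameter and the whitespace-free `'df,encoding'` string); the trivial body and the `tmp_path` fixture are stand-ins, not the PR's actual test:

```python
import pytest
from pandas import DataFrame


@pytest.mark.parametrize('df,encoding', [
    (DataFrame([[0.123456, 0.234567, 0.567567],
                [12.32112, 123123.2, 321321.2]],
               index=['A', 'B'], columns=['X', 'Y', 'Z']), None),
])
def test_to_csv_compression(df, encoding, tmp_path):
    # stand-in body: write with the given encoding and check the file
    # is non-empty; the real test also exercises a compression fixture
    out = str(tmp_path / 'out.csv')
    df.to_csv(out, encoding=encoding)
    with open(out) as fh:
        assert fh.read()
```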
pandas/tests/series/test_io.py
Outdated
s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
           name='X')
@pytest.mark.parametrize('s, encoding', [
Remove whitespace between parameter names
done.
doc/source/whatsnew/v0.23.1.txt
Outdated
@@ -92,6 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :meth:`DataFrame.to_csv` using compression causes encoding error (:issue:`21241`)
Could also mention Series.to_csv here
added.
pandas/tests/frame/test_to_csv.py
Outdated
with ensure_clean() as filename:

    df.to_csv(filename, compression=compression)
    frame.to_csv(filename, compression=compression, encoding=encoding)

    # test the round trip - to_csv -> read_csv
    rs = read_csv(filename, compression=compression,
Can you change `rs` to `result`?
done.
doc/source/whatsnew/v0.23.1.txt
Outdated
@@ -92,6 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` using compression causes encoding error (:issue:`21241`, :issue:`21118`)
this is when encoding & compression are specified yes? (pls say that)
added.
pandas/io/formats/csvs.py
Outdated
if hasattr(self.path_or_buf, 'write'):
# PR 21300 uses string buffer to receive csv writing and dump into
# file-like output with compression as option.
if not is_file_like(self.path_or_buf):
is it possible to combine these first 2 cases? (just or the condition)
not exactly the same but made it more compact.
(Series(["123", u"你好", u"世界"], name=u"中文"), 'gb2312'), | ||
(Series(["123", u"Γειά σου", u"Κόσμε"], name=u"Ελληνικά"), 'cp737') | ||
]) | ||
def test_to_csv_compression(self, s, encoding, compression): | ||
can you add the relevant issues here as comments (e.g. the issues you are closing)
    (DataFrame(5 * [[123, u"Γειά σου", u"Κόσμε"]],
               columns=['X', 'Y', 'Z']), 'cp737')
])
def test_to_csv_compression(self, df, encoding, compression):
same comment as below, add the issue numbers as comments
ok this looks good
just make sure that we don't have any unclosed resources (we have a couple now in the 3.6 tests)
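As an aside, one generic way to make unclosed handles fail loudly when running the tests locally (standard Python 3 behaviour, not something this PR adds):

```python
import warnings

# CPython emits a ResourceWarning when a file object is garbage-collected
# while still open; promoting it to an error surfaces leaked handles.
warnings.simplefilter('error', ResourceWarning)
```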
@WillAyd there have been changes since you last reviewed it. comments welcome.
pandas/io/formats/csvs.py
Outdated
# file-like output with compression as option. GH 21241, 21118
f = StringIO()
close = True
if not is_file_like(self.path_or_buf):
What types of objects would you expect to go through the particular branches here? Comments would be helpful
sure
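Roughly, the three branches distinguish a plain path, a named file handle, and an in-memory buffer. A hedged standalone sketch of that branching (not the pandas source; only `is_file_like` is a real pandas helper, from `pandas.api.types`):

```python
from io import StringIO

from pandas.api.types import is_file_like


def resolve_target(path_or_buf):
    # illustration of the branching being commented on, not the real code
    if not is_file_like(path_or_buf):
        # a plain filesystem path (str): a fresh handle can be opened
        # later with the requested encoding/compression
        return path_or_buf
    elif hasattr(path_or_buf, 'name'):
        # an open handle backed by a real file: its name lets a second
        # handle be opened with compression/encoding applied
        return path_or_buf.name
    else:
        # an unnamed in-memory buffer (e.g. StringIO): nothing to reopen,
        # so the buffered csv text is written into it directly
        return None


print(resolve_target('out.csv'))    # path branch   -> 'out.csv'
print(resolve_target(StringIO()))   # buffer branch -> None
```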
Thanks, this helps. So while calling close isn't as strictly required for IO objects, I suppose there isn't much harm in doing so either.
Is it possible to simplify the code if we got rid of the `close` variable altogether and just lived with that being called for IO objects?
very good point. though I feel we probably need to keep it to decide whether or not to write out to a file object.
pandas/io/formats/csvs.py
Outdated
f, handles = _get_handle(self.path_or_buf, self.mode,
                         encoding=encoding,
                         compression=None)
close = True if self.compression is None else False

try:
Not related to this change, but ideally this should have an explicit `except` here (can do in a separate PR)
It would be great if we can leave that to a separate PR. Don't want to overcomplicate things.
if not is_file_like(self.path_or_buf):
    # path_or_buf is path
    path_or_buf = self.path_or_buf
elif hasattr(self.path_or_buf, 'name'):
Is this actually required? If we already have a handle aren't we just round tripping here anyway by converting the handle to a name then back to a handle?
Hi, I think it's required because we need to generate a new handle from `_get_handle` (potentially with compression and encoding requirements) when an external handle is passed. Currently, when a file handle is passed, no compression is done even when specified (#21227).
Aren't we leaving a file handle open then? To illustrate:
>>> f1 = open('foo.txt')
>>> name = f1.name
>>> f2, handles = _get_handle(name, 'r')
>>> f2.close()
>>> f1.closed
False
>>> f2.closed
True
AFAICT with your code you eventually would close the second object but the first would stay open, no?
the first one is up to the user to close via f.close(), no? hope below is a good example.
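Something along these lines, presumably (a minimal assumed example; the file name is a placeholder):

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3]})

# The caller opened this handle, so closing it is the caller's job;
# any second handle pandas opens internally is pandas' to close.
with open('out.csv', 'w') as fh:
    df.to_csv(fh)
# leaving the with-block is what finally closes fh
```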
Hmm OK I see your point, but do we have test coverage for all of the branches? It just seems a little strange to me that we are opening two handles to the same location when a handle gets passed to this function. I feel like that's asking for some strange behavior to pop up, but would feel more comfortable if the test coverage was there to hit all three branches. lmk
I see your point, though the two file handles have separate buffered IO and states. Added tests using a file handle as input and asserting equivalence. We also have a compress_size test with a file handle showing that this fixes #21227.
We already have tests for path and buffer IO as they are existing use cases.
According to issue #21561, we have a bug if the file is `sys.stdout`. Basically the library creates a file named `<stdout>` and writes stuff into it.
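The reported reproduction is presumably along these lines (paraphrased from the issue, not verified here):

```python
import sys

import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3]})

# sys.stdout has name '<stdout>', so the name-based branch above can treat
# it like a real file, and the output may land in a file literally called
# '<stdout>' instead of being printed to the console.
df.to_csv(sys.stdout)
```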
looks ok to me. @WillAyd pls merge when satisfied.
Thanks @minggli !
(cherry picked from commit b32fdc4)
closes #21118 (to_csv failing with encoding='utf-16')
git diff upstream/master -u -- "*.py" | flake8 --diff
Fix a problem where encoding wasn't handled properly in to_csv compression on Python 3. It was caused by dumping an uncompressed csv to disk and reading it back into memory without passing the specified encoding; the platform then tried to decode using the default locale encoding, which may or may not succeed.
This PR adds tests for non-ascii data with csv compression. Also, by using a string buffer, it removes the repeated disk roundtrip and the redundant encoding/decoding that caused UnicodeDecodeError. There is a performance improvement compared to 0.22 and 0.23. It also supports file-like objects as input for path_or_buf.
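As a concrete illustration of the behaviour this targets (a sketch; the file name and sample data are placeholders):

```python
import pandas as pd

df = pd.DataFrame({u"中文": [u"你好", u"世界", u"123"]})

# Before the fix this kind of call could raise UnicodeDecodeError on
# Python 3, because the uncompressed csv was re-read from disk without
# the requested encoding before being compressed.
df.to_csv("out.csv.gz", compression="gzip", encoding="utf-16")

# Round trip to confirm the compressed file decodes with the same encoding.
result = pd.read_csv("out.csv.gz", compression="gzip",
                     encoding="utf-16", index_col=0)
print(result)
```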
Before this PR:
After this PR: