BUG: encoding error in to_csv compression #21300
Merged
Changes from all commits (19 commits):

fdd3ce9  handle encoding type (minggli)
826aa2c  enrich comment (minggli)
fac53c0  add encoding fixture (minggli)
a4de620  remove redundant import (minggli)
b833abd  add encoding fixture (minggli)
b625f08  handle PY2 differently (minggli)
9d5c25b  restore original test case (minggli)
8ed6fa2  update whatsnew (minggli)
2d48d10  default bytes strings unless requires unicode string (minggli)
4a6f5ff  assert filehandle open (minggli)
bf4225c  use buffer and avoid roundtrip and encoding error (minggli)
486f3ff  update whatsnew (minggli)
d8435ef  minor refactor of tests (minggli)
16cc951  clearer categorisation of input type (minggli)
6714e68  more compact if statements (minggli)
f73f9ff  documentation (minggli)
f891533  add comments on path_or_buf types (minggli)
f3c3ea7  remove close (minggli)
4e914ff  test file-handle to_csv with compression and encoding (minggli)
```diff
@@ -919,29 +919,45 @@ def test_to_csv_path_is_none(self):
         recons = pd.read_csv(StringIO(csv_str), index_col=0)
         assert_frame_equal(self.frame, recons)

-    def test_to_csv_compression(self, compression):
-
-        df = DataFrame([[0.123456, 0.234567, 0.567567],
-                        [12.32112, 123123.2, 321321.2]],
-                       index=['A', 'B'], columns=['X', 'Y', 'Z'])
+    @pytest.mark.parametrize('df,encoding', [
+        (DataFrame([[0.123456, 0.234567, 0.567567],
+                    [12.32112, 123123.2, 321321.2]],
+                   index=['A', 'B'], columns=['X', 'Y', 'Z']), None),
+        # GH 21241, 21118
+        (DataFrame([['abc', 'def', 'ghi']], columns=['X', 'Y', 'Z']), 'ascii'),
+        (DataFrame(5 * [[123, u"你好", u"世界"]],
+                   columns=['X', 'Y', 'Z']), 'gb2312'),
+        (DataFrame(5 * [[123, u"Γειά σου", u"Κόσμε"]],
+                   columns=['X', 'Y', 'Z']), 'cp737')
+    ])
+    def test_to_csv_compression(self, df, encoding, compression):

         with ensure_clean() as filename:

-            df.to_csv(filename, compression=compression)
+            df.to_csv(filename, compression=compression, encoding=encoding)

             # test the round trip - to_csv -> read_csv
-            rs = read_csv(filename, compression=compression,
-                          index_col=0)
-            assert_frame_equal(df, rs)
+            result = read_csv(filename, compression=compression,
+                              index_col=0, encoding=encoding)
+
+            with open(filename, 'w') as fh:
+                df.to_csv(fh, compression=compression, encoding=encoding)
+
+            result_fh = read_csv(filename, compression=compression,
+                                 index_col=0, encoding=encoding)
+            assert_frame_equal(df, result)
+            assert_frame_equal(df, result_fh)

             # explicitly make sure file is compressed
             with tm.decompress_file(filename, compression) as fh:
-                text = fh.read().decode('utf8')
+                text = fh.read().decode(encoding or 'utf8')
                 for col in df.columns:
                     assert col in text

             with tm.decompress_file(filename, compression) as fh:
-                assert_frame_equal(df, read_csv(fh, index_col=0))
+                assert_frame_equal(df, read_csv(fh,
+                                                index_col=0,
+                                                encoding=encoding))

     def test_to_csv_date_format(self):
         with ensure_clean('__tmp_to_csv_date_format__') as path:
```

Review comment (on the parametrize cases): same comment as below, add the issue numbers as comments.
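The encoding/compression interaction these tests exercise can be sketched with the standard library alone (no pandas): a CSV payload written in a non-UTF-8 encoding must be encoded before compression and decoded after decompression for the round trip to succeed. The `payload` and filename below are illustrative:

```python
import gzip
import os
import tempfile

# Round-trip a gb2312-encoded CSV through gzip compression, mirroring the
# encoding/compression combination the parametrized test covers.
payload = u"X,Y,Z\n123,你好,世界\n"
encoding = "gb2312"

fd, filename = tempfile.mkstemp(suffix=".csv.gz")
os.close(fd)

# Write: encode the text first, then compress the resulting bytes.
with gzip.open(filename, "wb") as fh:
    fh.write(payload.encode(encoding))

# Read: decompress first, then decode with the same encoding.
with gzip.open(filename, "rb") as fh:
    result = fh.read().decode(encoding)

assert result == payload
os.remove(filename)
```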
```diff
@@ -137,29 +137,45 @@ def test_to_csv_path_is_none(self):
         csv_str = s.to_csv(path=None)
         assert isinstance(csv_str, str)

-    def test_to_csv_compression(self, compression):
-
-        s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
-                   name='X')
+    @pytest.mark.parametrize('s,encoding', [
+        (Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
+                name='X'), None),
+        # GH 21241, 21118
+        (Series(['abc', 'def', 'ghi'], name='X'), 'ascii'),
+        (Series(["123", u"你好", u"世界"], name=u"中文"), 'gb2312'),
+        (Series(["123", u"Γειά σου", u"Κόσμε"], name=u"Ελληνικά"), 'cp737')
+    ])
+    def test_to_csv_compression(self, s, encoding, compression):

         with ensure_clean() as filename:

-            s.to_csv(filename, compression=compression, header=True)
+            s.to_csv(filename, compression=compression, encoding=encoding,
+                     header=True)

             # test the round trip - to_csv -> read_csv
-            rs = pd.read_csv(filename, compression=compression,
-                             index_col=0, squeeze=True)
-            assert_series_equal(s, rs)
+            result = pd.read_csv(filename, compression=compression,
+                                 encoding=encoding, index_col=0, squeeze=True)
+
+            with open(filename, 'w') as fh:
+                s.to_csv(fh, compression=compression, encoding=encoding,
+                         header=True)
+
+            result_fh = pd.read_csv(filename, compression=compression,
+                                    encoding=encoding, index_col=0,
+                                    squeeze=True)
+            assert_series_equal(s, result)
+            assert_series_equal(s, result_fh)

             # explicitly ensure file was compressed
             with tm.decompress_file(filename, compression) as fh:
-                text = fh.read().decode('utf8')
+                text = fh.read().decode(encoding or 'utf8')
                 assert s.name in text

             with tm.decompress_file(filename, compression) as fh:
                 assert_series_equal(s, pd.read_csv(fh,
                                                    index_col=0,
-                                                   squeeze=True))
+                                                   squeeze=True,
+                                                   encoding=encoding))


 class TestSeriesIO(TestData):
```

Review comment (on the parametrize cases): can you add the relevant issues here as comments (e.g. the issues you are closing)?
Reviewer: Is this actually required? If we already have a handle, aren't we just round-tripping here anyway by converting the handle to a name and then back to a handle?
Author (minggli): I think it is required, because we need to generate a new handle from `_get_handle` (potentially with compression and encoding requirements) when an external handle is passed. Currently, when a file handle is passed, no compression is applied even when it is specified (#21227).
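The need to wrap an externally supplied handle can be illustrated with `gzip.GzipFile(fileobj=...)`: honouring a compression request means layering a compressing writer over the caller's handle rather than writing plain text to it. A minimal sketch (the `write_compressed` helper is hypothetical, not pandas API):

```python
import gzip
import io

def write_compressed(buf, text, encoding="utf-8"):
    """Wrap an existing binary handle so compressed bytes are written to it.

    The outer handle (``buf``) stays open and owned by the caller; only the
    wrapping GzipFile is closed here. Illustrative helper, not pandas API.
    """
    gz = gzip.GzipFile(fileobj=buf, mode="wb")
    gz.write(text.encode(encoding))
    gz.close()  # flushes the gzip trailer; does not close ``buf``

buf = io.BytesIO()
write_compressed(buf, u"a,b\n1,2\n")
assert not buf.closed                               # caller's handle still open
assert gzip.decompress(buf.getvalue()) == b"a,b\n1,2\n"
```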
Reviewer: Aren't we leaving a file handle open then? AFAICT, with your code you would eventually close the second object, but the first would stay open, no?
Author (minggli): Closing the first handle is up to the user, via `f.close()`, no? I hope this is a good example:
https://github.com/pandas-dev/pandas/pull/21300/files#diff-a29a0a64b2800c1c30a53a178d871645
Reviewer: Hmm, OK, I see your point, but do we have test coverage for all of the branches? It seems a little strange that we open two handles to the same location when a handle is passed to this function. That feels like asking for strange behavior to pop up; I would feel more comfortable if the test coverage hit all three branches. Let me know.
Author (minggli): I see your point, though the two file handles have separate buffered I/O and state. I added tests using a file handle as input and asserted equivalence; we also have a compressed-size test with a file handle showing that this fixes #21227. We already have tests for path and buffer I/O, as those are existing use cases.
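The "two handles, separate state" point can be demonstrated directly: closing the internally opened handle flushes its own buffer but leaves the caller's handle open. A small stdlib sketch (variable names are illustrative):

```python
import os
import tempfile

# Two independent handles to the same path have separate buffers and state:
# closing the handle opened internally does not close the caller's handle.
fd, path = tempfile.mkstemp()
os.close(fd)

caller_handle = open(path, "w")   # handle supplied by the user
internal = open(path, "w")        # second handle, opened internally
internal.write("a,b\n1,2\n")
internal.close()                  # flushes and closes only this handle

still_open = not caller_handle.closed
caller_handle.close()

with open(path) as fh:
    contents = fh.read()
os.remove(path)

assert still_open                 # the user's handle was untouched
assert contents == "a,b\n1,2\n"   # the internal handle's write was flushed
```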
Follow-up comment: According to issue #21561, there is a bug when the file is `sys.stdout`: the library creates a file named `<stdout>` and writes to it.
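The `<stdout>` failure mode comes from treating a stream's `.name` attribute as a filesystem path: for `sys.stdout`, `.name` is the pseudo-name `'<stdout>'`, so code that extracts the name and reopens it creates a literal file called `<stdout>`. A hypothetical guard (the `is_reopenable_path` helper is illustrative, not pandas code):

```python
import io
import os
import tempfile

def is_reopenable_path(handle):
    """Return True only when ``handle.name`` looks like a real filesystem path.

    Hypothetical guard: streams like ``sys.stdout`` report pseudo-names such
    as '<stdout>', which must not be reopened as files.
    """
    name = getattr(handle, "name", None)
    return isinstance(name, str) and not name.startswith("<")

class FakeStdout(io.StringIO):
    # Stand-in for sys.stdout, whose ``name`` attribute is '<stdout>'.
    name = "<stdout>"

assert not is_reopenable_path(FakeStdout())   # pseudo-name, not a path
assert not is_reopenable_path(io.BytesIO())   # no ``name`` attribute at all

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path) as fh:
    assert is_reopenable_path(fh)             # real file: name is its path
os.remove(path)
```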