BUG: encoding error in to_csv compression #21300
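For context on the bug itself: per the issues referenced throughout the diff (GH 21241, 21118, 21227), `to_csv` mishandled the `encoding` argument when `compression` was requested, and skipped compression entirely when given an already-open file handle. A minimal sketch of the kind of call that misbehaved (my illustration, assuming pandas 0.23.x behavior; `out.csv.gz` is a hypothetical path):

```python
import pandas as pd

df = pd.DataFrame([[123, u"你好", u"世界"]], columns=['X', 'Y', 'Z'])

# Before this PR: the requested encoding was not applied to the
# compressed output (GH 21241, 21118), and passing an open handle
# instead of a path skipped compression altogether (GH 21227).
# After the PR this writes gzip-compressed, gb2312-encoded CSV.
df.to_csv('out.csv.gz', compression='gzip', encoding='gb2312')
```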
Changes from 16 commits
pandas/io/formats/csvs.py

```diff
@@ -9,6 +9,7 @@
 import numpy as np
 
 from pandas.core.dtypes.missing import notna
+from pandas.core.dtypes.inference import is_file_like
 from pandas.core.index import Index, MultiIndex
 from pandas import compat
 from pandas.compat import (StringIO, range, zip)
```

```diff
@@ -127,14 +128,17 @@ def save(self):
         else:
             encoding = self.encoding
 
-        if hasattr(self.path_or_buf, 'write'):
+        # PR 21300 uses string buffer to receive csv writing and dump into
+        # file-like output with compression as option. GH 21241, 21118
+        f = StringIO()
+        close = True
+        if not is_file_like(self.path_or_buf):
```
Review thread on `if not is_file_like(self.path_or_buf):`

Reviewer: What types of objects would you expect to go through the particular branches here? Comments would be helpful.

Author: Sure.

Reviewer: Thanks, this helps. So while it's not explicitly required to call close on IO objects, I suppose there isn't much harm in doing so either. Is it possible to simplify the code if we got rid of the

Author: Very good point, though I feel we probably need to keep it to decide whether to write to an output file object or not.
```diff
+            path_or_buf = self.path_or_buf
+        elif hasattr(self.path_or_buf, 'name'):
```
Review thread on `elif hasattr(self.path_or_buf, 'name'):`

Reviewer: Is this actually required? If we already have a handle, aren't we just round-tripping here by converting the handle to a name and then back to a handle?

Author: Hi, I think so; it's required because we need to generate a new handle from _get_handle (potentially with compression and encoding requirements) when an external handle is passed. Currently, when a file handle is passed, no compression is done even when specified (#21227).

Reviewer: Aren't we leaving a file handle open then? To illustrate:

```python
>>> f1 = open('foo.txt')
>>> name = f1.name
>>> f2, handles = _get_handle(name, 'r')
>>> f2.close()
>>> f1.closed
False
>>> f2.closed
True
```

AFAICT with your code you would eventually close the second object, but the first would stay open, no?

Author: Closing the first one is up to the user calling f.close(), no? Hope the example below is a good one.

Reviewer: Hmm, OK, I see your point, but do we have test coverage for all of the branches? It just seems a little strange to me that we are opening two handles to the same location when a handle gets passed to this function. I feel like that's asking for some strange behavior to pop up, but I would feel more comfortable if the test coverage were there to hit all three branches. Let me know.

Author: I see your point, though the two file handles have separate buffered IO and states. I added tests using a file handle as input and asserting equivalence. We also have a compress_size test with a file handle showing these fix #21227. We already have tests for path and buffer IO, as they are existing use cases.

Reviewer: According to issue #21561 we have a bug if the file is
```diff
+            path_or_buf = self.path_or_buf.name
+        else:
             f = self.path_or_buf
             close = False
-        else:
-            f, handles = _get_handle(self.path_or_buf, self.mode,
-                                     encoding=encoding,
-                                     compression=None)
-            close = True if self.compression is None else False
 
         try:
```
Review thread on `try:`

Reviewer: Not related to this change, but ideally this should have an explicit

Author: That would be great if we can leave it to a separate PR. Don't want to overcomplicate things.
```diff
             writer_kwargs = dict(lineterminator=self.line_terminator,
```
```diff
@@ -151,18 +155,16 @@ def save(self):
             self._save()
 
         finally:
-            # GH 17778 handles compression for byte strings.
-            if not close and self.compression:
-                f.close()
-                with open(f.name, 'r') as f:
-                    data = f.read()
-                f, handles = _get_handle(f.name, self.mode,
-                                         encoding=encoding,
-                                         compression=self.compression)
-                f.write(data)
-                close = True
-            if close:
-                f.close()
-                for _fh in handles:
-                    _fh.close()
+            # GH 17778 handles zip compression for byte strings separately.
+            buf = f.getvalue()
+            if close:
+                f, handles = _get_handle(path_or_buf, self.mode,
+                                         encoding=encoding,
+                                         compression=self.compression)
+                f.write(buf)
+                f.close()
+                for _fh in handles:
+                    _fh.close()
 
     def _save_header(self):
```
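Putting the save() hunks together, the new flow is: render the CSV into a StringIO first, then, if the writer owns the output, open the real target with the requested encoding and compression and dump the buffer into it. A self-contained sketch of that flow (my simplification, not pandas code; `_open_target` is a hypothetical stand-in for the internal `_get_handle`):

```python
from io import StringIO
import gzip


def _open_target(path, mode, encoding=None, compression=None):
    # Hypothetical stand-in for pandas' internal _get_handle().
    if compression == 'gzip':
        return gzip.open(path, mode + 't', encoding=encoding)
    return open(path, mode, encoding=encoding)


def save_csv(path_or_buf, text, mode='w', encoding=None, compression=None):
    # Mirrors the branching in the patched save(): always render into a
    # string buffer first, then decide where the content ultimately goes.
    f = StringIO()
    close = True
    if not hasattr(path_or_buf, 'write'):    # a plain path
        target = path_or_buf
    elif hasattr(path_or_buf, 'name'):       # a named handle: reopen by name
        target = path_or_buf.name
    else:                                    # anonymous buffer: write through
        f = path_or_buf
        close = False

    f.write(text)  # stands in for the real _save() CSV rendering

    if close:
        out = _open_target(target, mode, encoding=encoding,
                           compression=compression)
        out.write(f.getvalue())
        out.close()
```

For example, calling it with a path plus `compression='gzip'` yields a compressed, correctly encoded file, while passing a StringIO writes straight through to the caller's buffer and leaves closing to the caller.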
pandas/tests/frame/test_to_csv.py
```diff
@@ -919,29 +919,38 @@ def test_to_csv_path_is_none(self):
         recons = pd.read_csv(StringIO(csv_str), index_col=0)
         assert_frame_equal(self.frame, recons)
 
-    def test_to_csv_compression(self, compression):
-
-        df = DataFrame([[0.123456, 0.234567, 0.567567],
-                        [12.32112, 123123.2, 321321.2]],
-                       index=['A', 'B'], columns=['X', 'Y', 'Z'])
+    @pytest.mark.parametrize('df,encoding', [
+        (DataFrame([[0.123456, 0.234567, 0.567567],
+                    [12.32112, 123123.2, 321321.2]],
+                   index=['A', 'B'], columns=['X', 'Y', 'Z']), None),
+        # GH 21241, 21118
+        (DataFrame([['abc', 'def', 'ghi']], columns=['X', 'Y', 'Z']), 'ascii'),
+        (DataFrame(5 * [[123, u"你好", u"世界"]],
+                   columns=['X', 'Y', 'Z']), 'gb2312'),
+        (DataFrame(5 * [[123, u"Γειά σου", u"Κόσμε"]],
+                   columns=['X', 'Y', 'Z']), 'cp737')
+    ])
+    def test_to_csv_compression(self, df, encoding, compression):
```
Review thread on the new parametrize block

Reviewer: Same comment as below: add the issue numbers as comments.
```diff
 
         with ensure_clean() as filename:
 
-            df.to_csv(filename, compression=compression)
+            df.to_csv(filename, compression=compression, encoding=encoding)
 
             # test the round trip - to_csv -> read_csv
-            rs = read_csv(filename, compression=compression,
-                          index_col=0)
-            assert_frame_equal(df, rs)
+            result = read_csv(filename, compression=compression,
+                              index_col=0, encoding=encoding)
+            assert_frame_equal(df, result)
 
             # explicitly make sure file is compressed
             with tm.decompress_file(filename, compression) as fh:
-                text = fh.read().decode('utf8')
+                text = fh.read().decode(encoding or 'utf8')
                 for col in df.columns:
                     assert col in text
 
             with tm.decompress_file(filename, compression) as fh:
-                assert_frame_equal(df, read_csv(fh, index_col=0))
+                assert_frame_equal(df, read_csv(fh,
+                                                index_col=0,
+                                                encoding=encoding))
 
     def test_to_csv_date_format(self):
         with ensure_clean('__tmp_to_csv_date_format__') as path:
```
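For reference, the `compression` argument here is a pytest fixture parametrized elsewhere in the pandas test suite, so each `(df, encoding)` case above runs once per compression codec. A self-contained analogue of the pattern (my sketch, not pandas test code; the fixture params and test name are assumptions):

```python
import pytest
import pandas as pd
from pandas.testing import assert_frame_equal


@pytest.fixture(params=[None, 'gzip', 'bz2'])
def compression(request):
    # Stand-in for the suite-wide compression fixture.
    return request.param


@pytest.mark.parametrize('df,encoding', [
    (pd.DataFrame([['abc', 'def', 'ghi']], columns=['X', 'Y', 'Z']), 'ascii'),
    (pd.DataFrame([[123, u"你好", u"世界"]], columns=['X', 'Y', 'Z']), 'gb2312'),
])
def test_roundtrip(tmpdir, df, encoding, compression):
    # Write with an explicit codec and encoding, then read back and
    # verify the round trip is lossless.
    path = str(tmpdir.join('out.csv'))
    df.to_csv(path, compression=compression, encoding=encoding)
    result = pd.read_csv(path, compression=compression,
                         index_col=0, encoding=encoding)
    assert_frame_equal(df, result)
```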
pandas/tests/series/test_io.py
```diff
@@ -137,29 +137,36 @@ def test_to_csv_path_is_none(self):
         csv_str = s.to_csv(path=None)
         assert isinstance(csv_str, str)
 
-    def test_to_csv_compression(self, compression):
-
-        s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
-                   name='X')
+    @pytest.mark.parametrize('s,encoding', [
+        (Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
+                name='X'), None),
+        # GH 21241, 21118
+        (Series(['abc', 'def', 'ghi'], name='X'), 'ascii'),
+        (Series(["123", u"你好", u"世界"], name=u"中文"), 'gb2312'),
+        (Series(["123", u"Γειά σου", u"Κόσμε"], name=u"Ελληνικά"), 'cp737')
+    ])
+    def test_to_csv_compression(self, s, encoding, compression):
```
Review thread on the new parametrize block

Reviewer: Can you add the relevant issues here as comments (e.g. the issues you are closing)?
```diff
 
         with ensure_clean() as filename:
 
-            s.to_csv(filename, compression=compression, header=True)
+            s.to_csv(filename, compression=compression, encoding=encoding,
+                     header=True)
 
             # test the round trip - to_csv -> read_csv
-            rs = pd.read_csv(filename, compression=compression,
-                             index_col=0, squeeze=True)
-            assert_series_equal(s, rs)
+            result = pd.read_csv(filename, compression=compression,
+                                 encoding=encoding, index_col=0, squeeze=True)
+            assert_series_equal(s, result)
 
             # explicitly ensure file was compressed
             with tm.decompress_file(filename, compression) as fh:
-                text = fh.read().decode('utf8')
+                text = fh.read().decode(encoding or 'utf8')
                 assert s.name in text
 
             with tm.decompress_file(filename, compression) as fh:
                 assert_series_equal(s, pd.read_csv(fh,
                                                    index_col=0,
-                                                   squeeze=True))
+                                                   squeeze=True,
+                                                   encoding=encoding))
 
 
 class TestSeriesIO(TestData):
```
Review thread

Reviewer: Is it possible to combine these first 2 cases? (Just `or` the condition.)

Author: Not exactly the same, but made it more compact.
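"These first 2 cases" appears to refer to the `not is_file_like(...)` and `hasattr(..., 'name')` branches of the new save() logic. One way they could be compacted by or-ing the condition, as suggested (my sketch only; the form the author actually landed on is not shown on this page):

```python
if not is_file_like(self.path_or_buf) or hasattr(self.path_or_buf, 'name'):
    # Either a plain path, or a named handle we will reopen by name:
    # strings have no .name attribute, so getattr falls back to the path.
    path_or_buf = getattr(self.path_or_buf, 'name', self.path_or_buf)
else:
    f = self.path_or_buf
    close = False
```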