BUG: encoding error in to_csv compression #21300


Merged: 19 commits, Jun 5, 2018
Changes from 16 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.1.txt
@@ -92,6 +92,7 @@ I/O

- Bug in IO methods specifying ``compression='zip'`` which produced uncompressed zip archives (:issue:`17778`, :issue:`21144`)
- Bug in :meth:`DataFrame.to_stata` which prevented exporting DataFrames to buffers and most file-like objects (:issue:`21041`)
- Bug in :meth:`DataFrame.to_csv` and :meth:`Series.to_csv` which caused an encoding error when both compression and encoding were specified (:issue:`21241`, :issue:`21118`)
-

Plotting
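The whatsnew entry above describes the fixed round trip: encoding and compression can now be applied together. A minimal stdlib-only sketch of that round trip (not pandas code; the file name is made up for illustration):

```python
import gzip

# Stdlib-only sketch of the round trip the fix enables: CSV text is
# encoded with a non-default codec and compressed in a single pass.
rows = "X,Y,Z\n123,\u4f60\u597d,\u4e16\u754c\n"

with gzip.open('demo.csv.gz', 'wt', encoding='gb2312') as f:
    f.write(rows)

with gzip.open('demo.csv.gz', 'rt', encoding='gb2312') as f:
    assert f.read() == rows  # decodes cleanly with the same codec
```

Before the fix, routing the text through the compression path could re-encode it with the wrong codec and raise an encoding error.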
32 changes: 17 additions & 15 deletions pandas/io/formats/csvs.py
@@ -9,6 +9,7 @@
import numpy as np

from pandas.core.dtypes.missing import notna
from pandas.core.dtypes.inference import is_file_like
from pandas.core.index import Index, MultiIndex
from pandas import compat
from pandas.compat import (StringIO, range, zip)
@@ -127,14 +128,17 @@ def save(self):
else:
encoding = self.encoding

if hasattr(self.path_or_buf, 'write'):
# PR 21300 uses string buffer to receive csv writing and dump into
# file-like output with compression as option. GH 21241, 21118
f = StringIO()
close = True
if not is_file_like(self.path_or_buf):
Contributor:

Is it possible to combine these first two cases? (just ``or`` the conditions)

Contributor Author:

Not exactly the same, but I made it more compact.

Member:

What types of objects would you expect to go through the particular branches here? Comments would be helpful.

Contributor Author:

Sure.

Member:

Thanks, this helps. So while calling close isn't explicitly required for IO objects, I suppose there isn't much harm in doing so either.

Is it possible to simplify the code if we got rid of the close variable altogether and just lived with it being called for IO objects?

Contributor Author (@minggli, Jun 4, 2018):

Very good point, though I feel we probably need to keep it to decide whether or not to write out to a file object.

path_or_buf = self.path_or_buf
elif hasattr(self.path_or_buf, 'name'):
Member:

Is this actually required? If we already have a handle, aren't we just round-tripping here anyway by converting the handle to a name and then back to a handle?

Contributor Author (@minggli, Jun 4, 2018):

Hi, I think it's required, because we need to generate a new handle from _get_handle (potentially with compression and encoding requirements) when an external handle is passed. Currently, when a file handle is passed, no compression is applied even when specified (#21227).

Member (@WillAyd, Jun 4, 2018):

Aren't we leaving a file handle open then? To illustrate:

>>> f1 = open('foo.txt')
>>> name = f1.name
>>> f2, handles = _get_handle(name, 'r')
>>> f2.close()
>>> f1.closed
False
>>> f2.closed
True

AFAICT with your code you eventually would close the second object but the first would stay open, no?

Contributor Author:

Closing the first is up to the user calling f.close(), no? Hope the example below is a good one.

https://github.com/pandas-dev/pandas/pull/21300/files#diff-a29a0a64b2800c1c30a53a178d871645

Member:

Hmm, OK, I see your point, but do we have test coverage for all of the branches? It just seems a little strange to me that we are opening two handles to the same location when a handle gets passed to this function. I feel like that's asking for some strange behavior to pop up, but I would feel more comfortable if the test coverage was there to hit all three branches. Let me know.

Contributor Author:

I see your point, though the two file handles have separate buffered IO and states. I added tests using a file handle as input and asserting equivalence; we also have a compress_size test with a file handle showing that they fix #21227.

We already have tests for path and buffer IO, as those are existing use cases.

Contributor:

According to issue #21561, we have a bug if the file is sys.stdout: the library creates a file named <stdout> and writes stuff into it.
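To illustrate the reviewer's point, here is a stdlib sketch; NamedBuffer is a hypothetical stand-in for sys.stdout, whose .name attribute is the literal string '<stdout>' rather than a filesystem path:

```python
import io
import os

# Hypothetical stand-in for sys.stdout: a writable stream whose .name
# attribute is the literal string '<stdout>', not a real path.
class NamedBuffer(io.StringIO):
    name = '<stdout>'

buf = NamedBuffer()

# A code path that re-opens the handle by its .name creates an actual
# file called '<stdout>' instead of writing to the original stream:
with open(buf.name, 'w') as f:
    f.write('x,y\n1,2\n')

assert os.path.exists('<stdout>')  # the bug: a literal file appears
assert buf.getvalue() == ''        # the original stream received nothing
os.remove('<stdout>')
```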

path_or_buf = self.path_or_buf.name
else:
f = self.path_or_buf
close = False
else:
f, handles = _get_handle(self.path_or_buf, self.mode,
encoding=encoding,
compression=None)
close = True if self.compression is None else False

try:
Member:

Not related to this change, but ideally this should have an explicit except here (can do in a separate PR).

Contributor Author:

It would be great if we can leave that to a separate PR; I don't want to overcomplicate things.

writer_kwargs = dict(lineterminator=self.line_terminator,
@@ -151,18 +155,16 @@
self._save()

finally:
# GH 17778 handles compression for byte strings.
if not close and self.compression:
f.close()
with open(f.name, 'r') as f:
data = f.read()
f, handles = _get_handle(f.name, self.mode,
# GH 17778 handles zip compression for byte strings separately.
buf = f.getvalue()
if close:
f, handles = _get_handle(path_or_buf, self.mode,
encoding=encoding,
compression=self.compression)
f.write(data)
close = True
if close:
f.write(buf)
f.close()
for _fh in handles:
_fh.close()

def _save_header(self):

31 changes: 20 additions & 11 deletions pandas/tests/frame/test_to_csv.py
@@ -919,29 +919,38 @@ def test_to_csv_path_is_none(self):
recons = pd.read_csv(StringIO(csv_str), index_col=0)
assert_frame_equal(self.frame, recons)

def test_to_csv_compression(self, compression):

df = DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
index=['A', 'B'], columns=['X', 'Y', 'Z'])
@pytest.mark.parametrize('df,encoding', [
(DataFrame([[0.123456, 0.234567, 0.567567],
[12.32112, 123123.2, 321321.2]],
index=['A', 'B'], columns=['X', 'Y', 'Z']), None),
# GH 21241, 21118
(DataFrame([['abc', 'def', 'ghi']], columns=['X', 'Y', 'Z']), 'ascii'),
(DataFrame(5 * [[123, u"你好", u"世界"]],
columns=['X', 'Y', 'Z']), 'gb2312'),
(DataFrame(5 * [[123, u"Γειά σου", u"Κόσμε"]],
columns=['X', 'Y', 'Z']), 'cp737')
])
def test_to_csv_compression(self, df, encoding, compression):

Contributor:

Same comment as below: add the issue numbers as comments.

with ensure_clean() as filename:

df.to_csv(filename, compression=compression)
df.to_csv(filename, compression=compression, encoding=encoding)

# test the round trip - to_csv -> read_csv
rs = read_csv(filename, compression=compression,
index_col=0)
assert_frame_equal(df, rs)
result = read_csv(filename, compression=compression,
index_col=0, encoding=encoding)
assert_frame_equal(df, result)

# explicitly make sure file is compressed
with tm.decompress_file(filename, compression) as fh:
text = fh.read().decode('utf8')
text = fh.read().decode(encoding or 'utf8')
for col in df.columns:
assert col in text

with tm.decompress_file(filename, compression) as fh:
assert_frame_equal(df, read_csv(fh, index_col=0))
assert_frame_equal(df, read_csv(fh,
index_col=0,
encoding=encoding))

def test_to_csv_date_format(self):
with ensure_clean('__tmp_to_csv_date_format__') as path:
27 changes: 17 additions & 10 deletions pandas/tests/series/test_io.py
@@ -137,29 +137,36 @@ def test_to_csv_path_is_none(self):
csv_str = s.to_csv(path=None)
assert isinstance(csv_str, str)

def test_to_csv_compression(self, compression):

s = Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
name='X')
@pytest.mark.parametrize('s,encoding', [
(Series([0.123456, 0.234567, 0.567567], index=['A', 'B', 'C'],
name='X'), None),
# GH 21241, 21118
(Series(['abc', 'def', 'ghi'], name='X'), 'ascii'),
(Series(["123", u"你好", u"世界"], name=u"中文"), 'gb2312'),
(Series(["123", u"Γειά σου", u"Κόσμε"], name=u"Ελληνικά"), 'cp737')
])
def test_to_csv_compression(self, s, encoding, compression):

Contributor:

Can you add the relevant issues here as comments (e.g. the issues you are closing)?

with ensure_clean() as filename:

s.to_csv(filename, compression=compression, header=True)
s.to_csv(filename, compression=compression, encoding=encoding,
header=True)

# test the round trip - to_csv -> read_csv
rs = pd.read_csv(filename, compression=compression,
index_col=0, squeeze=True)
assert_series_equal(s, rs)
result = pd.read_csv(filename, compression=compression,
encoding=encoding, index_col=0, squeeze=True)
assert_series_equal(s, result)

# explicitly ensure file was compressed
with tm.decompress_file(filename, compression) as fh:
text = fh.read().decode('utf8')
text = fh.read().decode(encoding or 'utf8')
assert s.name in text

with tm.decompress_file(filename, compression) as fh:
assert_series_equal(s, pd.read_csv(fh,
index_col=0,
squeeze=True))
squeeze=True,
encoding=encoding))


class TestSeriesIO(TestData):
5 changes: 3 additions & 2 deletions pandas/tests/test_common.py
@@ -252,12 +252,13 @@ def test_compression_size_fh(obj, method, compression_only):
with tm.ensure_clean() as filename:
with open(filename, 'w') as fh:
getattr(obj, method)(fh, compression=compression_only)
# GH 17778
assert fh.closed
assert not fh.closed
assert fh.closed
compressed = os.path.getsize(filename)
with tm.ensure_clean() as filename:
with open(filename, 'w') as fh:
getattr(obj, method)(fh, compression=None)
assert not fh.closed
assert fh.closed
uncompressed = os.path.getsize(filename)
assert uncompressed > compressed
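The size comparison at the end of the test above is the key assertion that compression actually happened. The same check in a stdlib-only sketch (file names are made up; a repetitive CSV body shrinks noticeably under gzip):

```python
import gzip
import os

# Repetitive CSV content compresses well, so the compressed file
# should be measurably smaller than the raw one.
text = ('X,Y,Z\n' + '1.0,2.0,3.0\n' * 200).encode('utf-8')

with open('raw.csv', 'wb') as f:
    f.write(text)
with gzip.open('packed_demo.csv.gz', 'wb') as f:
    f.write(text)

uncompressed = os.path.getsize('raw.csv')
compressed = os.path.getsize('packed_demo.csv.gz')
assert uncompressed > compressed
```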