to_csv() does not work with gzip module in py3 / py2 0.23.1 (works in py2 0.19.2) #21545

kayvonr · 2018-06-19T16:32:55Z

Code Sample, a copy-pastable example if possible

# py2 (0.19.2) - the stringio has the (gzip'd) dataframe as a csv. can write to file and read back.

~ $ ipython
Python 2.7.15 (default, May  1 2018, 16:44:37)
Type "copyright", "credits" or "license" for more information.

IPython 5.6.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(5,3)), columns=['foo', 'bar', 'spam'])

In [2]: from cStringIO import StringIO

In [3]: io = StringIO()

In [4]: import gzip

In [5]: compressor = gzip.GzipFile(fileobj=io, mode='w')

In [6]: df.to_csv(compressor)

In [7]: compressor.close()

In [8]: io.seek(0)

In [9]: with open('/tmp/py2_output.gz', 'wb') as fh:
   ...:     fh.write(io.read())
   ...:

In [10]: io.seek(0)

In [11]: io.read()
Out[11]: '\x1f\x8b\x08\x00\xb7))[\x02\xff\x15\xc8\xbb\r\x800\x0c@\xc1\xde\xb3\xbc\x02;\x96?\xe3\x84\x82\x0e%\x82\xfd%\xc4\x95\xc7\xb5\x16\xe7|x\xf7\xbc\xe5 P\xa3J\x944\xd2\xe9\x10#\x1a\xc5K\x06I\x19\xd6\xe2\xff\x85bC>\xd1t\xb4GB\x00\x00\x00'


# py3 - the bytesio is empty, only contains the gzip header (from initial creation of gzipfile). same in py2 0.23.1

(pandas_test) pandas_test $ ipython
Python 3.6.5 (default, Apr 25 2018, 14:26:36)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(5,3)), columns=['foo', 'bar', 'spam'])

In [2]: from io import BytesIO

In [3]: io = BytesIO()

In [4]: import gzip

In [5]: compressor = gzip.GzipFile(fileobj=io, mode='w')

In [6]: df.to_csv(compressor)

In [7]: compressor.close()

In [8]: io.seek(0)
Out[8]: 0

In [9]: with open('/tmp/py3_output.gz', 'wb') as fh:
   ...:     fh.write(io.read())
   ...:

In [10]: io.seek(0)
Out[10]: 0

In [11]: io.read()
Out[11]: b'\x1f\x8b\x08\x00)*)[\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'
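To confirm that the py3 buffer really is an empty gzip stream, not a truncated or corrupt one, the bytes from `Out[11]` can be decompressed directly. This is only a quick sanity check using the standard-library `gzip.decompress`; the variable names are illustrative.

```python
import gzip

# Bytes copied verbatim from Out[11] of the py3 session.
payload = b'\x1f\x8b\x08\x00)*)[\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

# A well-formed gzip member with no data decompresses to an empty bytes object;
# a truncated or corrupt stream would raise an exception instead.
decompressed = gzip.decompress(payload)
print(len(payload), repr(decompressed))
```

If this decompresses cleanly to an empty result, the stream is well-formed and nothing was ever written through the `GzipFile`.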

Problem description

Under Python 2 with pandas 0.19.2 (I know it's old; we're working on getting up to date), writing to a GzipFile, whether it wraps a StringIO/BytesIO or a file on disk, leaves the buffer or file containing the correct data. In later versions it just results in an empty buffer or file. Is this intended, and if so, is there a recommended way to write to a StringIO/BytesIO with compression?
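One possible workaround, sketched here and not taken from the pandas docs, is to avoid handing the `GzipFile` to `to_csv()` at all: ask `to_csv()` for the CSV text (it returns a `str` when no path is given), encode it, and write the bytes through the compressor yourself. The helper name `to_gzipped_csv_bytes` and the UTF-8 encoding are assumptions made for illustration.

```python
import gzip
from io import BytesIO

import numpy as np
import pandas as pd


def to_gzipped_csv_bytes(frame, encoding="utf-8"):
    """Hypothetical helper: return the frame as gzip-compressed CSV bytes."""
    buf = BytesIO()
    # to_csv() with no path returns the CSV as a string.
    csv_text = frame.to_csv()
    # Closing the GzipFile (via the context manager) flushes the gzip trailer.
    with gzip.GzipFile(fileobj=buf, mode="wb") as compressor:
        compressor.write(csv_text.encode(encoding))
    return buf.getvalue()


df = pd.DataFrame(np.random.randint(0, 100, size=(5, 3)),
                  columns=["foo", "bar", "spam"])
compressed = to_gzipped_csv_bytes(df)

# Round-trip check: decompressing should give back the same CSV text.
restored_text = gzip.decompress(compressed).decode("utf-8")
print(restored_text == df.to_csv())
```

This holds the whole CSV in memory as a string before compressing, so it is probably only suitable for frames that fit comfortably in RAM.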

Expected Output

Output of `pd.show_versions()`

py2, pandas 0.19.2 (working) versions:
In [1]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.20.1
numpy: 1.9.3
scipy: None
statsmodels: None
xarray: None
IPython: 5.6.0
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.2.4
pymysql: None
psycopg2: 2.7 (dt dec pq3 ext lo64)
jinja2: 2.10
boto: 2.48.0
pandas_datareader: None

~*~*~*~*~*~*~*~*~*~*

py3, pandas 0.23.1 (not working) versions:
In [1]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.1
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.9.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.8
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


Liam3851 · 2018-06-19T17:59:21Z

dupe of #21471

kayvonr · 2018-06-19T18:00:39Z

gah sorry @Liam3851 that did not show up somehow in my search

WillAyd · 2018-06-19T18:03:13Z

Yep, thanks for the callout @Liam3851. This should be fixed, but if not, feel free to re-open.

WillAyd added the Duplicate Report label Jun 19, 2018

WillAyd closed this as completed Jun 19, 2018
