
to_csv() does not work with gzip module in py3 / py2 0.23.1 (works in py2 0.19.2) #21545


Closed
kayvonr opened this issue Jun 19, 2018 · 3 comments
Labels: Duplicate Report

kayvonr commented Jun 19, 2018

Code Sample, a copy-pastable example if possible

# Python 2, pandas 0.19.2: the StringIO holds the gzip-compressed CSV of the DataFrame. It can be written to a file and read back.

~ $ ipython
Python 2.7.15 (default, May  1 2018, 16:44:37)
Type "copyright", "credits" or "license" for more information.

IPython 5.6.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(5,3)), columns=['foo', 'bar', 'spam'])

In [2]: from cStringIO import StringIO

In [3]: io = StringIO()

In [4]: import gzip

In [5]: compressor = gzip.GzipFile(fileobj=io, mode='w')

In [6]: df.to_csv(compressor)

In [7]: compressor.close()

In [8]: io.seek(0)

In [9]: with open('/tmp/py2_output.gz', 'wb') as fh:
   ...:     fh.write(io.read())
   ...:

In [10]: io.seek(0)

In [11]: io.read()
Out[11]: '\x1f\x8b\x08\x00\xb7))[\x02\xff\x15\xc8\xbb\r\x800\x0c@\xc1\xde\xb3\xbc\x02;\x96?\xe3\x84\x82\x0e%\x82\xfd%\xc4\x95\xc7\xb5\x16\xe7|x\xf7\xbc\xe5 P\xa3J\x944\xd2\xe9\x10#\x1a\xc5K\x06I\x19\xd6\xe2\xff\x85bC>\xd1t\xb4GB\x00\x00\x00'


# Python 3, pandas 0.23.1: the BytesIO holds no CSV data, only the header and trailer of an empty gzip stream (written when the GzipFile is created and closed). Same result under Python 2 with pandas 0.23.1.

(pandas_test) pandas_test $ ipython
Python 3.6.5 (default, Apr 25 2018, 14:26:36)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: df = pd.DataFrame(np.random.randint(0,100,size=(5,3)), columns=['foo', 'bar', 'spam'])

In [2]: from io import BytesIO

In [3]: io = BytesIO()

In [4]: import gzip

In [5]: compressor = gzip.GzipFile(fileobj=io, mode='w')

In [6]: df.to_csv(compressor)

In [7]: compressor.close()

In [8]: io.seek(0)
Out[8]: 0

In [9]: with open('/tmp/py3_output.gz', 'wb') as fh:
   ...:     fh.write(io.read())
   ...:

In [10]: io.seek(0)
Out[10]: 0

In [11]: io.read()
Out[11]: b'\x1f\x8b\x08\x00)*)[\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'
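As a sanity check on the Python 3 result, the bytes in Out[11] above can be decoded with the standard library alone. This is a minimal sketch (the variable names are illustrative and not part of the original session). It suggests the buffer holds a well-formed gzip stream with a zero-length payload, rather than a truncated one:

```python
import gzip
import struct

# Bytes copied from Out[11] of the Python 3 session.
py3_output = b'\x1f\x8b\x08\x00)*)[\x02\xff\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'

# With no optional header fields (flag byte 0), a gzip member is a 10-byte
# header, the deflate data, then CRC32 and ISIZE (4 bytes each, little-endian).
header, body, trailer = py3_output[:10], py3_output[10:-8], py3_output[-8:]
crc32, isize = struct.unpack('<II', trailer)

# Decompressing succeeds and yields no data at all.
payload = gzip.decompress(py3_output)
print(len(py3_output), body, isize, payload)
```

The two-byte body is the deflate encoding of an empty stream, and ISIZE (the uncompressed length) is 0, so nothing from to_csv() ever reached the compressor.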

Problem description

Under Python 2 with pandas 0.19.2 (I know it's old; we're working on getting up to date), calling to_csv() on a GzipFile leaves the correct data in the target, whether the GzipFile wraps a StringIO/BytesIO or writes to a file on disk. In later versions the buffer or file ends up with no CSV data. Is this intended, and if so, is there a recommended way to write compressed CSV to a StringIO/BytesIO?
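One possible workaround for the question above, offered as a sketch and not as a maintainer-confirmed recommendation: let to_csv() return the CSV as a string (which it does when no path or buffer is passed), then encode and compress it by hand. The DataFrame contents and variable names here are made up for illustration:

```python
import gzip
from io import BytesIO

import pandas as pd

# Small illustrative frame; the values are arbitrary.
df = pd.DataFrame({"foo": [1, 2], "bar": [3, 4], "spam": [5, 6]})

# With no path or buffer argument, to_csv() returns the CSV text as a str.
csv_text = df.to_csv()

# Compress the encoded text into an in-memory buffer.
buf = BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as compressor:
    compressor.write(csv_text.encode("utf-8"))

# Closing the GzipFile flushes the trailer but leaves the BytesIO open.
compressed = buf.getvalue()

# Round-trip check: decompressing gives back the original CSV text.
roundtrip = gzip.decompress(compressed).decode("utf-8")
print(roundtrip == csv_text)
```

This sidesteps whatever to_csv() does with file-like objects, at the cost of holding the whole CSV string in memory before compressing it.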

Expected Output

Output of pd.show_versions()

py2, pandas 0.19.2 (working) versions:
In [1]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.20.1
numpy: 1.9.3
scipy: None
statsmodels: None
xarray: None
IPython: 5.6.0
sphinx: None
patsy: None
dateutil: 2.7.0
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.2.4
pymysql: None
psycopg2: 2.7 (dt dec pq3 ext lo64)
jinja2: 2.10
boto: 2.48.0
pandas_datareader: None

~*~*~*~*~*~*~*~*~*~*

py3, pandas 0.23.1 (not working) versions:
In [1]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.1
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.9.3
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.2.8
pymysql: None
psycopg2: 2.7.4 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Liam3851 (Contributor) commented

dupe of #21471

kayvonr (Author) commented Jun 19, 2018

gah sorry @Liam3851 that did not show up somehow in my search

WillAyd added the Duplicate Report label Jun 19, 2018
WillAyd (Member) commented Jun 19, 2018

Yep, thanks for the callout @Liam3851. This should be fixed, but if not, feel free to re-open.

WillAyd closed this as completed Jun 19, 2018