SparseDataFrame.to_parquet fails #20692

jamestwebber · 2018-04-14T01:50:48Z

Code Sample, a copy-pastable example if possible

import pandas as pd # v0.22.0
import scipy.sparse # v1.0.1

rpd = pd.SparseDataFrame(scipy.sparse.random(1000, 1000), 
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
rpd.to_parquet('rpd.pq')

---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-65-1aeaae9e36a0> in <module>()
      4                          columns=list(map(str, range(1000))),
      5                          default_fill_value=0.0)
----> 6 rpd.to_parquet('rpd.pq')

...

ArrowIOError: Column 8 had 4 while previous column had 8

Problem description

SparseDataFrames and parquet should be a match made in data science heaven, because parquet should be able to compress the sparse columns well and give big space and IO savings. But the to_parquet method fails with an ArrowIOError when it gets a sparse dataframe.

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.5.0
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.2
scipy: 1.0.1
pyarrow: 0.7.1
xarray: None
IPython: 6.3.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


jreback · 2018-04-14T12:19:46Z

You should open an issue on the Arrow tracker for support. The pandas sparse format is somewhat bespoke and not likely to be supported; a more common format such as COO might be.
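As a rough illustration of the COO idea mentioned above, here is a minimal sketch of a possible workaround: flatten the sparse matrix into a dense table of (row, col, value) triplets, which is an ordinary DataFrame that a parquet writer can handle. The helper names `coo_to_frame` and `frame_to_coo` are hypothetical and not part of pandas, scipy, or pyarrow, and this is not something either project endorses in this thread.

```python
import pandas as pd
import scipy.sparse


def coo_to_frame(matrix):
    """Flatten a scipy sparse matrix into a dense (row, col, value) table.

    Hypothetical helper: only the non-zero entries are kept.
    """
    coo = scipy.sparse.coo_matrix(matrix)
    return pd.DataFrame({"row": coo.row, "col": coo.col, "value": coo.data})


def frame_to_coo(frame, shape):
    """Rebuild the sparse matrix from the (row, col, value) table.

    The original shape is not recoverable from the triplets alone,
    so the caller has to supply it.
    """
    return scipy.sparse.coo_matrix(
        (frame["value"].values, (frame["row"].values, frame["col"].values)),
        shape=shape,
    )


# Same kind of input as in the report, with a fixed seed for repeatability.
matrix = scipy.sparse.random(1000, 1000, density=0.01, random_state=0)
triplets = coo_to_frame(matrix)                  # ordinary dense DataFrame
restored = frame_to_coo(triplets, matrix.shape)  # back to a sparse matrix
```

With a parquet engine installed, `triplets.to_parquet('rpd.pq')` should then work like any other dense DataFrame write. The matrix shape and column labels would need to be stored separately.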

jreback added the Sparse (Sparse Data Type) and IO Parquet (parquet, feather) labels Apr 14, 2018

jreback added this to the No action milestone Apr 14, 2018

jreback closed this as completed Apr 14, 2018

jamestwebber mentioned this issue Apr 14, 2018

Support for sparse dataframes apache/arrow#1894

Closed

cornhundred mentioned this issue May 13, 2019

SparseDataFrame.to_parquet fails with new error #26378

Closed
