SparseDataFrame.to_parquet fails with new error #26378


Closed
cornhundred opened this issue May 13, 2019 · 11 comments

@cornhundred

Code Sample

import pandas as pd # v0.24.2
import scipy.sparse # v1.1.0

df = pd.SparseDataFrame(scipy.sparse.random(1000, 1000), 
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
df.to_parquet('rpd.pq', engine='pyarrow')

Gives the error

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column 0 with type Sparse[float64, 0.0]')

Problem description

This error occurs when trying to save a Pandas sparse DataFrame using the to_parquet method. The error can be avoided by running df.to_dense().to_parquet(). However, this can require a lot of memory for very large sparse matrices.
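
Continuing from the code sample above, a minimal sketch of that workaround (to_dense() materializes the full dense frame in memory before writing):

import pandas as pd
import scipy.sparse

df = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                        columns=list(map(str, range(1000))),
                        default_fill_value=0.0)

# Densify first, then write; works, but allocates the full 1000 x 1000
# float64 array (~8 MB here, far more for real data).
df.to_dense().to_parquet('rpd.pq', engine='pyarrow')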

The issue was also raised in apache/arrow#1894 and #20692

Expected Output

The expected output is a parquet file on disk.

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 3.9.1
pip: 19.0.3
setuptools: 40.2.0
Cython: None
numpy: 1.16.3
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jreback
Contributor

jreback commented May 13, 2019

this is not a pandas issue; it is up to arrow whether (or, more likely, not) to support this format.

We are deprecating SparseDataFrame, but supporting SparseArray as an extension type, so this might be supported in the future.

@jreback jreback closed this as completed May 13, 2019
@cornhundred
Author

Okay, @wesm recommended making the issue here: apache/arrow#1894 (comment)

@jreback SparseDataFrame is being deprecated? So it will not be possible to have a sparse Pandas DataFrame in future versions? Or will it be possible to make one using the SparseArray extension type?

https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

@TomAugspurger
Contributor

The SparseDataFrame subclass is being deprecated. It's functionally equivalent to a DataFrame with sparse values.
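
For instance (a minimal sketch against pandas 0.24, where pd.SparseDtype is available):

import pandas as pd

# A plain DataFrame whose column holds sparse values -- no subclass involved.
dense = pd.DataFrame({'a': [0.0, 0.0, 1.0, 0.0]})
sparse = dense.astype(pd.SparseDtype('float64', fill_value=0.0))
print(sparse.dtypes)  # a    Sparse[float64, 0.0]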

@jorisvandenbossche
Member

And support for SparseArrays in to_parquet / arrow might depend on the discussion in #20612

@cornhundred
Author

cornhundred commented May 14, 2019

Thanks, @TomAugspurger @jreback @wesm. Is there an example of making a Pandas DataFrame from SparseArray values?

I'm trying this out in this Kaggle kernel using the sparr variable from the documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparsearray), but the DataFrame does not appear sparse.

link to kernel (fork to re-run) - https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray?scriptVersionId=14173944

cc @melaniedavila @manugarciaquismondo

@jorisvandenbossche
Member

We should clearly update the user guide on this (http://pandas-docs.github.io/pandas-docs-travis/user_guide/sparse.html), as that still shows the "old" way. @TomAugspurger is adding some documentation in his PR to deprecate the subclass: #26137

But basically, if you have SparseArray values, you can put them in a DataFrame by using the DataFrame constructor as normal, e.g.:

In [40]: arr = pd.SparseArray([0,0,0,1])

In [41]: arr
Out[41]: 
[0, 0, 0, 1]
Fill: 0
IntIndex
Indices: array([3], dtype=int32)

In [42]: df = pd.DataFrame({'a': arr})

In [43]: df
Out[43]: 
   a
0  0
1  0
2  0
3  1

In [44]: df.dtypes 
Out[44]: 
a    Sparse[int64, 0]
dtype: object

(what version of pandas are you using?)

Feedback on using it in a normal pandas DataFrame instead of the SparseDataFrame subclass is very welcome! (none of us are very regular users of the sparse functionality)

@jorisvandenbossche
Member

@cornhundred thanks for the notebook. From seeing the output there, I assume you are using an older version of pandas? (the SparseArray support inside DataFrame itself is only available in 0.24)

@cornhundred
Author

Thanks @jorisvandenbossche. I modified your example a bit and got it to run on Google Colab, which is running Pandas 0.24.2:

https://colab.research.google.com/gist/cornhundred/c231f02b2edbc83f466756915ffdfbab/sparsearray_to_dataframe_pandas_0-24-2.ipynb

The DataFrame made with sparse data is smaller in memory than the dense matrix. The original issue with saving the sparse DataFrame to parquet is demonstrated at the bottom of the notebook.
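
For a rough sense of the difference, here is a sketch comparing a dense and a sparse Series (assuming memory_usage(deep=True) accounts for the sparse storage):

import numpy as np
import pandas as pd

values = np.zeros(100_000)
values[::100] = 1.0  # 1,000 non-zero entries

dense = pd.Series(values)
sparse = pd.Series(pd.SparseArray(values, fill_value=0.0))

print(dense.memory_usage(deep=True))   # full float64 array, ~800 kB
print(sparse.memory_usage(deep=True))  # roughly proportional to the non-zeros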

Kaggle, however, is running Pandas 0.23.4:

https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray-0-23-4?scriptVersionId=14175226

In terms of how we are using sparse data - we start by loading a sparse matrix (of single-cell gene expression data) in Matrix Market format (MTX) using scipy.io.mmread, perform some filtering on the data, and then save back to a new Matrix Market format (directory) using scipy.io.mmwrite. The scipy read/write functionality allows us to load and save data directly in scipy sparse matrix format (coo_matrix) without having to make it dense (which would cause us to run out of RAM).
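
As a sketch of that round-trip (the file names and filtering criterion here are hypothetical):

import numpy as np
import scipy.io

mat = scipy.io.mmread('expression.mtx').tocsr()  # mmread returns a coo_matrix
counts = mat.getnnz(axis=1)                      # non-zero entries per row
filtered = mat[np.flatnonzero(counts > 10)]      # keep rows passing the filter
scipy.io.mmwrite('expression_filtered.mtx', filtered)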

We're looking into parquet since it allows you to read selected columns without loading the entire dataset (as well as predicate pushdown for row-group filtering). However, it seems that we first have to convert to dense matrices before saving to parquet (see the bottom of the Colab notebook gist). Ideally we could have the same sparse-matrix IO we have with the Matrix Market format, but with parquet.

I'm looking into pyarrow to see if it has this functionality: https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files
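
pyarrow.parquet.read_table does accept a columns argument, though the result comes back dense. A sketch against pyarrow 0.13, reusing the file and string column names from the example at the top of this issue:

import pyarrow.parquet as pq

# Read only three columns of the file written earlier.
table = pq.read_table('rpd.pq', columns=['0', '5', '10'])
df = table.to_pandas()  # note: the resulting DataFrame is dense, not sparse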

@cornhundred
Author

cornhundred commented May 17, 2019

Hi @jorisvandenbossche, it's probably a naive question, but SparseArray is one-dimensional (as far as I understand), so to make a 2D DataFrame do I have to make a bunch of Series and then combine them into a DataFrame? Are there methods (e.g. df.to_sparse and df.to_dense) that exist (or are planned) to support easy swapping between sparse and dense DataFrames (using SparseArray as an extension)?

cc @manugarciaquismondo

@jorisvandenbossche
Member

@cornhundred yes, if you have a DataFrame with sparse columns, each column is stored separately as a 1D sparse array (that was also the case with the old SparseDataFrame).

But you can convert a 2D sparse matrix into that format without needing to make a full dense array. With the currently released version, the pd.SparseDataFrame(..) constructor accepts a scipy matrix, and in the upcoming version this will be replaced with pd.DataFrame.sparse.from_spmatrix.
Going from sparse back to dense also exists, as you mentioned, with to_dense().
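
A sketch of both spellings (the second assumes the upcoming 0.25 API):

import pandas as pd
import scipy.sparse

mat = scipy.sparse.random(1000, 5, density=0.01)

# pandas 0.24 (now being deprecated):
df = pd.SparseDataFrame(mat, default_fill_value=0.0)

# pandas 0.25+ replacement:
# df = pd.DataFrame.sparse.from_spmatrix(mat)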

@cornhundred
Author

Thanks @jorisvandenbossche, that makes sense.
