SparseDataFrame.to_parquet fails with new error #26378


Closed
cornhundred opened this issue May 13, 2019 · 11 comments

@cornhundred

Code Sample

import pandas as pd # v0.24.2
import scipy.sparse # v1.1.0

df = pd.SparseDataFrame(scipy.sparse.random(1000, 1000), 
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
df.to_parquet('rpd.pq', engine='pyarrow')

Gives the error

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column 0 with type Sparse[float64, 0.0]')

Problem description

This error occurs when trying to save a Pandas sparse DataFrame using the to_parquet method. The error can be avoided by running df.to_dense().to_parquet(). However, this can require a lot of memory for very large sparse matrices.
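
Continuing from the code sample above, a minimal sketch of that workaround (to_dense() materializes the full dense frame in memory before writing):

import pandas as pd
import scipy.sparse

df = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                        columns=list(map(str, range(1000))),
                        default_fill_value=0.0)

# Densify first, then write; works, but allocates the full 1000 x 1000
# float64 array (~8 MB here, far more for real data).
df.to_dense().to_parquet('rpd.pq', engine='pyarrow')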

The issue was also raised in apache/arrow#1894 and #20692

Expected Output

The expected output is a parquet file on disk.

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 3.9.1
pip: 19.0.3
setuptools: 40.2.0
Cython: None
numpy: 1.16.3
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jreback
Contributor

jreback commented May 13, 2019

this is not a pandas issue; it is up to arrow whether (or, more likely, not) to support this format.

We are deprecating SparseDataFrame, but supporting SparseArray as an extension type, so this might be supported in the future.

@jreback jreback closed this as completed May 13, 2019
@cornhundred
Author

Okay, @wesm recommended making the issue here: apache/arrow#1894 (comment)

@jreback SparseDataFrame is being deprecated? So it will not be possible to have a sparse Pandas DataFrame in future versions? Or will it be possible to make one using the SparseArray extension type?

https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html

@TomAugspurger
Contributor

The SparseDataFrame subclass is being deprecated. It's functionally equivalent to a DataFrame with sparse values.
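
For instance (a minimal sketch against pandas 0.24, where pd.SparseDtype is available):

import pandas as pd

# A plain DataFrame whose column holds sparse values -- no subclass involved.
dense = pd.DataFrame({'a': [0.0, 0.0, 1.0, 0.0]})
sparse = dense.astype(pd.SparseDtype('float64', fill_value=0.0))
print(sparse.dtypes)  # a    Sparse[float64, 0.0]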

@jorisvandenbossche
Member

And support for SparseArrays in to_parquet / arrow might depend on the discussion in #20612

@cornhundred
Author

cornhundred commented May 14, 2019

Thanks, @TomAugspurger @jreback @wesm. Is there an example of making a Pandas DataFrame from SparseArray values?

I'm trying this out in this Kaggle kernel using the sparr variable from the documentation (https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#sparsearray), but the DataFrame does not appear sparse.

link to kernel (fork to re-run) - https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray?scriptVersionId=14173944

cc @melaniedavila @manugarciaquismondo

@jorisvandenbossche
Member

We should clearly update the user guide on this (http://pandas-docs.github.io/pandas-docs-travis/user_guide/sparse.html), as that still shows the "old" way. @TomAugspurger is adding some documentation in his PR to deprecate the subclass: #26137

But basically, if you have SparseArray values, you can put them in a DataFrame by using the DataFrame constructor as normal, e.g.:

In [40]: arr = pd.SparseArray([0,0,0,1])

In [41]: arr
Out[41]: 
[0, 0, 0, 1]
Fill: 0
IntIndex
Indices: array([3], dtype=int32)

In [42]: df = pd.DataFrame({'a': arr})

In [43]: df
Out[43]: 
   a
0  0
1  0
2  0
3  1

In [44]: df.dtypes 
Out[44]: 
a    Sparse[int64, 0]
dtype: object

(what version of pandas are you using?)

Feedback on using it in a normal pandas DataFrame instead of the SparseDataFrame subclass is very welcome! (none of us are very regular users of the sparse functionality)

@jorisvandenbossche
Member

@cornhundred thanks for the notebook. From seeing the output there, I assume you are using an older version of pandas? (the SparseArray support inside DataFrame itself is only available in 0.24)

@cornhundred
Author

Thanks @jorisvandenbossche. I modified your example a bit and got it to run on Google Colab, which is running Pandas 0.24.2:

https://colab.research.google.com/gist/cornhundred/c231f02b2edbc83f466756915ffdfbab/sparsearray_to_dataframe_pandas_0-24-2.ipynb

The DataFrame made with sparse data is smaller in memory than the dense matrix. The original issue with saving the sparse DataFrame to parquet is demonstrated at the bottom of the notebook.
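
For a rough sense of the difference, here is a sketch comparing a dense and a sparse Series (assuming memory_usage(deep=True) accounts for the sparse storage):

import numpy as np
import pandas as pd

values = np.zeros(100_000)
values[::100] = 1.0  # 1,000 non-zero entries

dense = pd.Series(values)
sparse = pd.Series(pd.SparseArray(values, fill_value=0.0))

print(dense.memory_usage(deep=True))   # full float64 array, ~800 kB
print(sparse.memory_usage(deep=True))  # roughly proportional to the non-zeros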

Kaggle, however, is running Pandas 0.23.4:

https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray-0-23-4?scriptVersionId=14175226

In terms of how we are using sparse data - we start by loading a sparse matrix (of single-cell gene expression data) in Matrix Market format (MTX) using scipy.io.mmread, perform some filtering on the data, and then save back to a new Matrix Market format (directory) using scipy.io.mmwrite. The scipy read/write functionality allows us to load and save data directly in scipy sparse matrix format (coo_matrix) without having to make it dense (which would cause us to run out of RAM).
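
As a sketch of that round-trip (the file names and filtering criterion here are hypothetical):

import numpy as np
import scipy.io

mat = scipy.io.mmread('expression.mtx').tocsr()  # mmread returns a coo_matrix
counts = mat.getnnz(axis=1)                      # non-zero entries per row
filtered = mat[np.flatnonzero(counts > 10)]      # keep rows passing the filter
scipy.io.mmwrite('expression_filtered.mtx', filtered)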

We're looking into parquet since it allows you to read selected columns without loading the entire dataset (as well as predicate pushdown for row-group filtering). However, it seems that we first have to convert to dense matrices before saving to parquet (see the bottom of the Colab notebook gist). Ideally we could have the same sparse-matrix IO we have with the Matrix Market format, but with parquet.

I'm looking into pyarrow to see if it has this functionality: https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files
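
pyarrow.parquet.read_table does accept a columns argument, though the result comes back dense. A sketch against pyarrow 0.13, reusing the file and string column names from the example at the top of this issue:

import pyarrow.parquet as pq

# Read only three columns of the file written earlier.
table = pq.read_table('rpd.pq', columns=['0', '5', '10'])
df = table.to_pandas()  # note: the resulting DataFrame is dense, not sparse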

@cornhundred
Author

cornhundred commented May 17, 2019

Hi @jorisvandenbossche, it's probably a naive question, but SparseArray is one-dimensional (as far as I understand), so to make a 2D DataFrame do I have to make a bunch of Series and then combine them into a DataFrame? Are there methods (e.g. df.to_sparse and df.to_dense) that exist (or are planned) to support easy swapping between sparse and dense DataFrames (using SparseArray as an extension)?

cc @manugarciaquismondo

@jorisvandenbossche
Member

@cornhundred yes, if you have a DataFrame with sparse columns, each column is stored separately as a 1D sparse array (that was also the case with the old SparseDataFrame).

But you can convert a 2D sparse matrix into that format without needing to make a full dense array. With the currently released version, the pd.SparseDataFrame(..) constructor accepts a scipy matrix, and in the upcoming version this will be replaced with pd.DataFrame.sparse.from_spmatrix.
Going from sparse back to dense also exists, as you mentioned, with to_dense().
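
A sketch of both spellings (the second assumes the upcoming 0.25 API):

import pandas as pd
import scipy.sparse

mat = scipy.sparse.random(1000, 5, density=0.01)

# pandas 0.24 (now being deprecated):
df = pd.SparseDataFrame(mat, default_fill_value=0.0)

# pandas 0.25+ replacement:
# df = pd.DataFrame.sparse.from_spmatrix(mat)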

@cornhundred
Author

Thanks @jorisvandenbossche, that makes sense.
