SparseDataFrame.to_parquet fails with new error #26378
Comments
This is not a pandas issue; it is up to arrow whether (or, more likely, not) to support this format. We are deprecating SparseDataFrame, but supporting SparseArray as an extension type, so this might be supported in the future.
Okay, @wesm recommended raising the issue here: apache/arrow#1894 (comment). @jreback, SparseDataFrame is being deprecated? So will it not be possible to have a sparse pandas DataFrame in future versions? Or will it be possible to make one using the SparseArray extension type? https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
The SparseDataFrame subclass is being deprecated. It's functionally equivalent to a DataFrame with sparse values.
And support for SparseArrays in to_parquet / arrow might depend on the discussion in #20612.
Thanks, @TomAugspurger @jreback @wesm. Is there an example of making a pandas DataFrame from SparseArray values? I'm trying this out in this Kaggle kernel (fork to re-run): https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray?scriptVersionId=14173944
We should clearly update the user guide on this (http://pandas-docs.github.io/pandas-docs-travis/user_guide/sparse.html), as that still shows the "old" way. @TomAugspurger is adding some documentation in his PR to deprecate the subclass: #26137. But basically, if you have SparseArray values, you can put them in a DataFrame by using the DataFrame constructor as normal, eg:
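A minimal sketch of that constructor usage (assuming pandas >= 0.24, where sparse values can live inside a regular DataFrame):

```python
import pandas as pd

# Two sparse columns placed in an ordinary DataFrame via the normal constructor
a = pd.arrays.SparseArray([0, 0, 1, 2, 0, 0])
b = pd.arrays.SparseArray([0.0, 3.5, 0.0, 0.0, 0.0, 0.0])
df = pd.DataFrame({"A": a, "B": b})

print(df.dtypes)  # each column keeps its Sparse dtype
```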
(What version of pandas are you using?) Feedback on using it in a normal pandas DataFrame instead of the SparseDataFrame subclass is very welcome! (None of us are very regular users of the sparse functionality.)
@cornhundred thanks for the notebook. From seeing the output there, I assume you are using an older version of pandas? (The SparseArray support inside DataFrame itself is only available in 0.24.)
Thanks @jorisvandenbossche. I modified your example a bit and got it to run on Google Colab, which is running pandas 0.24.2: the DataFrame made with sparse data takes less memory than the dense matrix. The original issue with saving the sparse DataFrame to parquet is demonstrated at the bottom of the notebook. Kaggle, however, is running pandas 0.23.4: https://www.kaggle.com/cornhundred/pandas-dataframe-from-sparsearray-0-23-4?scriptVersionId=14175226 In terms of how we are using sparse data: we start by loading a sparse matrix (of single-cell gene expression data) in Matrix Market format (MTX). We're looking into parquet since it allows you to read selected columns without loading the entire dataset (as well as predicate pushdown for row-group filtering). However, it seems that we first have to convert to dense matrices before saving to parquet (see bottom of Colab notebook gist). Ideally we could have the same sparse-matrix IO we have with the Matrix Market format, but with parquet. I'm looking into pyarrow to see if it has this functionality: https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files
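The memory difference mentioned here is easy to see directly; a rough sketch (assuming a pandas version with `SparseDtype` support, 0.24+):

```python
import numpy as np
import pandas as pd

# A mostly-zero matrix stored densely vs. with sparse columns
dense = pd.DataFrame(np.zeros((10_000, 10)))
sparse = dense.astype(pd.SparseDtype("float64", fill_value=0.0))

dense_bytes = dense.memory_usage(deep=True).sum()
sparse_bytes = sparse.memory_usage(deep=True).sum()
print(dense_bytes, sparse_bytes)  # the sparse frame should be far smaller
```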
Hi @jorisvandenbossche, it's probably a naive question, but SparseArray is one-dimensional (as far as I understand), so to make a 2D DataFrame do I have to make a bunch of Series and then combine them into a DataFrame? Are there methods (e.g. …) for doing this?
@cornhundred yes, if you have a DataFrame with sparse columns, each column is stored separately as a 1D sparse array (that was the same with the SparseDataFrame before as well). But you can convert a 2D sparse matrix into that format without needing to make a full dense array. With the currently released version, the …
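For reference, later pandas releases (0.25+) added `pd.DataFrame.sparse.from_spmatrix`, which performs this conversion column by column without ever materializing a dense array; a sketch:

```python
import pandas as pd
import scipy.sparse

# A 2D scipy sparse matrix converted directly into a DataFrame of sparse columns
mat = scipy.sparse.eye(100, format="csr")
df = pd.DataFrame.sparse.from_spmatrix(mat)  # available since pandas 0.25

print(df.dtypes.iloc[0])  # each column has a Sparse dtype
```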
Thanks @jorisvandenbossche, that makes sense.
Code Sample
Gives the error
Problem description
This error occurs when trying to save a pandas sparse DataFrame using the `to_parquet` method. The error can be avoided by running `df.to_dense().to_parquet()`. However, this can require a lot of memory for very large sparse matrices. The issue was also raised in apache/arrow#1894 and #20692.
Expected Output
The expected output is a parquet file on disk.
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 3.9.1
pip: 19.0.3
setuptools: 40.2.0
Cython: None
numpy: 1.16.3
scipy: 1.1.0
pyarrow: 0.13.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.1
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.1.2
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None