stack() method of SparseDataFrame should return a SparseSeries and optimize memory usage #15045

Closed
datapythonista opened this issue Jan 3, 2017 · 7 comments · Fixed by #16616
Labels
API Design Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode Sparse Sparse Data Type
Milestone

Comments

@datapythonista
Member

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan, np.nan, np.nan, np.nan, 1., np.nan],
                   'b': [1., np.nan, np.nan, 1., np.nan, np.nan, np.nan]}).to_sparse()
print(type(df))
print(type(df.stack()))

<class 'pandas.sparse.frame.SparseDataFrame'>
<class 'pandas.core.series.Series'>

Problem description

I'm trying to convert a SparseDataFrame (obtained from pd.get_dummies()) into a scipy sparse matrix by using the experimental .to_coo(). As this method accepts a MultiIndex Series instead of a DataFrame, I call the .stack() method of this SparseDataFrame.

The problem is that it looks like the .stack() method doesn't process the SparseDataFrame as sparse, and instead stacks it as dense, consuming too much memory, and returning a (dense) Series.

Returning a dense Series could be all right, as np.nan values are dropped by default by the dropna parameter, but the memory consumption is a problem.

I'm aware the whole sparse functionality is not yet mature. I also saw the function pd.sparse.frame.stack_sparse_frame, which I guess is a step towards fixing this problem (it doesn't work for me). But as I couldn't find a specific issue for this problem, I thought it was worth opening one.

Expected Output

<class 'pandas.sparse.frame.SparseDataFrame'>
<class 'pandas.sparse.series.SparseSeries'>

Output of pd.show_versions()

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.7.5-100.fc23.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.19.2+0.g825876c.dirty
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

@TomAugspurger
Contributor

I think your use-case sounds reasonable. API-wise, I guess we would add a sparse=True/False keyword to SparseDataFrame.stack. Would the default be True or False? I would think True.

Would you be interested in submitting a PR?

@TomAugspurger TomAugspurger added API Design Enhancement Sparse Sparse Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode Effort Low labels Jan 3, 2017
@TomAugspurger TomAugspurger added this to the 0.20.0 milestone Jan 3, 2017
@datapythonista
Member Author

I'll try to work on a PR.

But do you think there will be a case where a user calling stack on a SparseDataFrame wants it to be stacked as dense? The sparse keyword argument doesn't seem necessary to me. But if you think there is a reason for it, I'm happy to have it in the implementation.

@TomAugspurger
Contributor

TomAugspurger commented Jan 3, 2017

But do you think there will be a case where a user calling stack on a SparseDataFrame wants it to be stacked as dense?

I'm having trouble coming up with a case where .stack densifying would be desired. @sinhrks thoughts?

@jreback
Contributor

jreback commented Jan 3, 2017

This is a dupe of #14493, though those are essentially for unstack, so I guess we can leave this one.

@jreback jreback closed this as completed Jan 3, 2017
@jreback jreback reopened this Jan 3, 2017
@jreback jreback modified the milestones: Next Major Release, 0.20.0 Jan 3, 2017
@jreback
Contributor

jreback commented Jan 3, 2017

Note this is actually non-trivial. We are not simply doing .to_sparse() on the data, but rather constructing it directly (which is way more efficient).

@sinhrks
Member

sinhrks commented Jan 4, 2017

One concern is the case where the columns being stacked have different fill_value (the value omitted in the sparse representation). In this case, the result is not efficient in sparse representation. Should it raise?
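To make the fill_value concern concrete, here is a minimal pure-Python sketch. It is not pandas code: the helper `stored_count` and the sample columns are made up for illustration. It counts how many entries a sparse representation would have to store explicitly once a single fill value has to be chosen for the stacked result.

```python
import math


def stored_count(values, fill_value):
    """Count the entries that differ from fill_value, i.e. the ones a
    sparse representation must store explicitly (hypothetical helper)."""
    def is_fill(v):
        # NaN never compares equal to itself, so match it with isnan.
        if isinstance(fill_value, float) and math.isnan(fill_value):
            return isinstance(v, float) and math.isnan(v)
        return v == fill_value
    return sum(1 for v in values if not is_fill(v))


nan = float("nan")
col_a = [nan, nan, nan, nan, nan, 1.0, nan]  # sparse with fill_value=NaN
col_b = [0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]  # sparse with fill_value=0

# Stored separately, each column keeps a single explicit value.
separate = stored_count(col_a, nan) + stored_count(col_b, 0.0)

# Stacked row by row, the result can carry only one fill value.
stacked = [v for pair in zip(col_a, col_b) for v in pair]
stacked_nan_fill = stored_count(stacked, nan)
stacked_zero_fill = stored_count(stacked, 0.0)

print(separate, stacked_nan_fill, stacked_zero_fill)
```

Whichever fill value is picked for the stacked result, the entries of the other column stop being "fill" and have to be stored, which is the inefficiency described above.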

@jreback jreback modified the milestones: 0.21.0, Next Major Release Jul 22, 2017
@kernc
Contributor

kernc commented Aug 11, 2017

After #16616, a SparseSeries is returned, but the frame is still densified in the interim:

new_values = frame.values.ravel()
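For contrast, a rough sketch of stacking that never materializes the dense array. This is an assumption-laden illustration in plain Python, not the pandas implementation: `stack_sparse` and the `(row indices, values)` column layout are hypothetical, and all columns are assumed to share one fill value. Each column contributes only its stored entries, and a k-way merge puts them in the row-major order that stack() produces.

```python
from heapq import merge


def stack_sparse(columns):
    """Stack sparse columns without densifying (hypothetical helper).

    columns maps a column label to (sorted row indices, values), holding
    only the non-fill entries. Returns ((row, label), value) pairs in
    row-major order.
    """
    labels = list(columns)
    # One stream per column, tagged with the column position so that
    # entries in the same row keep the original column order.
    streams = [
        [(row, pos, val) for row, val in zip(*columns[label])]
        for pos, label in enumerate(labels)
    ]
    # Each stream is already sorted by row, so merging gives row-major order.
    return [((row, labels[pos]), val) for row, pos, val in merge(*streams)]


# The non-NaN entries of the frame from the code sample at the top.
columns = {
    "a": ([5], [1.0]),
    "b": ([0, 3], [1.0, 1.0]),
}
stacked = stack_sparse(columns)
print(stacked)
```

The work and memory here scale with the number of stored entries rather than with rows times columns, which is the property the interim `frame.values.ravel()` call gives up.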
