stack() method of SparseDataFrame should return a SparseSeries and optimize memory usage #15045

datapythonista · 2017-01-03T17:37:57Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan, np.nan, np.nan, np.nan, 1., np.nan],
                   'b': [1., np.nan, np.nan, 1., np.nan, np.nan, np.nan]}).to_sparse()
print(type(df))
print(type(df.stack()))

<class 'pandas.sparse.frame.SparseDataFrame'>
<class 'pandas.core.series.Series'>

Problem description

I'm trying to convert a SparseDataFrame (obtained from pd.get_dummies()) into a scipy sparse matrix using the experimental .to_coo(). Since this method accepts a Series with a MultiIndex rather than a DataFrame, I call the .stack() method of the SparseDataFrame.

The problem is that the .stack() method doesn't seem to process the SparseDataFrame as sparse; instead it stacks it as dense, consuming too much memory and returning a (dense) Series.

Returning a dense Series could be all right, as np.nan values are dropped by default via the dropna parameter, but the memory consumption is a problem.
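
To make the memory concern concrete, here is a rough back-of-the-envelope sketch using only NumPy. It assumes float64 values and one int32 position per stored value; the helper names and the one-million-row frame are made up for illustration and are not part of pandas.

```python
import numpy as np

def dense_stack_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to hold every cell, as a dense stack does before dropping NaN."""
    return n_rows * n_cols * itemsize

def sparse_stack_bytes(n_stored, itemsize=8, index_itemsize=4):
    """Bytes for the stored values plus one integer position per stored value."""
    return n_stored * (itemsize + index_itemsize)

# The frame from the code sample: 7 rows, 2 columns.
a = np.array([np.nan, np.nan, np.nan, np.nan, np.nan, 1.0, np.nan])
b = np.array([1.0, np.nan, np.nan, 1.0, np.nan, np.nan, np.nan])
n_stored = int(np.count_nonzero(~np.isnan(a)) + np.count_nonzero(~np.isnan(b)))

small_dense = dense_stack_bytes(7, 2)
small_sparse = sparse_stack_bytes(n_stored)

# Hypothetical get_dummies-like frame: one stored value per row, 1000 columns.
big_dense = dense_stack_bytes(1_000_000, 1_000)
big_sparse = sparse_stack_bytes(1_000_000)

print(small_dense, small_sparse, big_dense, big_sparse)
```

The gap is negligible for the toy frame but grows with the number of columns, which is why the intermediate dense array matters more than the type of the returned object.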

I'm aware that the sparse functionality as a whole is not yet mature. I also saw the function pd.sparse.frame.stack_sparse_frame, which I guess is a step towards fixing this problem (though it doesn't work for me). As I couldn't find an existing issue for this problem, I thought it was worth opening one.

Expected Output

<class 'pandas.sparse.frame.SparseDataFrame'>
<class 'pandas.sparse.series.SparseSeries'>

Output of `pd.show_versions()`

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.7.5-100.fc23.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8
LOCALE: en_US.UTF-8

pandas: 0.19.2+0.g825876c.dirty
nose: 1.3.7
pip: 9.0.1
setuptools: 23.0.0
Cython: 0.24
numpy: 1.11.1
scipy: 0.17.1
statsmodels: 0.6.1
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.0
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None


TomAugspurger · 2017-01-03T22:13:52Z

I think your use-case sounds reasonable. API-wise, I guess we would add a sparse=True/False keyword to SparseDataFrame.stack. Would the default be True or False? I would think True.

Would you be interested in submitting a PR?

datapythonista · 2017-01-03T22:32:19Z

I'll try to work on a PR.

But do you think there will be a case where a user calling stack on a SparseDataFrame wants it to be stacked as dense? The sparse keyword argument doesn't seem necessary to me. But if you think there is a reason for it, I'm happy to have it in the implementation.

TomAugspurger · 2017-01-03T22:38:14Z

> But do you think there will be a case where a user calling stack on a SparseDataFrame wants it to be stacked as dense?

I'm having trouble coming up with a case where .stack densifying would be desired. @sinhrks thoughts?

jreback · 2017-01-03T23:48:26Z

This is a dupe of #14493, though that one is essentially about unstack, so I guess we can leave this one open.

jreback · 2017-01-03T23:50:18Z

note this is actually non-trivial. We are not simply doing .to_sparse() on the data, rather constructing it directly (which is way more efficient).
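
As a rough illustration of what "constructing it directly" could mean, the sketch below builds the stacked (row, column, value) triplets from per-column sparse data without ever allocating the full rows × columns array. This is not the pandas implementation; the function name `stack_sparse_columns` and the input layout (one pair of stored row positions and stored values per column) are assumptions made for the example.

```python
import numpy as np

def stack_sparse_columns(columns):
    """Stack per-column sparse data into (rows, cols, values) triplets.

    `columns` is a non-empty list of (positions, values) pairs, one per
    column, where `positions` holds the row numbers of the stored values.
    The result is ordered by row, then by column, like DataFrame.stack().
    """
    rows = np.concatenate(
        [np.asarray(pos, dtype=np.int64) for pos, _ in columns])
    cols = np.concatenate(
        [np.full(len(pos), j, dtype=np.int64)
         for j, (pos, _) in enumerate(columns)])
    vals = np.concatenate(
        [np.asarray(v, dtype=np.float64) for _, v in columns])
    # lexsort uses the LAST key as the primary one: sort by row, then column.
    order = np.lexsort((cols, rows))
    return rows[order], cols[order], vals[order]

# Same sparsity pattern as the code sample, with distinct values so the
# ordering is visible: column 0 ('a') stores row 5, column 1 ('b') rows 0 and 3.
rows, cols, vals = stack_sparse_columns([
    ([5], [10.0]),
    ([0, 3], [20.0, 30.0]),
])
print(rows, cols, vals)
```

Work and memory here scale with the number of stored values rather than with the number of cells, which is the property the issue asks for.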

sinhrks · 2017-01-04T01:53:39Z

One concern is the case where the columns being stacked have different fill_value settings (fill_value being the value omitted in the sparse representation). In that case the result is not efficient as a sparse representation. Should we raise?
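
A small NumPy-only sketch of this concern, with made-up data: one column is sparse with respect to NaN and the other with respect to 0, so whichever single fill_value the stacked result uses, most entries of the other column have to be stored explicitly. The helper `stored_count` is hypothetical and only counts entries; it does not model pandas internals.

```python
import numpy as np

def stored_count(values, fill_value):
    """Number of entries that differ from fill_value and must be stored."""
    values = np.asarray(values, dtype=np.float64)
    if np.isnan(fill_value):
        return int(np.count_nonzero(~np.isnan(values)))
    # NaN != fill_value is True, so NaN entries count as stored here.
    return int(np.count_nonzero(values != fill_value))

a = np.array([np.nan, np.nan, np.nan, 1.0, np.nan, np.nan])  # fill_value NaN
b = np.array([0.0, 0.0, 2.0, 0.0, 0.0, 0.0])                 # fill_value 0

# Each column on its own stores a single value.
separate = stored_count(a, np.nan) + stored_count(b, 0.0)

# Interleave the columns row by row, as stack() orders them
# (dense here only because the example is tiny).
stacked = np.column_stack([a, b]).ravel()
with_nan_fill = stored_count(stacked, np.nan)
with_zero_fill = stored_count(stacked, 0.0)

print(separate, with_nan_fill, with_zero_fill)
```

In this example 2 stored values become 7 out of 12 under either choice, so the stacked result is sparse in name only.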

kernc · 2017-08-11T13:39:49Z

After #16616, a SparseSeries is returned, but the frame is still densified in the interim:

pandas/pandas/core/reshape/reshape.py

Line 548 in 7930202

new_values = frame.values.ravel()

TomAugspurger added API Design Enhancement Sparse Sparse Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode Effort Low labels Jan 3, 2017

TomAugspurger added this to the 0.20.0 milestone Jan 3, 2017

jreback closed this as completed Jan 3, 2017

jreback reopened this Jan 3, 2017

jreback modified the milestones: Next Major Release, 0.20.0 Jan 3, 2017

jreback added Difficulty Advanced and removed Difficulty Novice labels Jan 3, 2017

kernc mentioned this issue Jun 6, 2017

BUG: Fix/test SparseSeries/SparseDataFrame stack/unstack #16616

Merged

4 tasks

jreback modified the milestones: 0.21.0, Next Major Release Jul 22, 2017

jreback closed this as completed in #16616 Sep 26, 2017
