stack() method of SparseDataFrame should return a SparseSeries and optimize memory usage #15045
Comments
I think your use-case sounds reasonable. API-wise, I guess we would add a Would you be interested in submitting a PR?
I'll try to work on a PR. But do you think there will be a case where a user calling stack on a SparseDataFrame wants it to be stacked as dense? The sparse keyword argument doesn't seem necessary to me. But if you think there is a reason for it, I'm happy to have it in the implementation.
I'm having trouble coming up with a case where
This is a dupe of #14493, though those are essentially for unstack, so I guess we can leave this one.
Note this is actually non-trivial. We are not simply doing
One concern is a case when stacking columns have different
After #16616, a SparseSeries is returned, but the frame is still densified in the interim: pandas/pandas/core/reshape/reshape.py Line 548 in 7930202
Code Sample, a copy-pastable example if possible
Problem description
I'm trying to convert a SparseDataFrame (obtained from pd.get_dummies()) into a scipy sparse matrix by using the experimental .to_coo(). As this method accepts a MultiIndex Series instead of a DataFrame, I call the .stack() method of this SparseDataFrame.
The problem is that it looks like the .stack() method doesn't process the SparseDataFrame as sparse, and instead stacks it as dense, consuming too much memory, and returning a (dense) Series.
Returning a dense Series could be all right, as np.nan values are dropped by default via the dropna parameter, but the memory consumption is a problem.
I'm aware the whole sparse functionality is not yet mature. I also saw the function pd.sparse.frame.stack_sparse_frame, which I guess is a step toward fixing this problem (it doesn't work for me). But as I couldn't find a specific issue for this problem, I thought it was worth opening one.
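To make the request concrete, here is a minimal pure-Python sketch of stacking sparse columns without densifying them. It is not how pandas implements stack(): the `SparseColumn` representation and the `stack_sparse` and `to_coo_triplets` helpers are hypothetical names introduced only for illustration, and missing entries are assumed to be NaN (so they are simply never stored).

```python
from typing import Dict, Hashable, List, Tuple

# Hypothetical representation: each column keeps only its stored
# (non-missing) entries as {row_label: value}. Absent rows count as NaN.
SparseColumn = Dict[Hashable, float]
StackedEntry = Tuple[Tuple[Hashable, Hashable], float]


def stack_sparse(columns: Dict[Hashable, SparseColumn]) -> List[StackedEntry]:
    """Return ((row, column), value) pairs, visiting only stored entries."""
    col_order = {name: pos for pos, name in enumerate(columns)}
    stacked = [
        ((row, name), value)
        for name, col in columns.items()
        for row, value in col.items()
    ]
    # Order by row label, then by original column position, which mirrors
    # the row-major order of a stacked result with missing values dropped.
    stacked.sort(key=lambda item: (item[0][0], col_order[item[0][1]]))
    return stacked


def to_coo_triplets(stacked: List[StackedEntry]):
    """Split stacked entries into COO-style (rows, cols, data) lists."""
    rows = [key[0] for key, _ in stacked]
    cols = [key[1] for key, _ in stacked]
    data = [value for _, value in stacked]
    return rows, cols, data


# Dummy-variable style input: one stored 1.0 per row, spread over 3 columns.
dummies = {
    "a": {0: 1.0, 3: 1.0},
    "b": {1: 1.0},
    "c": {2: 1.0, 4: 1.0},
}
stacked = stack_sparse(dummies)
rows, cols, data = to_coo_triplets(stacked)
```

The work and memory here scale with the number of stored entries (5) rather than with rows times columns (15), which is the behaviour this issue asks for from stack() on sparse data.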
Expected Output
Output of pd.show_versions()