Skip to content

API: DataFrame.sparse accessor #25682

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 21 commits into from
May 14, 2019

Conversation

TomAugspurger
Copy link
Contributor

Closes #25681

@TomAugspurger TomAugspurger added the Sparse Sparse Data Type label Mar 12, 2019
@TomAugspurger TomAugspurger added this to the 0.25.0 milestone Mar 12, 2019
Copy link
Contributor Author

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would welcome feedback from sparse users and whether I'm missing any functionality from SparseDataFrame. But I think this covers it.

Will follow up with a deprecation for SparseDataFrame and SparseSeries later in the week.

Ratio of non-sparse points to total (dense) data points
represented in the DataFrame.
"""
return np.mean([column.array.density
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not taking the mean, and returning a Series instead, be more useful?

@codecov
Copy link

codecov bot commented Mar 12, 2019

Codecov Report

Merging #25682 into master will increase coverage by <.01%.
The diff coverage is 93.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25682      +/-   ##
==========================================
+ Coverage   91.29%   91.29%   +<.01%     
==========================================
  Files         173      173              
  Lines       52961    53013      +52     
==========================================
+ Hits        48349    48398      +49     
- Misses       4612     4615       +3
Flag Coverage Δ
#multiple 89.87% <93.42%> (ø) ⬆️
#single 41.72% <22.36%> (-0.02%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 96% <100%> (+0.29%) ⬆️
pandas/core/frame.py 96.79% <100%> (ø) ⬆️
pandas/core/arrays/sparse.py 92.19% <92.85%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21769e9...24f48c3. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 12, 2019

Codecov Report

Merging #25682 into master will decrease coverage by <.01%.
The diff coverage is 97.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #25682      +/-   ##
==========================================
- Coverage   91.68%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50704    50747      +43     
==========================================
+ Hits        46489    46528      +39     
- Misses       4215     4219       +4
Flag Coverage Δ
#multiple 90.19% <97.75%> (ø) ⬆️
#single 41.19% <25.84%> (-0.14%) ⬇️
Impacted Files Coverage Δ
pandas/core/sparse/frame.py 95.63% <100%> (+0.14%) ⬆️
pandas/core/frame.py 97.02% <100%> (-0.12%) ⬇️
pandas/core/arrays/sparse.py 92.7% <97.46%> (+0.39%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2123a96...f23fa52. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. small test question & anyway to shared doc-strings?

@pytest.mark.parametrize('dtype', ['float64', 'int64'])
@td.skip_if_no_scipy
def test_from_spmatrix(self, format, labels, dtype):
import scipy.sparse
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm shouldn't we move the scipy specific test to a new file then just pyimportorskip at the top?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference to keep all the accessor tests in a single file / class.

@jreback
Copy link
Contributor

jreback commented Apr 5, 2019

@TomAugspurger if you can merge master

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Apr 5, 2019 via email

@TomAugspurger
Copy link
Contributor Author

Ping, assuming CI passes.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments - just looking at the implementation to familiarize myself with it

SparseArray.from_spmatrix(data[:, i])
for i in range(data.shape[1])
]
data = dict(zip(columns, sparrays))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how often we use this construction but I assume this preclude a user from specifying a MI or anything with duplicated index entries due to hashability / uniqueness constraints of dict keys

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point.

I'd like to avoid the perf issue with passing columns= to the DataFrame constructor... I suppose our alternative is to just set .columns after creating the DataFrame?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wasn't aware of the perf issue - is there an open issue for that?

Yea think assigning directly would be a better approach

from pandas import DataFrame

data = {k: v.array.to_dense()
for k, v in self._parent.iteritems()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iteritems() -> items()

represented in the DataFrame.
"""
return np.mean([column.array.density
for _, column in self._parent.iteritems()])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use items()


@classmethod
def from_spmatrix(cls, data, index=None, columns=None):
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am assuming you are defining this here because then we can simply deprecate SparseDataFrame as this is much simpler / direct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is the replacement for SparseDataFrame(sp_matrix).

"""
Return the contents of the frame as a sparse SciPy COO matrix.

.. versionadded:: 0.20.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think update to 0.25.0, because even though not new, this is defined in a new place

return coo_matrix((datas, (rows, cols)), shape=self._parent.shape)

@property
def density(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can add type annotations anywhere it is easy would be nice.

reason='NumPy-11383')),
10
])
def test_from_spmatrix(self, size, format):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use @td.skip_if_no_scipy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@TomAugspurger
Copy link
Contributor Author

Pushed a fix for column handling.

I don’t plan to spend more time on this. If it isn’t ready, I’d suggest a doc-only deprecation of sparse series and sparse DataFrame.

@jreback
Copy link
Contributor

jreback commented May 7, 2019

@TomAugspurger this looked fine, just a couple of questions above. (and merge master);

@jreback
Copy link
Contributor

jreback commented May 12, 2019

@TomAugspurger just a few questions

@TomAugspurger
Copy link
Contributor Author

All green. I'll update #26137 once this is in.

@jreback jreback merged commit 0558a3c into pandas-dev:master May 14, 2019
@jreback
Copy link
Contributor

jreback commented May 14, 2019

thanks @TomAugspurger

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add DataFrame.sparse accessor
3 participants