-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: DataFrame.sparse accessor #25682
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: DataFrame.sparse accessor #25682
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would welcome feedback from sparse users and whether I'm missing any functionality from SparseDataFrame. But I think this covers it.
Will follow up with a deprecation for SparseDataFrame and SparseSeries later in the week.
Ratio of non-sparse points to total (dense) data points | ||
represented in the DataFrame. | ||
""" | ||
return np.mean([column.array.density |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would not taking the mean, and returning a Series instead, be more useful?
Codecov Report
@@ Coverage Diff @@
## master #25682 +/- ##
==========================================
+ Coverage 91.29% 91.29% +<.01%
==========================================
Files 173 173
Lines 52961 53013 +52
==========================================
+ Hits 48349 48398 +49
- Misses 4612 4615 +3
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #25682 +/- ##
==========================================
- Coverage 91.68% 91.68% -0.01%
==========================================
Files 174 174
Lines 50704 50747 +43
==========================================
+ Hits 46489 46528 +39
- Misses 4215 4219 +4
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. small test question & anyway to shared doc-strings?
@pytest.mark.parametrize('dtype', ['float64', 'int64']) | ||
@td.skip_if_no_scipy | ||
def test_from_spmatrix(self, format, labels, dtype): | ||
import scipy.sparse |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm shouldn't we move the scipy specific test to a new file then just pyimportorskip at the top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My preference to keep all the accessor tests in a single file / class.
@TomAugspurger if you can merge master |
Won't have time in the near term.
I'm not sure which docstrings you were suggesting to share.
I'd prefer to have all the accessors tests in a single place so I'll push
back against the suggestion to split the SciPy ones off.
…On Thu, Apr 4, 2019 at 7:56 PM Jeff Reback ***@***.***> wrote:
@TomAugspurger <https://github.com/TomAugspurger> if you can merge master
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#25682 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHItV7eEh7t5DaqfKzxSFQ7kulfKkcks5vdp8hgaJpZM4bp_Wv>
.
|
Ping, assuming CI passes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments - just looking at the implementation to familiarize myself with it
pandas/core/arrays/sparse.py
Outdated
SparseArray.from_spmatrix(data[:, i]) | ||
for i in range(data.shape[1]) | ||
] | ||
data = dict(zip(columns, sparrays)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure how often we use this construction but I assume this preclude a user from specifying a MI or anything with duplicated index entries due to hashability / uniqueness constraints of dict keys
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point.
I'd like to avoid the perf issue with passing columns=
to the DataFrame constructor... I suppose our alternative is to just set .columns
after creating the DataFrame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't aware of the perf issue - is there an open issue for that?
Yea think assigning directly would be a better approach
pandas/core/arrays/sparse.py
Outdated
from pandas import DataFrame | ||
|
||
data = {k: v.array.to_dense() | ||
for k, v in self._parent.iteritems()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iteritems() -> items()
pandas/core/arrays/sparse.py
Outdated
represented in the DataFrame. | ||
""" | ||
return np.mean([column.array.density | ||
for _, column in self._parent.iteritems()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use items()
|
||
@classmethod | ||
def from_spmatrix(cls, data, index=None, columns=None): | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am assuming you are defining this here because then we can simply deprecate SparseDataFrame as this is much simpler / direct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this is the replacement for SparseDataFrame(sp_matrix)
.
pandas/core/arrays/sparse.py
Outdated
""" | ||
Return the contents of the frame as a sparse SciPy COO matrix. | ||
|
||
.. versionadded:: 0.20.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think update to 0.25.0, because even though not new, this is defined in a new place
pandas/core/arrays/sparse.py
Outdated
return coo_matrix((datas, (rows, cols)), shape=self._parent.shape) | ||
|
||
@property | ||
def density(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you can add type annotations anywhere it is easy would be nice.
reason='NumPy-11383')), | ||
10 | ||
]) | ||
def test_from_spmatrix(self, size, format): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use @td.skip_if_no_scipy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Pushed a fix for column handling. I don’t plan to spend more time on this. If it isn’t ready, I’d suggest a doc-only deprecation of sparse series and sparse DataFrame. |
@TomAugspurger this looked fine, just a couple of questions above. (and merge master); |
@TomAugspurger just a few questions |
All green. I'll update #26137 once this is in. |
thanks @TomAugspurger |
Closes #25681