-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Sparse Series to scipy COO sparse matrix #42925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The performance is improved overall, and most dramatically for a two-level MultiIndex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatsnew note for 1.4 perf section. will have to look again
…into sseries-to-coo-perf
@jreback I think everything's ok now, the failing checks seem unrelated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have an asv for sorted_labels=False but row_level/column_levels a single level? e.g. the case you mention in the doc-string?
return ax_coords, ax_labels | ||
|
||
|
||
def _to_ijv( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is this called? ideally prefer to return a dataclass / namedtuple if possible
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only place it's called is in sparse_series_to_coo
https://github.com/pandas-dev/pandas/pull/42925/files#diff-29d84e278af1528165388e964717fd13f9cabeb155887550b9d2613579d52b65L110, and the first three returns are fed directly to scipy.sparse.coo_matrix
. Do you still think it's preferable to have this return a namedtuple?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its ok, makes the code hard to read though (but prob not worth changing)
The one for sort_labels=False but row_level/column_levels a single level is the first one reported below:
the one I mention in the docstring, which is the last one above and has better performance, is for sort_labels=True. |
@TLouf lgtm. @pandas-dev/pandas-core if any comments. |
thanks @TLouf very nice! |
I made the test more thorough, as it only checked the output's type, but not that the output matrix was actually the expected one.
Here is the output from the asv benchmark:
As you can see the performance improvement is the most dramatic when there is only one row and column level, and for an output with sorted labels. This is simply due to the fact that the labels are sorted in the
levels
attribute of anIndex
, so I covered this case, which should be the most common, with a simpler (and faster) implementation. For this reason I was thinking maybesort_labels
could default toTrue
, but that would probably require prior deprecation warnings and all. In any case I can add this information as a performance tip to the docs, let me know what you think.