Skip to content

PERF: Sparse Series to scipy COO sparse matrix #42925

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Sep 5, 2021

Conversation

TLouf
Copy link
Contributor

@TLouf TLouf commented Aug 7, 2021

I made the test more thorough, as it only checked the output's type, but not that the output matrix was actually the expected one.

Here is the output from the asv benchmark:

       before           after         ratio
     [e045034e]       [042dbbc3]
     <master>         <sseries-to-coo-perf>
-      12.6±0.6ms       2.35±0.1ms     0.19  sparse.ToCoo.time_sparse_series_to_coo_single_level(False)
-      19.9±0.6ms      3.68±0.05ms     0.19  sparse.ToCoo.time_sparse_series_to_coo(True)
-      19.6±0.4ms      3.51±0.09ms     0.18  sparse.ToCoo.time_sparse_series_to_coo(False)
-        12.5±1ms         220±10μs     0.02  sparse.ToCoo.time_sparse_series_to_coo_single_level(True)

As you can see the performance improvement is the most dramatic when there is only one row and column level, and for an output with sorted labels. This is simply due to the fact that the labels are sorted in the levels attribute of an Index, so I covered this case, which should be the most common, with a simpler (and faster) implementation. For this reason I was thinking maybe sort_labels could default to True, but that would probably require prior deprecation warnings and all. In any case I can add this information as a performance tip to the docs, let me know what you think.

@jreback jreback added Performance Memory or execution speed performance Sparse Sparse Data Type labels Aug 7, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note for 1.4 perf section. will have to look again

@TLouf
Copy link
Contributor Author

TLouf commented Aug 31, 2021

@jreback I think everything's ok now, the failing checks seem unrelated.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have an asv for sorted_labels=False but row_level/column_levels a single level? e.g. the case you mention in the doc-string?

return ax_coords, ax_labels


def _to_ijv(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this called? ideally prefer to return a dataclass / namedtuple if possible

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only place it's called is in sparse_series_to_coo https://github.com/pandas-dev/pandas/pull/42925/files#diff-29d84e278af1528165388e964717fd13f9cabeb155887550b9d2613579d52b65L110, and the first three returns are fed directly to scipy.sparse.coo_matrix. Do you still think it's preferable to have this return a namedtuple?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its ok, makes the code hard to read though (but prob not worth changing)

@TLouf
Copy link
Contributor Author

TLouf commented Sep 1, 2021

The one for sort_labels=False but row_level/column_levels a single level is the first one reported below:

       before           after         ratio
     [e045034e]       [621b1028]
     <master>         <sseries-to-coo-perf>
-      11.8±0.9ms      2.17±0.05ms     0.18  sparse.ToCoo.time_sparse_series_to_coo_single_level(False)
-        21.2±1ms       3.29±0.5ms     0.16  sparse.ToCoo.time_sparse_series_to_coo(True)
-      19.0±0.5ms       2.78±0.1ms     0.15  sparse.ToCoo.time_sparse_series_to_coo(False)
-      12.5±0.9ms         200±10μs     0.02  sparse.ToCoo.time_sparse_series_to_coo_single_level(True)

the one I mention in the docstring, which is the last one above and has better performance, is for sort_labels=True.

@jreback jreback added this to the 1.4 milestone Sep 1, 2021
@jreback
Copy link
Contributor

jreback commented Sep 1, 2021

@TLouf lgtm.

@pandas-dev/pandas-core if any comments.

@jreback jreback merged commit 37bd4dc into pandas-dev:master Sep 5, 2021
@jreback
Copy link
Contributor

jreback commented Sep 5, 2021

thanks @TLouf very nice!

@TLouf TLouf deleted the sseries-to-coo-perf branch September 5, 2021 10:21
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: sparse_series_to_coo performance
2 participants