PERF: Sparse Series to scipy COO sparse matrix #42925

TLouf · 2021-08-07T17:36:25Z

closes ENH: sparse_series_to_coo performance #42880
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

I made the test more thorough: it previously checked only the output's type, not that the output matrix was actually the expected one.
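To illustrate the difference, here is a minimal sketch of a value-level check. It is not the test added in this PR; the index labels, the level names `row` and `col`, and the values are made up for the example.

```python
import numpy as np
import pandas as pd
import scipy.sparse

# Small sparse Series on a two-level MultiIndex (one row level, one column level).
# Labels and values are illustrative only.
index = pd.MultiIndex.from_tuples(
    [("b", "y"), ("a", "x"), ("b", "x")], names=["row", "col"]
)
ss = pd.Series([1.0, 2.0, 3.0], index=index).astype("Sparse[float64]")

A, rows, cols = ss.sparse.to_coo(
    row_levels=["row"], column_levels=["col"], sort_labels=True
)

# A type-only check passes even if values land in the wrong cells.
assert isinstance(A, scipy.sparse.coo_matrix)

# Checking the dense content catches misplaced values:
# ("a", "x") -> [0, 0], ("b", "x") -> [1, 0], ("b", "y") -> [1, 1].
expected = np.array([[2.0, 0.0], [3.0, 1.0]])
np.testing.assert_array_equal(A.toarray(), expected)
assert list(rows) == ["a", "b"]
assert list(cols) == ["x", "y"]
```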

Here is the output from the asv benchmark:

       before           after         ratio
     [e045034e]       [042dbbc3]
     <master>         <sseries-to-coo-perf>
-      12.6±0.6ms       2.35±0.1ms     0.19  sparse.ToCoo.time_sparse_series_to_coo_single_level(False)
-      19.9±0.6ms      3.68±0.05ms     0.19  sparse.ToCoo.time_sparse_series_to_coo(True)
-      19.6±0.4ms      3.51±0.09ms     0.18  sparse.ToCoo.time_sparse_series_to_coo(False)
-        12.5±1ms         220±10μs     0.02  sparse.ToCoo.time_sparse_series_to_coo_single_level(True)

As you can see, the performance improvement is most dramatic when there is only one row level and one column level and the output labels are sorted. This is simply because the labels are already sorted in the levels attribute of an Index, so I covered this case, which should be the most common, with a simpler (and faster) implementation. For this reason I was thinking sort_labels could default to True, but that would probably require prior deprecation warnings. In any case, I can add this information as a performance tip in the docs; let me know what you think.
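A rough sketch of why that case is cheap, assuming an index whose levels are sorted (as they are when built with `MultiIndex.from_tuples` on sortable labels). This is not the code from the PR; it only shows that, for a single level, the stored codes already match what a sorted factorization would produce.

```python
import pandas as pd

# Illustrative index; the labels are made up for the example.
index = pd.MultiIndex.from_tuples(
    [("b", "y"), ("a", "x"), ("b", "x")], names=["row", "col"]
)

# Each level holds its unique labels, sorted for an index built this way.
row_labels = index.levels[0]
# The codes are integer positions into that level.
row_coords = index.codes[0]

# General path: factorize the level values and sort the uniques.
codes, uniques = pd.factorize(index.get_level_values(0), sort=True)

# For one level with sorted output, both routes agree, so the
# factorization can be skipped and the codes used as coordinates.
assert list(codes) == list(row_coords)
assert list(uniques) == list(row_labels)
```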

The performance is improved overall, and most dramatically for a two-level MultiIndex

jreback

can you add a whatsnew note for 1.4 perf section. will have to look again

TLouf · 2021-08-31T08:18:00Z

@jreback I think everything's ok now, the failing checks seem unrelated.

jreback

do we have an asv for sort_labels=False but row_levels/column_levels a single level? e.g. the case you mention in the doc-string?

jreback · 2021-08-31T23:07:27Z

pandas/core/arrays/sparse/scipy_sparse.py

+    return ax_coords, ax_labels
+
+
+def _to_ijv(


where is this called? ideally prefer to return a dataclass / namedtuple if possible

The only place it's called is in sparse_series_to_coo https://github.com/pandas-dev/pandas/pull/42925/files#diff-29d84e278af1528165388e964717fd13f9cabeb155887550b9d2613579d52b65L110, and the first three returns are fed directly to scipy.sparse.coo_matrix. Do you still think it's preferable to have this return a namedtuple?

it's ok, makes the code hard to read though (but prob not worth changing)
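For reference, a sketch of what the namedtuple variant discussed in this thread could look like. The `IJV` class, its field names, and `build_ijv_example` are hypothetical and not part of pandas; the hard-coded arrays stand in for whatever `_to_ijv` computes.

```python
from typing import List, NamedTuple

import numpy as np
import scipy.sparse


class IJV(NamedTuple):
    """Hypothetical return container; field names are assumptions."""

    values: np.ndarray
    i_coords: np.ndarray
    j_coords: np.ndarray
    i_labels: List
    j_labels: List


def build_ijv_example() -> IJV:
    # Hard-coded stand-in for the result of a helper like _to_ijv.
    return IJV(
        values=np.array([2.0, 3.0, 1.0]),
        i_coords=np.array([0, 1, 1]),
        j_coords=np.array([0, 0, 1]),
        i_labels=["a", "b"],
        j_labels=["x", "y"],
    )


ijv = build_ijv_example()
# The first three fields go straight to the COO constructor;
# the label lists give the matrix shape.
matrix = scipy.sparse.coo_matrix(
    (ijv.values, (ijv.i_coords, ijv.j_coords)),
    shape=(len(ijv.i_labels), len(ijv.j_labels)),
)
```

Named fields make the call site self-describing, at the cost of one extra type for a helper with a single caller.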

TLouf · 2021-09-01T09:09:41Z

The one for sort_labels=False but row_level/column_levels a single level is the first one reported below:

       before           after         ratio
     [e045034e]       [621b1028]
     <master>         <sseries-to-coo-perf>
-      11.8±0.9ms      2.17±0.05ms     0.18  sparse.ToCoo.time_sparse_series_to_coo_single_level(False)
-        21.2±1ms       3.29±0.5ms     0.16  sparse.ToCoo.time_sparse_series_to_coo(True)
-      19.0±0.5ms       2.78±0.1ms     0.15  sparse.ToCoo.time_sparse_series_to_coo(False)
-      12.5±0.9ms         200±10μs     0.02  sparse.ToCoo.time_sparse_series_to_coo_single_level(True)

The one I mention in the docstring, which is the last one above and has better performance, is for sort_labels=True.
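For readers unfamiliar with asv, a parametrized benchmark covering both values of sort_labels for the single-level case might be sketched as below. The class name `ToCooSketch` and the data sizes are assumptions for illustration, not the benchmark shipped in this PR.

```python
import numpy as np
import pandas as pd


class ToCooSketch:
    # asv runs each time_* method once per value listed in params.
    params = [True, False]
    param_names = ["sort_labels"]

    def setup(self, sort_labels):
        # Mostly-NaN data so that the sparse Series stores only 3 values.
        values = np.full(10_000, np.nan)
        values[[0, 100, 999]] = [3.0, -1.0, 12.1]
        # Two-level MultiIndex: one row level and one column level.
        two_levels = pd.MultiIndex.from_product([range(100), range(100)])
        self.ss_two_lvl = pd.Series(values, index=two_levels).astype("Sparse")

    def time_sparse_series_to_coo_single_level(self, sort_labels):
        # Default row_levels/column_levels use level 0 and level 1.
        self.ss_two_lvl.sparse.to_coo(sort_labels=sort_labels)


# Outside asv, the same methods can be exercised by hand.
bench = ToCooSketch()
for sort_labels in ToCooSketch.params:
    bench.setup(sort_labels)
    bench.time_sparse_series_to_coo_single_level(sort_labels)
```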

jreback · 2021-09-01T17:53:38Z

@TLouf lgtm.

@pandas-dev/pandas-core if any comments.

jreback · 2021-09-05T01:45:06Z

thanks @TLouf very nice!

TLouf added 4 commits August 7, 2021 19:08

PERF: sparse_series_to_coo improvement

cb82630

The performance is improved overall, and most dramatically for a two-level MultiIndex

Extend benchmark to two-level MultiIndex case

91560ea

Test more properly the output

042dbbc

Fix benchmark

db7ebf9

jreback added the Performance (memory or execution speed performance) and Sparse (sparse data type) labels Aug 7, 2021

jreback requested changes Aug 7, 2021

TLouf added 3 commits August 8, 2021 12:17

use pd.factorize

e09d760

parametrize test over sort_labels

55a7d32

add whatsnew entry

e013e82

jreback requested changes Aug 8, 2021

TLouf added 3 commits August 12, 2021 16:21

speedup of factorize with fast_zip

621b102

add type hints

1fa16cc

add docstring and comment on simple case

a5f093d

jreback requested changes Aug 13, 2021

TLouf and others added 6 commits August 23, 2021 12:47

add params and returns in docstrings

69f8651

Merge branch 'master' into sseries-to-coo-perf

e420490

add sort_labels performance tip in docstrings

14cb2cc

Merge branch 'sseries-to-coo-perf' of https://github.com/TLouf/pandas into sseries-to-coo-perf

1c617a9

fix failing tests with future annotations import

7b2d13b

Merge branch 'master' into sseries-to-coo-perf

48bc2ec

trim trailing whitespace

2d4a496

jreback requested changes Aug 31, 2021

fix for sparse series with notna fill_value

6c0ab67

fix typo

2acfe46

jreback added this to the 1.4 milestone Sep 1, 2021

jreback approved these changes Sep 1, 2021

jreback merged commit 37bd4dc into pandas-dev:master Sep 5, 2021

TLouf deleted the sseries-to-coo-perf branch September 5, 2021 10:21

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

PERF: Sparse Series to scipy COO sparse matrix (pandas-dev#42925)

e192cb1
