PERF: optimize DataFrame.sparse.from_spmatrix performance #32825

rth · 2020-03-19T11:59:13Z

This optimizes DataFrame.sparse.from_spmatrix performance, using an approach proposed by @jorisvandenbossche in #32196 (comment)

Bulds on top of #32821 and adds another ~4.5x speed up in addition to that PR. Benchmarks for run time (in seconds) are done by running pd.DataFrame.sparse.from_spmatrix on a random sparse CSR array of given n_samples, n_features with a density=0.01:

label                     PR   master Speed-up master/PR                                                                               
n_samples n_features                                    
100       100000      2.3624  10.1247              4.29x
10000     10000       0.2391   1.1037              4.62x
100000    100         0.0031   0.0134              4.32x

with the benchmarking code below,

import pandas as pd
import numpy as np
import scipy.sparse

from neurtu import timeit, delayed


def bench_cases():
    for n_samples, n_features in [(100, 100000), (10000, 10000), (100000, 100)]:
        X = scipy.sparse.rand(
            n_samples, n_features, random_state=0, density=0.01, format="csr"
        )
        tags = {"n_samples": n_samples, "n_features": n_features, "label": "PR"}

        yield delayed(pd.DataFrame.sparse.from_spmatrix, tags=tags.copy())(X)


res = timeit(bench_cases())

res = (
    res.reset_index()
    .set_index(["n_samples", "n_features", "label"])["wall_time"]
    .unstack(-1)[["PR"]]
    .round(4)
)
# res["master"] = np.array([10.1247, 1.1037, 0.0134])

# res["Speed-up master/PR"] = (res["master"] / res["PR"]).round(2).astype("str") + "x"
print(res)

Closes #32196 although further optimization might be possible. Around 90% of remaining run time happens in DataFrame._from_arrays which goes deeper into pandas internals. Maybe some checks could be disabled there, but that looks less straightforward.

rth · 2020-03-19T12:06:29Z

pandas/core/arrays/sparse/accessor.py

+            sl = slice(indptr[i], indptr[i + 1])
+            idx = IntIndex(n_rows, indices[sl], check_integrity=False)
+            arr = SparseArray._simple_new(data[sl], idx, dtype)
+            arrays.append(arr)


FWIW, also tried with a generator here to avoid pre-allocating all the arrays, but it doesn't really matter. Most of the remaining run time is in DataFrame._from_arrays.

rth · 2020-03-19T12:08:14Z

pandas/core/arrays/sparse/accessor.py

-        result.columns = columns
-        return result
+        n_rows, n_columns = data.shape
+        data.sort_indices()


It might be already done in tocsc, but that's a scipy implementation detail, and it doesn't really cost much. We need to make sure indices are sorted, since we create IntIndex with check_integrity=False that used to check for this.

Maybe add a comment about that inline?

rth · 2020-03-19T12:08:53Z

pandas/core/arrays/sparse/accessor.py

@@ -227,15 +227,24 @@ def from_spmatrix(cls, data, index=None, columns=None):
        1  0.0  1.0  0.0
        2  0.0  0.0  1.0
        """
-        from pandas import DataFrame
+        from pandas import DataFrame, SparseDtype
+        from . import IntIndex, SparseArray

        data = data.tocsc()


Could be tocsc(copy=False) for newer scipy versions, but the gain in performance is minimal anyway with respect to the total runtime.

jreback

pls
add an asv with a fixed random seed

TomAugspurger

Thanks, looks nice! The asv would go in benchmarks/sparse.py.

doc/source/whatsnew/v1.1.0.rst

Co-Authored-By: Tom Augspurger <[email protected]>

rth · 2020-03-19T13:37:27Z

pls add an asv with a fixed random seed

There is actually already an asv benchmark for this,

$ asv continuous --bench SparseDataFrameConstructor upstream/master HEAD
[...]
[100.00%] ··· sparse.SparseDataFrameConstructor.time_from_scipy                                                                 117±1ms
       before           after         ratio
     [dbd7a5d3]       [508fda51]
     <master>         <opt-sparse-init>
-         117±1ms      26.8±0.07ms     0.23  sparse.SparseDataFrameConstructor.time_from_scipy

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

that shows comparable improvement to the benchmarks I have done earlier in the issue description.

TomAugspurger

Thanks for confirming that we have an existing benchmark for this.

jorisvandenbossche · 2020-03-19T13:56:09Z

Around 90% of remaining run time happens in DataFrame._from_arrays which goes deeper into pandas internals. Maybe some checks could be disabled there, but that looks less straightforward.

Yeah, it's not really optimized for the case you know you have all well-behaved arrays you just want to put in a DataFrame.
Investigating this a bit, one factor seems to be the unnecessary consolidation -> #32826

pandas/core/arrays/sparse/accessor.py

WillAyd

minor nits but otherwise lgtm

pandas/_libs/sparse.pyx

WillAyd · 2020-03-19T18:58:09Z

pandas/core/arrays/sparse/accessor.py

+        data.sort_indices()
+        indices = data.indices
+        indptr = data.indptr
+        data = data.data


Can you assign to a different variable? This in theory could cause mypy complaints (though ndarray resolving to Any at the moment probably allows it to pass)

Co-Authored-By: William Ayd <[email protected]>

rth · 2020-03-19T20:25:15Z

Thanks @WillAyd , I addressed your comments.

"Web and docs" CI job failed with the following, but I don't think it's due to this PR,

 Check ipython directive errors
0s
##[error]Process completed with exit code 1.
Run ! grep -B1 "^<<<-------------------------------------------------------------------------$" sphinx.log
  FutureWarning,
<<<-------------------------------------------------------------------------
##[error]Process completed with exit code 1.

jorisvandenbossche · 2020-03-19T20:32:48Z

It's not very visible, but the actual error is :

 >>>-------------------------------------------------------------------------
Warning in /home/runner/work/pandas/pandas/doc/source/user_guide/scale.rst at block ending on line 254
Specify :okwarning: as an option in the ipython:: block to suppress this message
----------------------------------------------------------------------------
/home/runner/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/fsspec/implementations/local.py:33: FutureWarning: The default value of auto_mkdir=True has been deprecated and will be changed to auto_mkdir=False by default in a future release.
  FutureWarning,
<<<-------------------------------------------------------------------------

so indeed not from this PR.

jorisvandenbossche · 2020-03-19T20:35:07Z

See #32832

jorisvandenbossche · 2020-03-20T09:52:13Z

pandas/core/arrays/sparse/accessor.py

+            idx = IntIndex(n_rows, indices[sl], check_integrity=False)
+            arr = SparseArray._simple_new(array_data[sl], idx, dtype)
+            arrays.append(arr)
+        return DataFrame._from_arrays(arrays, columns=columns, index=index)


This line will be able to use the verify_integrity=False from #32858 (or if this PR goes in first, I can add it in that PR)

rth · 2020-03-20T12:01:09Z

@jreback Do you have any other comments on this? I addressed your review comment.

jorisvandenbossche · 2020-03-20T20:07:01Z

@rth I merged #32858, so you can update this now to use the verify_integrity=False

jorisvandenbossche · 2020-03-20T20:09:42Z

asv_bench/benchmarks/sparse.py

@@ -45,8 +45,7 @@ def time_sparse_array(self, dense_proportion, fill_value, dtype):
 class SparseDataFrameConstructor:
    def setup(self):
        N = 1000
-        self.arr = np.arange(N)
-        self.sparse = scipy.sparse.rand(N, N, 0.005)
+        self.sparse = scipy.sparse.rand(N, N, 0.005, random_state=0)


given what we said in the other PR, you can remove this random state again (it's covered by the global setup function which will be run for every benchmark)

rth · 2020-03-20T20:21:08Z

Thanks @jorisvandenbossche . After merging in #32858 the benchmarks results are

$ asv continuous --bench SparseDataFrameConstructor upstream/master HEAD
...
[100.00%] ··· sparse.SparseDataFrameConstructor.time_from_scipy                                                               104±0.3ms
       before           after         ratio
     [3b406a32]       [e063c8a5]
     <opt-sparse-init~2^2>       <opt-sparse-init>
-       104±0.3ms       12.0±0.3ms     0.12  sparse.SparseDataFrameConstructor.time_from_scipy

(previous results in #32825 (comment))

jreback

lgtm. pls merge master and ping on green.

jreback · 2020-03-19T12:31:23Z

doc/source/whatsnew/v1.1.0.rst

@@ -224,6 +224,9 @@ Performance improvements
 - The internal index method :meth:`~Index._shallow_copy` now copies cached attributes over to the new index,
  avoiding creating these again on the new index. This can speed up many operations that depend on creating copies of
  existing indexes (:issue:`28584`, :issue:`32640`, :issue:`32669`)
+- Performance improvement when creating sparse :class:`DataFrame` from
+  ``scipy.sparse`` matrices using the :meth:`DataFrame.sparse.from_spmatrix`
+  constructor (:issue:`32196`).


add the issue number from joris PR as well

Added PRs by Joris, there have been 5 PRs on this in total.

Ideally another what's new entry should be added since PR's by @jorisvandenbossche also made initialization of extension arrays faster under some conditions, as far as I understand. Though I would rather not add it here, nor am I competent on accurately formulating it.

rth · 2020-03-22T10:11:28Z

With #32856 merged, there is a new failure for non unique column names. In particular, TestFrameAccessor::test_from_spmatrix_columns[columns2] calling,

pd.DataFrame.sparse.from_spmatrix(mat, columns=['a', 'a'])

fails with,

pandas/core/arrays/sparse/accessor.py:251: in from_spmatrix
    return DataFrame._from_arrays(
pandas/core/frame.py:1919: in _from_arrays
    mgr = arrays_to_mgr(
pandas/core/internals/construction.py:77: in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
pandas/core/internals/managers.py:1689: in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

arrays = [[0, 0.7151164908889105, 0, 0, 0.7790091888007832, 0.8726851819885113, 0.8634012065028459, 0, 0, 0.6582977581693065]
F...13, 0, 0, 0.6245750826925724, 0, 0, 0.6870139075931206]
Fill: 0
IntIndex
Indices: array([1, 2, 3, 6, 9], dtype=int32)
]
names = ['a', 'a'], axes = [['a', 'a'], RangeIndex(start=0, stop=10, step=1)]

    def form_blocks(arrays, names, axes):
        # put "leftover" items in float bucket, where else?
        # generalize?
        items_dict = defaultdict(list)
        extra_locs = []
    
        names_idx = ensure_index(names)
        if names_idx.equals(axes[0]):
            names_indexer = np.arange(len(names_idx))
        else:
>           assert names_idx.intersection(axes[0]).is_unique
E           AssertionError

any suggestions @jorisvandenbossche ?

jorisvandenbossche · 2020-03-22T10:47:28Z

Hmm, it's strange that it would be due to #32856, because it's failing inside form_blocks before the location that was changed in that PR. Maybe the verify_integrity=False causes it?

rth · 2020-03-22T10:52:25Z

Hmm, it's strange that it would be due to #32856, because it's failing inside form_blocks before the location that was changed in that PR. Maybe the verify_integrity=False causes it?

Yes, sorry, you are right. This failure was already present 2 days ago before I merged master today.

jorisvandenbossche · 2020-03-22T10:53:42Z

pandas/core/arrays/sparse/accessor.py

@@ -228,14 +228,29 @@ def from_spmatrix(cls, data, index=None, columns=None):
        2  0.0  0.0  1.0
        """
        from pandas import DataFrame
+        from pandas._libs.sparse import IntIndex

        data = data.tocsc()
        index, columns = cls._prep_index(data, index, columns)


I think the problem is that this _pre_index doesn't actually return Index objects when columns or index is not None (and passing actual Index objects is required when doing verify_integrity=False)

OK, I see. Thanks for confirming that passing duplicate columns names in an Index object is expected to work in _from_arrays. Will look into it. Thanks Joris!

That seems to be it, as doing pd.DataFrame.sparse.from_spmatrix(mat, columns=pd.Index(['a', 'a'])) instead of pd.DataFrame.sparse.from_spmatrix(mat, columns=['a', 'a']) works.

You can add in here a ensure_index call on both index and columns. (from pandas.core.indexes.api import ensure_index)

rth · 2020-03-22T16:15:24Z

Thanks @jorisvandenbossche and @jreback ! CI is green now.

jreback · 2020-03-22T20:33:51Z

thanks @rth very nice

…#32825)

jorisvandenbossche and others added 5 commits March 19, 2020 08:21

PERF: fix SparseArray._simple_new object initialization

3ffe7f2

Initial implementation

93bf825

Merge remote-tracking branch 'upstream/master' into opt-sparse-init

e21701b

Improve docstring

e095e7f

Add what's new

11afe40

rth mentioned this pull request Mar 19, 2020

PERF: fix SparseArray._simple_new object initialization #32821

Merged

simonjayhawkins added Performance Memory or execution speed performance Sparse Sparse Data Type labels Mar 19, 2020

rth commented Mar 19, 2020

View reviewed changes

jreback requested changes Mar 19, 2020

View reviewed changes

TomAugspurger reviewed Mar 19, 2020

View reviewed changes

doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved

rth and others added 3 commits March 19, 2020 14:14

Update doc/source/whatsnew/v1.1.0.rst

eda0732

Co-Authored-By: Tom Augspurger <[email protected]>

Improve what's new

40f4cd6

Add random state

508fda5

TomAugspurger approved these changes Mar 19, 2020

View reviewed changes

simonjayhawkins added this to the 1.1 milestone Mar 19, 2020

Add inline comment about sort_indices

0053817

rth mentioned this pull request Mar 19, 2020

PERF: skip non-consolidatable blocks when checking consolidation #32826

Merged

WillAyd requested changes Mar 19, 2020

View reviewed changes

pandas/core/arrays/sparse/accessor.py Outdated Show resolved Hide resolved

Use absolute import

42792a3

WillAyd requested changes Mar 19, 2020

View reviewed changes

rth and others added 2 commits March 19, 2020 20:34

Update pandas/_libs/sparse.pyx

c8f7abc

Co-Authored-By: William Ayd <[email protected]>

Rename variable

e541b0d

WillAyd approved these changes Mar 19, 2020

View reviewed changes

jorisvandenbossche mentioned this pull request Mar 20, 2020

PERF: allow to skip validation/sanitization in DataFrame._from_arrays #32858

Merged

jorisvandenbossche reviewed Mar 20, 2020

View reviewed changes

Merge remote-tracking branch 'upstream/master' into opt-sparse-init

0a0bafc

jorisvandenbossche reviewed Mar 20, 2020

View reviewed changes

rth added 2 commits March 20, 2020 21:09

Use DataFrame._from_arrays(..., verify_integrity=False)

ce6619e

checkout upstream/master -- asv_bench/benchmarks/sparse.py

e063c8a

jreback approved these changes Mar 21, 2020

View reviewed changes

rth added 2 commits March 22, 2020 10:33

Merge remote-tracking branch 'upstream/master' into opt-sparse-init

8efe3ad

Link to PR's by Joris in what's new

20c3685

jorisvandenbossche reviewed Mar 22, 2020

View reviewed changes

Use ensure_index when the columns/index is provided by the user

d750eb4

jreback merged commit d6de714 into pandas-dev:master Mar 22, 2020

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

PERF: optimize DataFrame.sparse.from_spmatrix performance (pandas-dev…

a88d24c

…#32825)

This was referenced Mar 25, 2020

Pandas wrongly read Scipy sparse matrix #29814

Closed

BUG: Fix wrong reading sparse matrix #31991

Merged

dwhswenson mentioned this pull request Jul 7, 2020

Fixes for sparse matrix in Pandas 1.0 dwhswenson/contact_map#69

Merged

Uh oh!

PERF: optimize DataFrame.sparse.from_spmatrix performance #32825

PERF: optimize DataFrame.sparse.from_spmatrix performance #32825

Uh oh!

Conversation

rth commented Mar 19, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rth Mar 19, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

TomAugspurger left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rth commented Mar 19, 2020

Uh oh!

TomAugspurger left a comment

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Mar 19, 2020

Uh oh!

Uh oh!

WillAyd left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rth commented Mar 19, 2020

Uh oh!

jorisvandenbossche commented Mar 19, 2020

Uh oh!

jorisvandenbossche commented Mar 19, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rth commented Mar 20, 2020

Uh oh!

jorisvandenbossche commented Mar 20, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rth commented Mar 20, 2020

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rth commented Mar 22, 2020

Uh oh!

jorisvandenbossche commented Mar 22, 2020

Uh oh!

rth commented Mar 22, 2020

Uh oh!

jorisvandenbossche Mar 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche Mar 22, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rth commented Mar 22, 2020

Uh oh!

jreback commented Mar 22, 2020

rth Mar 19, 2020 •

edited

Loading

jorisvandenbossche Mar 22, 2020 •

edited

Loading

jorisvandenbossche Mar 22, 2020 •

edited

Loading