PERF: Speed up Spearman calculation #28151
Conversation
Thanks for the PR - can you provide a code sample for the caveat mentioned? Not totally sure I follow
does the memory footprint take a hit?
Sure, if we rank first and then compute the correlation, a difference in missing patterns can cause distortions like in this example (where the rank correlation should be one):
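The kind of example being described can be sketched as follows (this is an illustration I'm supplying, not the original poster's snippet): two columns that agree in order on every row where both are observed, so their Spearman correlation should be exactly 1, yet correlating pre-computed full-column ranks over the common rows gives something less than 1.

```python
import numpy as np
import pandas as pd

# x and y are in the same order on the rows where both are observed,
# so their Spearman correlation should be exactly 1.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 3.0],
    "y": [1.0, 2.0, 2.5, 3.0],
})

# Ranking each column up front (over its own non-missing rows) ...
ranks = df.rank()
# ... then correlating those ranks over the common rows gives != 1,
# because row 2 inflates y's rank in row 3 from 3 to 4.
common = df.notna().all(axis=1)
distorted = ranks[common].corr(method="pearson").loc["x", "y"]

# Pairwise handling (drop the row first, then rank) recovers 1.
correct = df.corr(method="spearman").loc["x", "y"]
print(distorted, correct)  # distorted is ~0.98, correct is 1.0
```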
I think this is the reason why the ranking is done inside the loop, but we want to avoid this step and rely on precalculated ranks if we can.
I think it would since we're creating a whole new array of ranks. I suppose we could throw away
I guess better phrasing would have been "is the memory footprint impact big enough to care about?"
It does basically double memory usage.
Is there a good heuristic that we can use to check this?
The worst case would be when every column has a different missing pattern and the first step of precalculating is for naught; in this case you do extra up-front work. When I run that scenario on this branch I do get worse performance (4.74 seconds vs. 4.49 seconds using a 5000 by 100 array), but it's actually almost within a margin of error of the current implementation.
Took me a little bit to grasp what you meant by "missing pattern", but I think I get it now. Wouldn't it be more likely for the NA value locations to vary by column? If so we probably want to rethink the approach here.
Hmm, fair point. Since we actually know whether this is true for any pair of columns before we calculate the ranks in the nested loop, we could actually check this condition before deciding if we want to save off the full set of ranks and avoid any extra calculation. The only cost would be the additional memory. I'll try to make an update with that change soon.
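The pairwise check described here can be sketched like this (the function name is made up for illustration; the real code lives in Cython in `nancorr_spearman` and works off the same `np.isfinite` mask): full-column ranks are only safe to reuse for a pair `(i, j)` when both columns are observed on exactly the same rows.

```python
import numpy as np

def same_missing_pattern(mask, i, j):
    """mask[r, c] is True where entry (r, c) is non-missing."""
    return bool(np.array_equal(mask[:, i], mask[:, j]))

mat = np.array([[1.0, 1.0, 1.0],
                [np.nan, 2.0, np.nan],
                [3.0, 3.0, 3.0]])
mask = np.isfinite(mat)
print(same_missing_pattern(mask, 0, 2))  # True: NaN in the same row
print(same_missing_pattern(mask, 0, 1))  # False: row 1 differs
```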
@WillAyd Just pushed a commit that avoids precalculating any of the ranks. Let me know if this is getting a bit too confusing and I can try to rewrite. Edit: Also reran the benchmarks and still seeing the same speedup.
Can you add a benchmark that injects missing data into the calculation? I think it would help frame the discussion around performance a little better if that's something that needs to be accounted for
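A benchmark along the lines being requested might look like this (a hypothetical asv-style sketch; the class and method names are made up, though they follow the conventions of pandas' `asv_bench` suite, where asv times any `time_*` method after calling `setup`):

```python
import numpy as np
import pandas as pd

class CorrelationWithNans:
    def setup(self):
        rng = np.random.RandomState(42)
        arr = rng.randn(1000, 200)
        # knock out ~10% of the entries at random
        arr[rng.rand(*arr.shape) < 0.1] = np.nan
        self.df = pd.DataFrame(arr)

    def time_corr_spearman(self):
        self.df.corr(method="spearman")
```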
pandas/_libs/algos.pyx
@@ -307,26 +308,44 @@ def nancorr_spearman(const float64_t[:, :] mat, Py_ssize_t minp=1):
    result = np.empty((K, K), dtype=np.float64)
    mask = np.isfinite(mat).view(np.uint8)

    ranked_mat = np.empty((N, K), dtype=np.float64)
    cached_ranks = set({})
IIUC the whole caching thing you have here is to avoid the memory overhead of ranking columns up front right? If so I think it's a premature optimization so would probably just prefer if everything was ranked and we didn't do caching.
So this approach would accept a hit on memory by preallocating ranked_mat (even this is probably not necessary), but save time by not calculating ranks up front that we end up not reusing. Instead we wait until we need a full set of ranks for some calculation before computing and saving them.
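That lazy strategy can be sketched in plain Python as below (names are hypothetical; the real code is Cython in `pandas/_libs/algos.pyx` and handles ties and missing values, which this toy ranking does not):

```python
import numpy as np

def get_full_ranks(mat, ranked_mat, cached_cols, col):
    # Rank a column only the first time some pair needs its
    # full-column ranks, then serve the cached copy afterwards.
    if col not in cached_cols:
        order = np.argsort(mat[:, col])
        # simple 1..N ranking: smallest value gets rank 1
        # (no tie or NaN handling, unlike the real implementation)
        ranked_mat[order, col] = np.arange(1, mat.shape[0] + 1)
        cached_cols.add(col)
    return ranked_mat[:, col]

mat = np.array([[3.0, 10.0],
                [1.0, 30.0],
                [2.0, 20.0]])
ranked_mat = np.empty_like(mat)
cached = set()
print(get_full_ranks(mat, ranked_mat, cached, 0))  # [3. 1. 2.]
```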
How much of a time savings is that? In any case your PR should offer a significant speed up by not ranking in each loop so I think this might still be OK.
It was actually fairly negligible based on some local testing I did (maybe a 5% slowdown), but it's probably worthwhile to add some more cases to stat_ops.py (wide tables, possibly a worst case missing pattern for this change, etc.)
Did you see any perf difference here? Again would rather simplify and not have this if there's no major perf impact
I'm actually seeing essentially the same performance from the previous implementation (ranking everything up front) on the "wide with nans" benchmark, so maybe just revert to that approach since the code is a lot easier to understand?
do the existing benchmarks cover enough here? (e.g. enough matrix sizes)
pls add a perf note
Added a couple benchmarks, one with wide data (1000 by 500), wide data with random
Couple things that didn't make it into the summary but are interesting:
(Not directly relevant to this pull request, but Kendall's tau fails on the wide data. It's not shown here, but it fails without
Surprisingly the peak memory usage is not much worse than
You might need to skip the Kendall method in the benchmarks where it's failing to get CI to pass. Can you open up a follow-up issue to address?
Is there an easy way to skip or xfail a benchmark in
Try raising a
I was seeing the benchmark still fail after doing this. Feels like a hack, but how about returning
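For reference, asv's documented way to skip a parameter combination is to raise NotImplementedError from setup (even though, as noted above, it didn't seem to work in this case). A hypothetical sketch, with made-up class and method names:

```python
import numpy as np
import pandas as pd

class Correlation:
    # asv treats NotImplementedError raised in setup as
    # "skip this parameter combination" rather than a failure
    params = ["spearman", "kendall"]
    param_names = ["method"]

    def setup(self, method):
        if method == "kendall":
            raise NotImplementedError("kendall is too slow on wide data")
        self.df = pd.DataFrame(np.random.randn(500, 50))

    def time_corr(self, method):
        self.df.corr(method=method)
```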
@WillAyd I lowered the dimensions of the wide benchmarks to 1000 by 300 and nothing fails now; also reverted to the simpler implementation since the performance was about the same. Updated
pls decrease the size even more, 20s is too long
Updated to 200 columns, longest benchmark takes ~8 seconds
Thanks @dsaxton very nice PR. Would certainly love any follow ups or improvements you can offer!
Saves off ranks for later reuse once they're calculated during the computation of DataFrame.corr(method="spearman") so as to speed up the calculation. This PR ranks data outside the nested loop in nancorr_spearman (closes #28139). The performance benchmarks show a nice improvement. (One caveat here is that when two columns have different missing patterns we actually need to recompute ranks over the rows where both are non-missing, so in some situations this will actually perform slightly worse than the current implementation.)