PERF: Speed up Spearman calculation #28151

Merged
merged 16 commits into from
Sep 7, 2019

Conversation

@dsaxton (Member) commented Aug 26, 2019

Saves off ranks for later reuse once they're calculated during the computation of DataFrame.corr(method="spearman") so as to speed up the calculation.

This PR ranks data outside the nested loop in nancorr_spearman so as to speed up the calculation (closes #28139). The performance benchmarks show a nice improvement:

asv continuous -f 1.1 upstream/master corr-perf -b stat_ops.Correlation

       before           after         ratio
     [5d9fd7e3]       [08d8e394]
     <master>         <corr-perf>
-      87.8±0.9ms       7.81±0.2ms     0.09  stat_ops.Correlation.time_corr('spearman', False)
-        88.0±1ms       7.82±0.1ms     0.09  stat_ops.Correlation.time_corr('spearman', True)

(One caveat here is that when two columns have different missing patterns we actually need to recompute ranks over the rows where both are non-missing, so in some situations this will actually perform slightly worse than the current implementation.)
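For context, the core idea can be sketched in plain NumPy/SciPy (this is illustrative, not the actual Cython code in `nancorr_spearman`): Spearman correlation is Pearson correlation computed on ranks, so with no missing data each column only needs to be ranked once, outside the pairwise loop.

```python
import numpy as np
from scipy.stats import rankdata

def spearman_precomputed(mat):
    # Rank each column once, outside any pairwise loop
    ranked = np.apply_along_axis(rankdata, 0, mat)
    # Pearson correlation of the ranks equals the Spearman correlation
    return np.corrcoef(ranked, rowvar=False)

rng = np.random.default_rng(0)
X = rng.random((100, 5))
result = spearman_precomputed(X)  # (5, 5) correlation matrix
```

This does K rankings instead of one pair of rankings per column pair, which is where the speedup comes from.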

@WillAyd (Member) commented Aug 26, 2019

Thanks for the PR - can you provide a code sample for the caveat mentioned? Not totally sure I follow

@WillAyd WillAyd added the Performance Memory or execution speed performance label Aug 26, 2019
@jbrockmendel (Member)

does the memory footprint take a hit?

@dsaxton (Member Author) commented Aug 26, 2019

Thanks for the PR - can you provide a code sample for the caveat mentioned? Not totally sure I follow

Sure, if we rank first then compute the correlation a difference in missing patterns can cause distortions like in this example (where the rank correlation should be one):

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0],
...                    "b": [1.0, np.nan, np.nan, 2.0, 3.0]})
>>> df
     a    b
0  1.0  1.0
1  2.0  NaN
2  3.0  NaN
3  4.0  2.0
4  5.0  3.0
>>> df.corr()
          a         b
a  1.000000  0.960769
b  0.960769  1.000000
>>> df[~df.isnull().max(axis=1)]
     a    b
0  1.0  1.0
3  4.0  2.0
4  5.0  3.0
>>> df[~df.isnull().max(axis=1)].rank().corr()
     a    b
a  1.0  1.0
b  1.0  1.0

I think this is the reason why the ranking is done inside the loop, but we want to avoid this step and rely on precalculated ranks if we can.
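The fix-up step described above can be sketched as follows (illustrative Python, not the pandas internals): for a pair of columns with different missing patterns, drop to the jointly non-missing rows and rank those before correlating.

```python
import numpy as np
from scipy.stats import rankdata

def spearman_pair(x, y):
    # Keep only rows where both columns are non-missing, then rank those rows
    valid = np.isfinite(x) & np.isfinite(y)
    rx = rankdata(x[valid])
    ry = rankdata(y[valid])
    return np.corrcoef(rx, ry)[0, 1]

# The example from the comment above
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([1.0, np.nan, np.nan, 2.0, 3.0])
# spearman_pair(a, b) -> 1.0, the correct rank correlation
```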

@dsaxton (Member Author) commented Aug 26, 2019

does the memory footprint take a hit?

I think it would since we're creating a whole new array of ranks. I suppose we could throw away mat once it's been ranked?

@jbrockmendel (Member)

I guess better phrasing would have been "is the memory footprint impact big enough to care about?"

@dsaxton (Member Author) commented Aug 26, 2019

I guess better phrasing would have been "is the memory footprint impact big enough to care about?"

It does basically double memory usage (ex.py runs nancorr_spearman on a 5000 by 100 array of random floats):

(base) dsaxton> python3 -m memory_profiler ex.py 
Filename: ex.py

Line #    Mem usage    Increment   Line Contents
================================================
     9     59.2 MiB     59.2 MiB   @profile
    10                             def func():
    11     59.9 MiB      0.7 MiB       return nancorr_spearman(X)


(pandas-dev) dsaxton> python3 -m memory_profiler ex.py 
Filename: ex.py

Line #    Mem usage    Increment   Line Contents
================================================
     9     89.6 MiB     89.6 MiB   @profile
    10                             def func():
    11     90.3 MiB      0.7 MiB       return nancorr_spearman(X)

@gfyoung (Member) commented Aug 26, 2019

so in some situations this will actually perform slightly worse than the current implementation

Is there a good heuristic that we can use to check this?

@dsaxton (Member Author) commented Aug 26, 2019

so in some situations this will actually perform slightly worse than the current implementation

Is there a good heuristic that we can use to check this?

The worst case would be when every column has a different missing pattern and the first step of precalculating is for naught, and in this case you do an extra k sorts. I think this is a good tradeoff since in the best case you do k instead of k**2 sorts.

When I run that scenario on this branch I do get worse performance (4.74 seconds vs. 4.49 seconds using a 5000 by 100 array), but it's actually almost within a margin of error of the current implementation.
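A back-of-the-envelope way to see the tradeoff (a counting model, not measured behavior; "sorts" here means ranking operations over off-diagonal column pairs):

```python
def rank_ops(k, mismatched_pairs):
    """Count ranking operations for k columns.

    naive: rank both columns inside every pairwise iteration.
    precomputed: k rankings up front, plus two re-rankings for each
    pair whose missing patterns differ.
    """
    npairs = k * (k - 1) // 2              # off-diagonal column pairs
    naive = 2 * npairs
    precomputed = k + 2 * mismatched_pairs
    return naive, precomputed
```

In the best case (no mismatched pairs) this is k rankings instead of k*(k-1); in the worst case (every pair mismatched) it is exactly k extra rankings over the naive approach, matching the "extra k sorts" above.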

@WillAyd (Member) commented Aug 27, 2019

Took me a little bit to grasp what you meant by "missing pattern", but I think I get it now. Wouldn't it be more likely for the NA value locations to vary by column? If so we probably want to rethink the approach here.

@dsaxton (Member Author) commented Aug 27, 2019

Took me a little bit to grasp what you meant by "missing pattern", but I think I get it now. Wouldn't it be more likely for the NA value locations to vary by column? If so we probably want to rethink the approach here.

Hmm, fair point. Since we actually know whether this is true for any pair of columns before we calculate the ranks in the nested loop, we could actually check this condition before deciding if we want to save off the full set of ranks and avoid any extra calculation. The only cost would be the additional memory. I'll try to make an update with that change soon.
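The pairwise-mask check described here can be sketched as follows (plain Python with assumed names, not the actual Cython implementation): reuse the precomputed ranks only when the two columns' missingness masks agree, otherwise re-rank over the jointly valid rows.

```python
import numpy as np
from scipy.stats import rankdata

def spearman_matrix(mat):
    n, k = mat.shape
    mask = np.isfinite(mat)
    # Precompute each column's ranks over its own non-missing rows
    ranked = np.full((n, k), np.nan)
    for j in range(k):
        ranked[mask[:, j], j] = rankdata(mat[mask[:, j], j])
    out = np.empty((k, k))
    for i in range(k):
        for j in range(i + 1):
            valid = mask[:, i] & mask[:, j]
            if np.array_equal(mask[:, i], mask[:, j]):
                # Identical missing patterns: the precomputed ranks are valid
                rx, ry = ranked[valid, i], ranked[valid, j]
            else:
                # Different patterns: re-rank over the jointly valid rows only
                rx = rankdata(mat[valid, i])
                ry = rankdata(mat[valid, j])
            out[i, j] = out[j, i] = np.corrcoef(rx, ry)[0, 1]
    return out

# The distorted-correlation example from earlier now comes out as 1.0
mat = np.column_stack([
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, np.nan, np.nan, 2.0, 3.0],
])
```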

@dsaxton (Member Author) commented Aug 28, 2019

@WillAyd Just pushed a commit that avoids precalculating any of the ranks. Let me know if this is getting a bit too confusing and I can try to rewrite.

Edit: Also reran the benchmarks and still seeing the same speedup.

@WillAyd (Member) commented Aug 30, 2019

Can you add a benchmark that injects missing data into the calculation? I think it would help frame the discussion around performance a little better if that's something that needs to be accounted for

@@ -307,26 +308,44 @@ def nancorr_spearman(const float64_t[:, :] mat, Py_ssize_t minp=1):
result = np.empty((K, K), dtype=np.float64)
mask = np.isfinite(mat).view(np.uint8)

ranked_mat = np.empty((N, K), dtype=np.float64)
cached_ranks = set({})
Review comment from a Member:

IIUC the whole caching thing you have here is to avoid the memory overhead of ranking columns up front right? If so I think it's a premature optimization so would probably just prefer if everything was ranked and we didn't do caching.

@dsaxton (Member Author), Aug 30, 2019:

So this approach would accept a hit on memory by preallocating ranked_mat (even this is probably not necessary), but save time by not calculating ranks up front that we end up not reusing. Instead we wait until we need a full set of ranks for some calculation before computing and saving them.

Reply from a Member:

How much of a time savings is that? In any case your PR should offer a significant speed up by not ranking in each loop so I think this might still be OK.

Reply from the Member Author:

It was actually fairly negligible based on some local testing I did (maybe 5% slowdown), but it's probably worthwhile to add some more cases to stat_ops.py (wide tables, possibly a worst case missing pattern for this change, etc.)

Reply from a Member:

Did you see any perf difference here? Again would rather simplify and not have this if there's no major perf impact

Reply from the Member Author:

I'm actually seeing essentially the same performance from the previous implementation (ranking everything up front) on the "wide with nans" benchmark, so maybe just revert to that approach since the code is a lot easier to understand?

@jreback (Contributor) left a comment:

do the existing benchmarks cover enough here? (e.g. enough matrix sizes)
pls add a perf note

@dsaxton (Member Author) commented Sep 4, 2019

Added a few benchmarks: one with wide data (1000 by 500), one with wide data and random nans, and a memory benchmark on the wide data. Here's the summary:

       before           after         ratio
     [498f3008]       [420b1d6a]
     <master>         <corr-perf>
-        84.1±2ms       7.86±0.5ms     0.09  stat_ops.Correlation.time_corr('spearman', False)
-      83.1±0.5ms       7.57±0.4ms     0.09  stat_ops.Correlation.time_corr('spearman', True)
-       23.3±0.1s       1.86±0.05s     0.08  stat_ops.Correlation.time_corr_wide('spearman', True)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

Couple things that didn't make it into the summary but are interesting:

master

(Not directly relevant to this pull request, but Kendall's tau fails on the wide data. It's not shown here, but it fails without nans as well.)

[ 75.00%] · For pandas commit 498f3008 <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 78.57%] ··· stat_ops.Correlation.peakmem_corr_wide                                                                                                                     ok
[ 78.57%] ··· ========== ======= =======
              --          use_bottleneck
              ---------- ---------------
                method     True   False 
              ========== ======= =======
               spearman   82.2M   82.2M 
               kendall    98.7M   98.7M 
               pearson    81.8M   81.8M 
              ========== ======= =======

[ 92.86%] ··· stat_ops.Correlation.time_corr_wide_nans                                                                                                           2/6 failed
[ 92.86%] ··· ========== ============ ============
              --               use_bottleneck     
              ---------- -------------------------
                method       True        False    
              ========== ============ ============
               spearman    20.1±0s     20.0±0.02s 
               kendall      failed       failed   
               pearson    1.43±0.02s   1.55±0.06s 
              ========== ============ ============

corr-perf

Surprisingly the peak memory usage is not much worse than master. If we add random null values to the data then the time performance is the same as master.

[ 50.00%] · For pandas commit 420b1d6a <corr-perf> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 53.57%] ··· stat_ops.Correlation.peakmem_corr_wide                                                                                                                     ok
[ 53.57%] ··· ========== ======= =======
              --          use_bottleneck
              ---------- ---------------
                method     True   False 
              ========== ======= =======
               spearman    86M     86M  
               kendall    98.7M   98.7M 
               pearson    81.7M   81.6M 
              ========== ======= =======

[ 67.86%] ··· stat_ops.Correlation.time_corr_wide_nans                                                                                                           2/6 failed
[ 67.86%] ··· ========== ============ ============
              --               use_bottleneck     
              ---------- -------------------------
                method       True        False    
              ========== ============ ============
               spearman   20.1±0.1s     20.0±0s   
               kendall      failed       failed   
               pearson    1.42±0.02s   1.41±0.03s 
              ========== ============ ============

@WillAyd (Member) commented Sep 5, 2019

You might need to skip the Kendall method in the benchmarks where it's failing in order to get CI to pass. Can you open a follow-up issue to address it?

@dsaxton (Member Author) commented Sep 5, 2019

You might need to skip the Kendall method in the benchmarks where it's failing in order to get CI to pass. Can you open a follow-up issue to address it?

Is there an easy way to skip or xfail a benchmark in asv?

@WillAyd (Member) commented Sep 5, 2019

Try raising a NotImplementedError

@dsaxton (Member Author) commented Sep 6, 2019

Try raising a NotImplementedError

I was seeing the benchmark still fail after doing this. Feels like a hack, but how about returning None (or maybe making the data just small enough so they all pass)?
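For reference, asv's documented skip mechanism is raising NotImplementedError from setup(): the benchmark is then reported as "n/a" rather than failed. Raising it from the timed method itself is reported as a failure, which may be why the benchmark still failed above. A minimal sketch (class and attribute names here are illustrative, not the actual stat_ops.py contents):

```python
import numpy as np
import pandas as pd

class Correlation:
    # Parameter grid mirroring the benchmarks discussed above
    params = ["spearman", "kendall", "pearson"]
    param_names = ["method"]

    def setup(self, method):
        if method == "kendall":
            # asv treats NotImplementedError raised in setup() as a skip
            raise NotImplementedError
        self.df = pd.DataFrame(np.random.randn(500, 20))

    def time_corr_wide(self, method):
        self.df.corr(method=method)
```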

@dsaxton (Member Author) commented Sep 6, 2019

@WillAyd I lowered the dimensions of the wide benchmarks to 1000 by 300 and nothing fails now; also reverted to the simpler implementation since the performance was about the same. Updated asv results:

corr-perf

[ 54.17%] ··· stat_ops.Correlation.time_corr                                                                                                                             ok
[ 54.17%] ··· ========== ============ ============
              --               use_bottleneck     
              ---------- -------------------------
                method       True        False    
              ========== ============ ============
               spearman   7.68±0.6ms   8.13±0.4ms 
               kendall     193±8ms      193±5ms   
               pearson    2.04±0.1ms   1.99±0.2ms 
              ========== ============ ============

[ 62.50%] ··· stat_ops.Correlation.time_corr_wide                                                                                                                        ok
[ 62.50%] ··· ========== =========== ===========
              --              use_bottleneck    
              ---------- -----------------------
                method       True       False   
              ========== =========== ===========
               spearman    666±30ms    674±20ms 
               kendall    20.0±0.1s   19.7±0.1s 
               pearson     429±30ms    456±20ms 
              ========== =========== ===========

[ 66.67%] ··· stat_ops.Correlation.time_corr_wide_nans                                                                                                                   ok
[ 66.67%] ··· ========== ============ ============
              --               use_bottleneck     
              ---------- -------------------------
                method       True        False    
              ========== ============ ============
               spearman   7.29±0.06s   7.72±0.07s 
               kendall    20.5±0.4s    20.2±0.7s  
               pearson     561±8ms      562±30ms  
              ========== ============ ============

master

[ 79.17%] ··· stat_ops.Correlation.time_corr                                                                                                                             ok
[ 79.17%] ··· ========== ============= ============
              --               use_bottleneck      
              ---------- --------------------------
                method        True        False    
              ========== ============= ============
               spearman     90.7±2ms     90.2±3ms  
               kendall     193±0.9ms     195±3ms   
               pearson    1.90±0.07ms   1.88±0.1ms 
              ========== ============= ============

[ 87.50%] ··· stat_ops.Correlation.time_corr_wide                                                                                                                        ok
[ 87.50%] ··· ========== ============ ============
              --               use_bottleneck     
              ---------- -------------------------
                method       True        False    
              ========== ============ ============
               spearman   8.43±0.08s   8.46±0.1s  
               kendall    19.3±0.03s   19.4±0.06s 
               pearson     438±20ms     445±20ms  
              ========== ============ ============

[ 91.67%] ··· stat_ops.Correlation.time_corr_wide_nans                                                                                                                   ok
[ 91.67%] ··· ========== ============ ===========
              --              use_bottleneck     
              ---------- ------------------------
                method       True        False   
              ========== ============ ===========
               spearman   7.23±0.01s   7.55±0.1s 
               kendall    19.2±0.08s   19.0±0.2s 
               pearson     505±20ms     532±20ms 
              ========== ============ ===========

@jreback (Contributor) commented Sep 6, 2019

pls decrease the size even more

20s is too long

@dsaxton (Member Author) commented Sep 7, 2019

pls decrease the size even more

20s is too long

Updated to 200 columns; the longest benchmark now takes ~8 seconds.

@jreback jreback added this to the 1.0 milestone Sep 7, 2019
@jreback (Contributor) commented Sep 7, 2019

lgtm. over to you @WillAyd

@dsaxton could even reduce the benchmark time further as the relative statistics are still telling. (and we have a ton of benchmarks; it already takes 1 hour + to run them all...)

@WillAyd WillAyd merged commit 7111927 into pandas-dev:master Sep 7, 2019
@WillAyd (Member) commented Sep 7, 2019

Thanks @dsaxton very nice PR. Would certainly love any follow ups or improvements you can offer!

@dsaxton dsaxton deleted the corr-perf branch September 8, 2019 01:20
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Labels: Performance (Memory or execution speed performance)

Successfully merging this pull request may close these issues.

DataFrame.corr(method="spearman") much slower than DataFrame.rank().corr(method="pearson")