PERF: Speed up Spearman calculation #28151
Conversation
Thanks for the PR - can you provide a code sample for the caveat mentioned? Not totally sure I follow
does the memory footprint take a hit?
Sure, if we rank first and then compute the correlation, a difference in missing patterns can cause distortions like in this example (where the rank correlation should be one):
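The kind of example being described can be sketched as follows (this is an illustration I'm supplying, not the original poster's snippet): two columns that agree in order on every row where both are observed, so their Spearman correlation should be exactly 1, yet correlating pre-computed full-column ranks over the common rows gives something less than 1.

```python
import numpy as np
import pandas as pd

# x and y are in the same order on the rows where both are observed,
# so their Spearman correlation should be exactly 1.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 3.0],
    "y": [1.0, 2.0, 2.5, 3.0],
})

# Ranking each column up front (over its own non-missing rows) ...
ranks = df.rank()
# ... then correlating those ranks over the common rows gives != 1,
# because row 2 inflates y's rank in row 3 from 3 to 4.
common = df.notna().all(axis=1)
distorted = ranks[common].corr(method="pearson").loc["x", "y"]

# Pairwise handling (drop the row first, then rank) recovers 1.
correct = df.corr(method="spearman").loc["x", "y"]
print(distorted, correct)  # distorted is ~0.98, correct is 1.0
```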
I think this is the reason why the ranking is done inside the loop, but we want to avoid this step and rely on precalculated ranks if we can.
I think it would since we're creating a whole new array of ranks. I suppose we could throw away
I guess better phrasing would have been "is the memory footprint impact big enough to care about?"
It does basically double memory usage.
Is there a good heuristic that we can use to check this?
The worst case would be when every column has a different missing pattern and the first step of precalculating is for naught; in this case you do extra up-front work. When I run that scenario on this branch I do get worse performance (4.74 seconds vs. 4.49 seconds using a 5000 by 100 array), but it's actually almost within a margin of error of the current implementation.
Took me a little bit to grasp what you meant by "missing pattern", but I think I get it now. Wouldn't it be more likely for the NA value locations to vary by column? If so we probably want to rethink the approach here.
Hmm, fair point. Since we actually know whether this is true for any pair of columns before we calculate the ranks in the nested loop, we could actually check this condition before deciding if we want to save off the full set of ranks and avoid any extra calculation. The only cost would be the additional memory. I'll try to make an update with that change soon.
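The pairwise check described here can be sketched like this (the function name is made up for illustration; the real code lives in Cython in `nancorr_spearman` and works off the same `np.isfinite` mask): full-column ranks are only safe to reuse for a pair `(i, j)` when both columns are observed on exactly the same rows.

```python
import numpy as np

def same_missing_pattern(mask, i, j):
    """mask[r, c] is True where entry (r, c) is non-missing."""
    return bool(np.array_equal(mask[:, i], mask[:, j]))

mat = np.array([[1.0, 1.0, 1.0],
                [np.nan, 2.0, np.nan],
                [3.0, 3.0, 3.0]])
mask = np.isfinite(mat)
print(same_missing_pattern(mask, 0, 2))  # True: NaN in the same row
print(same_missing_pattern(mask, 0, 1))  # False: row 1 differs
```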
@WillAyd Just pushed a commit that avoids precalculating any of the ranks. Let me know if this is getting a bit too confusing and I can try to rewrite. Edit: Also reran the benchmarks and still seeing the same speedup.
Can you add a benchmark that injects missing data into the calculation? I think it would help frame the discussion around performance a little better if that's something that needs to be accounted for
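A benchmark along the lines being requested might look like this (a hypothetical asv-style sketch; the class and method names are made up, though they follow the conventions of pandas' `asv_bench` suite, where asv times any `time_*` method after calling `setup`):

```python
import numpy as np
import pandas as pd

class CorrelationWithNans:
    def setup(self):
        rng = np.random.RandomState(42)
        arr = rng.randn(1000, 200)
        # knock out ~10% of the entries at random
        arr[rng.rand(*arr.shape) < 0.1] = np.nan
        self.df = pd.DataFrame(arr)

    def time_corr_spearman(self):
        self.df.corr(method="spearman")
```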
pandas/_libs/algos.pyx
@@ -307,26 +308,44 @@ def nancorr_spearman(const float64_t[:, :] mat, Py_ssize_t minp=1):
    result = np.empty((K, K), dtype=np.float64)
    mask = np.isfinite(mat).view(np.uint8)

    ranked_mat = np.empty((N, K), dtype=np.float64)
    cached_ranks = set({})
IIUC the whole caching thing you have here is to avoid the memory overhead of ranking columns up front right? If so I think it's a premature optimization so would probably just prefer if everything was ranked and we didn't do caching.
So this approach would accept a hit on memory by preallocating ranked_mat (even this is probably not necessary), but save time by not calculating ranks up front that we end up not reusing. Instead we wait until we need a full set of ranks for some calculation before computing and saving them.
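That lazy strategy can be sketched in plain Python as below (names are hypothetical; the real code is Cython in `pandas/_libs/algos.pyx` and handles ties and missing values, which this toy ranking does not):

```python
import numpy as np

def get_full_ranks(mat, ranked_mat, cached_cols, col):
    # Rank a column only the first time some pair needs its
    # full-column ranks, then serve the cached copy afterwards.
    if col not in cached_cols:
        order = np.argsort(mat[:, col])
        # simple 1..N ranking: smallest value gets rank 1
        # (no tie or NaN handling, unlike the real implementation)
        ranked_mat[order, col] = np.arange(1, mat.shape[0] + 1)
        cached_cols.add(col)
    return ranked_mat[:, col]

mat = np.array([[3.0, 10.0],
                [1.0, 30.0],
                [2.0, 20.0]])
ranked_mat = np.empty_like(mat)
cached = set()
print(get_full_ranks(mat, ranked_mat, cached, 0))  # [3. 1. 2.]
```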
How much of a time savings is that? In any case your PR should offer a significant speed up by not ranking in each loop so I think this might still be OK.
It was actually fairly negligible based on some local testing I did (maybe a 5% slowdown), but it's probably worthwhile to add some more cases to stat_ops.py (wide tables, possibly a worst case missing pattern for this change, etc.)
Did you see any perf difference here? Again would rather simplify and not have this if there's no major perf impact
I'm actually seeing essentially the same performance from the previous implementation (ranking everything up front) on the "wide with nans" benchmark, so maybe just revert to that approach since the code is a lot easier to understand?
do the existing benchmarks cover enough here? (e.g. enough matrix sizes)
pls add a perf note
Added a couple benchmarks, one with wide data (1000 by 500), wide data with random
Couple things that didn't make it into the summary but are interesting:
(Not directly relevant to this pull request, but Kendall's tau fails on the wide data. It's not shown here, but it fails without
Surprisingly the peak memory usage is not much worse than
You might need to skip the Kendall method in the benchmarks where it's failing to get CI to pass. Can you open up a follow-up issue to address?
Is there an easy way to skip or xfail a benchmark in
Try raising a
I was seeing the benchmark still fail after doing this. Feels like a hack, but how about returning
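For reference, asv's documented way to skip a parameter combination is to raise NotImplementedError from setup (even though, as noted above, it didn't seem to work in this case). A hypothetical sketch, with made-up class and method names:

```python
import numpy as np
import pandas as pd

class Correlation:
    # asv treats NotImplementedError raised in setup as
    # "skip this parameter combination" rather than a failure
    params = ["spearman", "kendall"]
    param_names = ["method"]

    def setup(self, method):
        if method == "kendall":
            raise NotImplementedError("kendall is too slow on wide data")
        self.df = pd.DataFrame(np.random.randn(500, 50))

    def time_corr(self, method):
        self.df.corr(method=method)
```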
@WillAyd I lowered the dimensions of the wide benchmarks to 1000 by 300 and nothing fails now; also reverted to the simpler implementation since the performance was about the same. Updated
pls decrease the size even more, 20s is too long
Updated to 200 columns, longest benchmark takes ~8 seconds
Thanks @dsaxton very nice PR. Would certainly love any follow ups or improvements you can offer!
Saves off ranks for later reuse once they're calculated during the computation of DataFrame.corr(method="spearman") so as to speed up the calculation. This PR ranks data outside the nested loop in nancorr_spearman (closes #28139). The performance benchmarks show a nice improvement. (One caveat here is that when two columns have different missing patterns we actually need to recompute ranks over the rows where both are non-missing, so in some situations this will actually perform slightly worse than the current implementation.)