-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) #46174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
do we have asv's for this? |
I haven't used |
@jreback I'm having difficulty getting I did the following to install and prep:
Attempting to run from any
then it fails with the following error:
What am I doing wrong? I was able to build my fork of Pandas locally using setup.py. I also tried with |
Separately, I do see one test consistently failing in the automated suite: |
@jreback looks like the full asv benchmark ran via GitHub Actions, so we have results there. also looks like all the other GHA tests passed (except for an an Azure build which timed out) |
@fractionalhare a specific asv for this case (I think we do have it but pls check). then show the asv results |
Yep, I see it here: https://github.com/pandas-dev/pandas/blob/main/asv_bench/benchmarks/stat_ops.py#L124. I'll run and add results here. |
@jreback I can't get
However, you can see the results of just the relevant benchmark from the automated test suite at this line: |
Adding to discussion here: is it worth adding more tests of |
yes |
@fractionalhare show the asv result above (or at the very least show a timeit with before/after of the sample) |
…e/pandas into corrwith-perf-improv merge
Okay, I'll add to this PR. |
Here are the
There are only two that apply to this function. And here are my timeit comparisons. I created a 100 x 50,000 DataFrame to test, then pickled it and read it back in to run the timeit test with stable Pandas and my build with this change. Tests were executed like so:
Results before:
Results after:
|
If I instead do it on a small, 100 x 100 DataFrame, I get similar results in terms of percent change. Results before:
Results after:
In both cases (small and large), it's a reduction of 40% - 50% execution time for Pearson and 70% - 80% for Spearman. |
Unclear why the ASV benchmark failed, a test in unrelated/unmodified code triggered an
Tests were all green on this PR prior to me pulling main just now, so maybe this is ephemeral? |
@jreback Apologies since this is my first time using My last commit with an actual code change to the Am I interpreting this correctly? Other than that I have some perf results for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also pls add a whtasnew note in 1.5 in the perf improvements section
…e/pandas into corrwith-perf-improv merge
@jreback In addition to the changes you asked for and the perf tests I've shown with
which will output
Likewise if we do with Spearman instead of Pearson:
we have
|
thanks @fractionalhare |
…n other is a Series and axis = 0 (column-wise) (pandas-dev#46174)
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" This reverts commit 5efb570.
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" This reverts commit 5efb570.
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" This reverts commit 5efb570.
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" This reverts commit 5efb570.
…tion when other is a Series and axis = 0 (column-wise) (#46174)" (#49140) * Update frame.py * Revert "PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)" This reverts commit 5efb570. * fix GH issue number in test * add test from original issue * skip if no scipy * add extra test case Co-authored-by: Yuanhao Geng <[email protected]> Co-authored-by: MarcoGorelli <>
…r pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)"
…hod for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)") (#49151) Backport PR #49140: Revert "PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (#46174)" Co-authored-by: Marco Edward Gorelli <[email protected]>
…tion when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" (pandas-dev#49140) * Update frame.py * Revert "PERF: faster corrwith method for pearson and spearman correlation when other is a Series and axis = 0 (column-wise) (pandas-dev#46174)" This reverts commit 5efb570. * fix GH issue number in test * add test from original issue * skip if no scipy * add extra test case Co-authored-by: Yuanhao Geng <[email protected]> Co-authored-by: MarcoGorelli <>
Description of Changes
This PR modifies the
corrwith()
method defined in pandas/core/frame.py to be faster when all of the following (common) conditions are met.axis
argument is set to 0other
argument is a Pandas Series objectmethod
argument is equal to "pearson" OR "spearman"When all these are true, we replace the
apply()
onself
with column-wise iteration on NumPy arrays and the correlation function againstother
is implemented in NumPy'snp.corrcoef()
. If not all of those conditions are met, the method falls back to its existing functionality with no performance change. The Spearman rank correlation is calculated vianp.corrcoef()
by taking the Pearson correlation coefficient of the vector element ranks withargsort()
.When
method
= "pearson", this change reduces execution time by 40 - 50% on small and large DataFrame objects.When
method
= "spearman", this change reduces execution time by 70 - 80% on small and large DataFrame objects.This change is still robust to null values on the left or right side despite the fact that
np.corrcoef()
is not robust to nulls, because it masks the null values with a boolean&
across both objects when iterating the vectors. It returns accurate results.Testing Environment & Performance Assessment
My testing environment is as follows:
My test script was the following:
First I created a fresh conda environment with the current stable version of Pandas (v1.41). For a DataFrame with 100 rows and 50,000 columns, using unmodified
corrwith()
withmethod="pearson"
took approximately 3.0697 seconds on average across 10 runs each on 10 different randomly generated DataFrames. Withmethod="spearman"
, the same evaluation took approximately 8.5808 seconds on average.Then I made my changes to my fork of Pandas main and built it in a new Conda environment, then ran the same script. For an equivalently sized DataFrame with random numerical data, the modified
corrwith()
withmethod="pearson"
took approximately 1.5342 seconds. Withmethod="spearman"
, it took approximately 2.0556 seconds.