ENH: Enable DataFrame.corrwith to compute rank correlations #22375
Conversation
since you are changing the impl can you run the asv for perf check
I'm having some trouble finding a relevant benchmark for corrwith.
Here is one I put together which seems to suggest around a 33-50% speed-up with the new implementation.

import numpy as np
from pandas import DataFrame, Series

np.random.seed(2357)


def corrwith_old(df1, df2, axis=0, drop=False):
    axis = df1._get_axis_number(axis)
    this = df1._get_numeric_data()

    if isinstance(df2, Series):
        return this.apply(df2.corr, axis=axis)

    df2 = df2._get_numeric_data()
    left, right = this.align(df2, join='inner', copy=False)

    # mask missing values
    left = left + right * 0
    right = right + left * 0

    if axis == 1:
        left = left.T
        right = right.T

    # demeaned data
    ldem = left - left.mean()
    rdem = right - right.mean()

    num = (ldem * rdem).sum()
    dom = (left.count() - 1) * left.std() * right.std()

    correl = num / dom

    if not drop:
        raxis = 1 if axis == 0 else 0
        result_index = this._get_axis(raxis).union(df2._get_axis(raxis))
        correl = correl.reindex(result_index)

    return correl


def corrwith_new(df1, df2, axis=0, drop=False, method='pearson'):
    if method not in ['pearson', 'spearman', 'kendall']:
        raise ValueError("method must be either 'pearson', "
                         "'spearman', or 'kendall', '{method}' "
                         "was supplied".format(method=method))

    axis = df1._get_axis_number(axis)
    this = df1._get_numeric_data()

    if isinstance(df2, Series):
        return this.apply(lambda x: df2.corr(x, method=method),
                          axis=axis)

    df2 = df2._get_numeric_data()
    left, right = this.align(df2, join='inner', copy=False)

    if axis == 1:
        left = left.T
        right = right.T

    correl = left.apply(lambda x: x.corr(right[x.name], method=method))

    if drop:
        correl.dropna(inplace=True)

    return correl


data1 = DataFrame(np.random.normal(size=(10**6, 10)))
data2 = DataFrame(np.random.normal(size=(10**6, 10)))

%timeit corrwith_old(data1, data2)
%timeit corrwith_new(data1, data2)
can u add this as an asv? in particular I suspect your method will slow down when used with a larger number of columns, so pls show that as well
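For concreteness, an asv benchmark along those lines might look roughly like the sketch below. The class name, parameter values, and sizes are illustrative only (not necessarily what ended up in asv_bench/benchmarks/stat_ops.py); the 100-column case is there to probe the many-columns concern.

import numpy as np
import pandas as pd


class CorrwithSketch:
    # hypothetical asv benchmark: parameterize over correlation method and
    # frame width so any slowdown with many columns shows up directly
    params = [["pearson", "spearman", "kendall"], [10, 100]]
    param_names = ["method", "ncols"]

    def setup(self, method, ncols):
        nrows = 50000
        self.df = pd.DataFrame(np.random.randn(nrows, ncols))
        self.df2 = pd.DataFrame(np.random.randn(nrows, ncols))

    def time_corrwith(self, method, ncols):
        self.df.corrwith(self.df2, method=method)

Running something like asv continuous -f 1.1 upstream/master HEAD -b stat_ops against master would then show the Pearson timings side by side.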
Codecov Report

@@            Coverage Diff             @@
##           master   #22375       +/-   ##
============================================
- Coverage   92.31%   31.89%    -60.43%
============================================
  Files         166      166
  Lines       52335    52429        +94
============================================
- Hits        48313    16722     -31591
- Misses       4022    35707     +31685

Continue to review full report at Codecov.
I've added the new benchmark function time_corrwith.
@jreback I think your suspicion may have been accurate regarding the dimension of the inputs. In any case, for the tests it looks like there's an issue with SciPy not being available.
@dsaxton we test quite a few different configurations, not all of which have SciPy installed. If your tests depend on that package you should use the td.skip_if_no_scipy decorator.
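For anyone following along, applying that would look roughly like the snippet below. This is a sketch only: the test name and contents are made up, and it assumes the decorator meant here is td.skip_if_no_scipy from pandas.util._test_decorators.

import numpy as np
import pandas as pd
import pandas.util._test_decorators as td
import pandas.util.testing as tm


@td.skip_if_no_scipy  # rank correlations go through SciPy, which not all builds have
def test_corrwith_spearman_sketch():
    # hypothetical test, just to show where the skip decorator goes
    df = pd.DataFrame(np.random.random(size=(100, 3)))
    result = df.corrwith(df, method="spearman")
    tm.assert_almost_equal(result.values, np.ones(3))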
asv_bench/benchmarks/stat_ops.py (Outdated)

    def time_corr(self, method):
        self.df.corr(method=method)

    def time_corrwith(self, method):
        self.df.corrwith(df2, method=method)
df2 should be self.df2 here.
Good catch, thank you.
@jreback If a certain benchmark is actually part of the PR, what would you say is the most straightforward way to show that there isn't a degradation in performance? Here the new benchmark can't run on master, so there's nothing to compare against directly.
You should have benchmarks for all three methods. The two new ones will fail since they don't exist on master, but that's OK - it still sets a baseline going forward, and we can still get insights out of the Pearson benchmark.
Here are the results I get running the asv benchmarks.
I modified the body of the function to mimic how the current calculation is being done in the special case where method='pearson':

if method in ['spearman', 'kendall']:
    correl = (left.apply(lambda x:
                         x.corr(right[x.name], method=method)))
else:
    correl = (((right - right.mean()) * (left - left.mean())).mean()
              / right.std() / left.std())
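As a quick standalone sanity check (not part of the PR) that the apply-based path and the vectorized num/dom calculation from the existing implementation agree when method='pearson', something like this can be run; the sizes and seed are arbitrary.

import numpy as np
import pandas as pd

np.random.seed(0)
left = pd.DataFrame(np.random.randn(1000, 5))
right = pd.DataFrame(np.random.randn(1000, 5))

# per-column Series.corr, as in the apply-based branch
by_apply = left.apply(lambda x: x.corr(right[x.name], method='pearson'))

# vectorized Pearson, as in the existing corrwith implementation
ldem = left - left.mean()
rdem = right - right.mean()
num = (ldem * rdem).sum()
dom = (left.count() - 1) * left.std() * right.std()
vectorized = num / dom

assert np.allclose(by_apply.values, vectorized.values)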
pls add a whatsnew note in other enhancements
        method : {'pearson', 'kendall', 'spearman'}
            * pearson : standard correlation coefficient
            * kendall : Kendall Tau correlation coefficient
            * spearman : Spearman rank correlation

        Returns
        -------
        correls : Series
        """
can you add a See Also with a reference to .corr (and add to .corr a See Also referring to .corrwith).
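i.e. something roughly like this in the corrwith docstring (a sketch of a numpydoc See Also section; the exact wording that landed in the PR may differ):

def corrwith(self, other, axis=0, drop=False, method='pearson'):
    """
    Compute pairwise correlation between rows or columns of two
    DataFrame objects.

    ...

    See Also
    --------
    DataFrame.corr : Compute pairwise correlation of columns.
    """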
what do the asv's show?
pls add a whatsnew as well (with both issues)
don't post pictures of the asv's, just copy-paste the text pls.
Duplicate column names don't necessarily mean duplicate values. We handle this correctly on master, so you should be able to add a test on master that passes, and ensure it still passes on your branch.
…On Thu, Aug 23, 2018 at 7:49 AM Daniel Saxton wrote:

In pandas/core/frame.py <#22375 (comment)>:

         if axis == 1:
             left = left.T
             right = right.T

-        # demeaned data
-        ldem = left - left.mean()
-        rdem = right - right.mean()
-
-        num = (ldem * rdem).sum()
-        dom = (left.count() - 1) * left.std() * right.std()
+        correl = (left.apply(lambda x:
+                             x.corr(right[x.name], method=method)))

I'll test that out. What would you say the expected behavior should be in this case? Would it make sense to first drop duplicate columns from self and other?
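For what it's worth, a test exercising duplicate column labels that hold distinct values might look roughly like this (a sketch only; the test name, data, and expected values are illustrative, and df.corrwith(df) is the case that already works on master):

import numpy as np
import pandas as pd
import pandas.util.testing as tm


def test_corrwith_duplicate_column_labels_sketch():
    # hypothetical test: the two 'a' columns share a label but hold
    # different data; each column still correlates perfectly with itself
    df = pd.DataFrame([[1, 2], [3, 5], [6, 7]], columns=["a", "a"])
    result = df.corrwith(df)
    expected = pd.Series([1.0, 1.0], index=["a", "a"])
    tm.assert_series_equal(result, expected)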
@TomAugspurger I just pushed some changes that include the above. Regarding the test, I am creating a DataFrame with a duplicated column like so:

def test_corrwith_dup_cols(self):
    # GH 21925
    df1 = pd.DataFrame(np.vstack([np.arange(10)] * 3).T)
    df2 = df1.copy()
    df2 = pd.concat((df2, df2[0]), axis=1)

    result = df1.corrwith(df2).values
    expected = np.ones(4)
    tm.assert_almost_equal(result, expected)
can you rebase
Hello @dsaxton! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on December 28, 2018 at 02:42 UTC
what is the perf diff with this (also add something with a lot of columns, like 100) to see what effect that has
pandas/core/frame.py (Outdated)

            num = (ldem * rdem).sum()
            dom = (left.count() - 1) * left.std() * right.std()
            correl = Series(map(lambda x: Series(x[0]).corr(Series(x[1]),
instead of nesting this, can you create an in-line named function that you pass to the map
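i.e. roughly the following shape (a self-contained sketch of the suggested refactor; the helper name c and the surrounding variables are stand-ins for what is inside corrwith, not the merged code):

import numpy as np
from pandas import DataFrame, Series

left = DataFrame(np.random.randn(100, 3))
right = DataFrame(np.random.randn(100, 3))
method = "spearman"


def c(x):
    # named helper passed to map, instead of nesting a lambda in the call
    return Series(x[0]).corr(Series(x[1]), method=method)


correl = Series(map(c, zip(left.values.T, right.values.T)),
                index=left.columns)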
can you merge master and update per the comments
Looks like there's a linting error:
pandas/core/frame.py (Outdated)

            dom = (left.count() - 1) * left.std() * right.std()
            correl = num / dom

        else:
do elif method in ['...'], and add an else clause that raises (when a wrong method is passed)
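i.e. a control flow along these lines (a minimal sketch with placeholder return values standing in for the two calculation paths; the callable branch reflects the later note about supporting callables, and the exact error message is illustrative):

def corrwith_dispatch_sketch(method):
    # sketch of the suggested if/elif/else structure for the method argument
    if method == "pearson":
        return "vectorized Pearson path"
    elif method in ["kendall", "spearman"] or callable(method):
        return "per-column apply path"
    else:
        raise ValueError("invalid method {method} was passed, valid methods "
                         "are 'pearson', 'kendall', 'spearman', or a "
                         "callable".format(method=method))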
    @pytest.mark.xfail
    def test_corrwith_dup_cols(self):
        # GH 21925
        df1 = pd.DataFrame(np.vstack([np.arange(10)] * 3).T)
can you add an example with an empty frame.
A test to ensure that the output is an empty Series (with the proper index)?
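Something like the following, perhaps (a sketch; the test name is made up, and the assertions are deliberately loose — a real test would also pin down the expected index and dtype):

import pandas as pd


def test_corrwith_empty_frame_sketch():
    # hypothetical test: corrwith on an empty frame should return an
    # empty Series rather than raising
    df = pd.DataFrame()
    result = df.corrwith(df)
    assert isinstance(result, pd.Series)
    assert len(result) == 0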
pandas/tests/frame/test_analytics.py (Outdated)

@@ -466,6 +466,33 @@ def test_corrwith_mixed_dtypes(self):
        expected = pd.Series(data=corrs, index=['a', 'b'])
        tm.assert_series_equal(result, expected)

    @pytest.mark.xfail
Perhaps I'm missing something, but why is this xfailed?
I had xfailed this because master was giving an error when columns were duplicated (ValueError: cannot reindex from a duplicate axis).
Could you rewrite the test to not have duplicate labels?
But I notice now that on master we do allow duplicates in DataFrame.corrwith:

In [6]: df = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 0], index=[1, 1])

In [7]: df.corrwith(df)
Out[7]:
0    1.0
0    1.0
dtype: float64
so we may need to adjust the implementation to not regress on that. It'd be good to add some tests for that if they aren't already present.
Hmm, I wonder why that code sample works but this one gives an error:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.random(size=(10, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.random.random(size=(10, 2)), columns=["a", "a"])
df1.corrwith(df2)
@TomAugspurger By duplicate labels do you mean the values within the DataFrame themselves (i.e., not the indices or column names)?
can you merge master, also update to handle duplicates
* Remove incorrect error (didn't account for callables)
* Add xfail to duplicate columns test
* Fix transpose (was taken twice for Pearson)
* Remove inplace usage for dropna
* Check for invalid method
* Do not cast arrays to Series in function c
minor comment. ping on green.
* Add comment for when drop is False
* Check if len(idx_diff) > 0
* Remove unnecessary string casting in error message
lgtm. ping on green.
@jreback Looks like it's passing. Thanks for your help and patience!
thanks @dsaxton
This PR enables DataFrame.corrwith to calculate rank correlations in addition to Pearson's correlation (and should hopefully be fully backwards compatible), as well as clarifying functionality in the docstring. An error message is generated if the user specifies a form of correlation that isn't implemented. Tests were added for the new behavior.

closes #22328
closes #21925
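For illustration, the enhancement can be exercised roughly as follows (the frames and column names are made up for the example; the rank methods rely on SciPy being available):

import numpy as np
import pandas as pd

np.random.seed(123)
df1 = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.random.randn(100, 3), columns=["a", "b", "c"])

# unchanged default: column-by-column Pearson correlation
pearson = df1.corrwith(df2)

# new with this PR: rank correlations via the method keyword
spearman = df1.corrwith(df2, method="spearman")
kendall = df1.corrwith(df2, method="kendall")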