-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Adds triangular option to DataFrame.corr, closes #22840 #22842
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hello @dsaxton! Thanks for submitting the PR.
|
tri : boolean, default : False | ||
Whether or not to return the lower triangular correlation | ||
matrix | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a versionadded directive here?
pandas/tests/frame/test_analytics.py
Outdated
@@ -138,6 +138,12 @@ def test_corr_invalid_method(self): | |||
with tm.assert_raises_regex(ValueError, msg): | |||
df.corr(method="____") | |||
|
|||
def test_corr_tri(self): | |||
# GH PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should reference this issue number (22840)
Some general comments:
|
@mroeschke Good points, I can change the argument to "upper", "lower" or None rather than True or False. It's interesting that |
Codecov Report
@@ Coverage Diff @@
## master #22842 +/- ##
=========================================
Coverage ? 90.61%
=========================================
Files ? 169
Lines ? 50835
Branches ? 0
=========================================
Hits ? 46062
Misses ? 4773
Partials ? 0
Continue to review full report at Codecov.
|
pandas/core/frame.py
Outdated
corr_mat = self._constructor(correl, index=idx, columns=cols) | ||
|
||
if tri is not None: | ||
mask = np.tril(np.ones_like(corr_mat, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use correl
instead of corr_mat
here to avoid the conversion hit of DataFrame
to numpy array
when np.ones_like
is called? (I think correl
should be a numpy array here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, nice catch. correl
should already be a numpy
array.
@@ -138,6 +138,18 @@ def test_corr_invalid_method(self): | |||
with tm.assert_raises_regex(ValueError, msg): | |||
df.corr(method="____") | |||
|
|||
def test_corr_tril(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these tests, it's better to build an expected DataFrame output and compare it to a resulting DataFrame. You can build a DataFrame that gives a trivial correlation matrix (like a np.ones
DataFrame)
result = df.corr(tri=...)
expected = pd.DataFrame(...)
tm.assert_frame_equal(result, expected)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the tests according to this idea. We're still using DataFrames
inside ones_like
, but I figured this is okay since they're small test cases.
pandas/core/frame.py
Outdated
@@ -6686,6 +6686,12 @@ def corr(self, method='pearson', min_periods=1): | |||
to have a valid result. Currently only available for pearson | |||
and spearman correlation | |||
|
|||
tri : {'upper', 'lower'} or None, default : None | |||
Whether or not to return the upper / lower triangular | |||
correlation matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add an explanation for why a user might want to do this? And an example below using upper or lower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, you mean within the docstring itself right?
Yes.
…On Thu, Sep 27, 2018 at 8:37 AM Daniel Saxton ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/frame.py
<#22842 (comment)>:
> @@ -6686,6 +6686,12 @@ def corr(self, method='pearson', min_periods=1):
to have a valid result. Currently only available for pearson
and spearman correlation
+ tri : {'upper', 'lower'} or None, default : None
+ Whether or not to return the upper / lower triangular
+ correlation matrix
Sure, you mean within the docstring itself right?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#22842 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIk8oEsMuV4-Al9Lqc3WGAyFg3c1jks5ufNSSgaJpZM4W7Um_>
.
|
k=-1) | ||
|
||
if tri == 'lower': | ||
return corr_mat.where(mask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the speed is probably important, but corr+np.tril(corr*np.nan)
is probably faster than indexing.
NumPy's tril/u are standard matrix algebra ops. In the use case argued for this api change, the reason is to have simple ops on the correlation to find certain features. Presumably you want to not double count. I don't really see why a masked triangular is the right approach to the stated problem. For example, if I wanted to find the variable with the highest correlation, I would be just as happy to have all elements except the diagonal. If I want to find the variable with the highest average correlation to the others, I need a full matrix except the diagonal to make it easy to call sum()
|
@bashtage It allows for simple one-liners like |
@TomAugspurger Regarding the docstring example, should it be something as in the most recent commit, or perhaps just the full lower triangular form itself? |
np.random.seed(1)
x = np.random.standard_normal((20,5))+3*np.random.standard_normal((20,1))
df = pd.DataFrame(x,columns = ['c{0}'.format(i) for i in range(5)])
corr_mat = df.corr()
mask = ~np.triu(np.ones((5,5),dtype=np.bool))
c2 = corr_mat.where(mask)
c2
c2.where(c2.abs() > 0.9).dropna(how="all", axis=1).columns shows
But c4 has a correlation > 0.9, as does c3. Masking the diagonal mask = np.eye(5, dtype=np.bool)
c3 = corr_mat.where(mask)
c3.where(c3.abs() > 0.9).dropna(how="all", axis=1).columns returns the list of variables that have any correlation > 0.9 without inducing an order dependnce
|
@bashtage The reason for In your example we would select all but |
Here is a better example of why this mask can be very misleading. Suppose the correlation matrix is
Looking at a triangular view (lower) would delete all but the last. Pivoting this column to be in the middle makes the selection used either triangular method even more problematic. My main point is that this seems to be a non-general mask that is not even well suited to removing variables that are highly correlated . One algorithm used in some LASSO estimators does thie following -- ding the pair with the highest correlation and then remove the the element that has the highest correaltion with the remaining -- this isn't easy to implement with a lower triangular masked view. |
I am not really in love with this option. I would prefer this simply be a cookbook example. |
@jreback Ok, I will close the PR and do that instead. |
Adds the
tri
argument toDataFrame.corr
.git diff upstream/master -u -- "*.py" | flake8 --diff