Adds triangular option to DataFrame.corr, closes #22840 #22842


Closed
wants to merge 9 commits into from

Conversation

dsaxton
Member

@dsaxton dsaxton commented Sep 26, 2018

Adds the tri argument to DataFrame.corr.

@pep8speaks

Hello @dsaxton! Thanks for submitting the PR.

tri : boolean, default : False
Whether or not to return the lower triangular correlation
matrix

Member

Can you add a versionadded directive here?

@@ -138,6 +138,12 @@ def test_corr_invalid_method(self):
with tm.assert_raises_regex(ValueError, msg):
df.corr(method="____")

def test_corr_tri(self):
# GH PR
Member

This should reference this issue number (22840)

@mroeschke
Member

Some general comments:

  1. If we were to include a tri argument, I would prefer a tri=None(default)/'upper'/'lower' instead of tri=True/False for just the lower triangular so users can have the option of the upper triangular.
  2. In your implementation, it appears the fill value would be NaN, as opposed to numpy (triu/tril) where the fill value is 0. I am not sure what the preferred fill value should be, but in general we try to stay consistent with numpy conventions.
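The fill-value difference in point 2 can be seen side by side. A minimal sketch on a toy all-ones matrix (not the PR's code):

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame(np.ones((3, 3)), index=list("abc"), columns=list("abc"))

# numpy's tril fills the masked (upper) entries with 0
print(np.tril(corr.values))

# the where-based approach in the PR fills them with NaN instead
mask = np.tril(np.ones(corr.shape, dtype=bool))
print(corr.where(mask))
```

Both keep the lower triangle (including the diagonal) intact; they differ only in what the discarded entries become.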

@dsaxton
Member Author

dsaxton commented Sep 26, 2018

@mroeschke Good points, I can change the argument to "upper", "lower" or None rather than True or False. It's interesting that numpy uses zeros for omitted entries, would be curious to hear how others feel.
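The revised API being discussed might look roughly like the following standalone helper. `corr_tri` is a hypothetical name, and the `k=-1`/`k=1` offsets (which exclude the diagonal) are inferred from the PR's diff hunks:

```python
import numpy as np
import pandas as pd

def corr_tri(df, tri=None):
    """Sketch of the proposed tri=None/'upper'/'lower' API.

    Hypothetical helper, not the actual pandas implementation.
    """
    correl = df.corr()
    if tri is None:
        return correl
    if tri not in ("upper", "lower"):
        raise ValueError("tri must be 'upper', 'lower' or None")
    if tri == "lower":
        # k=-1 keeps only entries strictly below the diagonal
        mask = np.tril(np.ones(correl.shape, dtype=bool), k=-1)
    else:
        # k=1 keeps only entries strictly above the diagonal
        mask = np.triu(np.ones(correl.shape, dtype=bool), k=1)
    return correl.where(mask)
```

With `tri=None` it degrades to a plain `df.corr()`, which matches the default-preserving behaviour mroeschke asked for.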

change tri options from boolean to string
modify tests
add versionadded tag
@codecov

codecov bot commented Sep 26, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a03d953). Click here to learn what that means.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #22842   +/-   ##
=========================================
  Coverage          ?   90.61%           
=========================================
  Files             ?      169           
  Lines             ?    50835           
  Branches          ?        0           
=========================================
  Hits              ?    46062           
  Misses            ?     4773           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.61% <90%> (?)
Impacted Files Coverage Δ
pandas/core/frame.py 97.16% <90%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a03d953...9df2bcb. Read the comment docs.

@dsaxton dsaxton changed the title Adds lower triangular option to DataFrame.corr, closes #22840 Adds triangular option to DataFrame.corr, closes #22840 Sep 26, 2018
update whatsnew with argument type change
corr_mat = self._constructor(correl, index=idx, columns=cols)

if tri is not None:
mask = np.tril(np.ones_like(corr_mat,
Member

Is it possible to use correl instead of corr_mat here to avoid the conversion hit of DataFrame to numpy array when np.ones_like is called? (I think correl should be a numpy array here)

Member Author

Yes, nice catch. correl should already be a numpy array.
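The point of the review comment is that building the mask from the ndarray skips a round trip from DataFrame back to array. A small sketch, with a stand-in array for `correl`:

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, 0.5], [0.5, 1.0]])   # stand-in for correl
corr_mat = pd.DataFrame(values, index=["a", "b"], columns=["a", "b"])

# ones_like on the ndarray avoids converting corr_mat back to an array
mask = np.tril(np.ones_like(values, dtype=bool), k=-1)
print(corr_mat.where(mask))
```

The resulting mask is identical either way; only the intermediate conversion is avoided.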

@@ -138,6 +138,18 @@ def test_corr_invalid_method(self):
with tm.assert_raises_regex(ValueError, msg):
df.corr(method="____")

def test_corr_tril(self):
Member

For these tests, it's better to build an expected DataFrame output and compare it to a resulting DataFrame. You can build a DataFrame that gives a trivial correlation matrix (like a np.ones DataFrame)

result = df.corr(tri=...)
expected = pd.DataFrame(...)
tm.assert_frame_equal(result, expected)

Member Author

@dsaxton dsaxton Sep 26, 2018

I updated the tests according to this idea. We're still using DataFrames inside ones_like, but I figured this is okay since they're small test cases.
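The expected-DataFrame test style suggested above might look like this sketch; since the `tri` argument was never merged, it emulates `df.corr(tri="lower")` with an explicit mask:

```python
import numpy as np
import pandas as pd

def test_corr_tril():
    # Perfectly collinear columns give a trivial all-ones correlation matrix
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})
    corr = df.corr()
    # Stand-in for the proposed df.corr(tri="lower")
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    result = corr.where(mask)
    expected = pd.DataFrame([[np.nan, np.nan], [1.0, np.nan]],
                            index=["a", "b"], columns=["a", "b"])
    pd.testing.assert_frame_equal(result, expected)
```

Comparing against a hand-built `expected` frame is exactly the pattern the reviewer asked for, and it pins down index, columns, and fill value in one assertion.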

Daniel Saxton added 4 commits September 26, 2018 17:54
use numpy array in ones_like
modify tests to specify expected output
fix errors in tests
@@ -6686,6 +6686,12 @@ def corr(self, method='pearson', min_periods=1):
to have a valid result. Currently only available for pearson
and spearman correlation

tri : {'upper', 'lower'} or None, default : None
Whether or not to return the upper / lower triangular
correlation matrix
Contributor

Could you add an explanation for why a user might want to do this? And an example below using upper or lower?

Member Author

Sure, you mean within the docstring itself right?

@TomAugspurger
Contributor

TomAugspurger commented Sep 27, 2018 via email

k=-1)

if tri == 'lower':
return corr_mat.where(mask)
Contributor

Not that speed is probably important, but corr + np.tril(corr * np.nan) is probably faster than indexing.
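The arithmetic trick suggested here works because NaN propagates through addition while np.tril zeroes everything above the diagonal. Note that this particular variant blanks the lower triangle (including the diagonal) and so returns the upper view; a small sketch:

```python
import numpy as np

corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])

# np.tril(corr * np.nan) is NaN on and below the diagonal, 0 above it,
# so adding it to corr blanks the lower triangle in one vectorised step
upper = corr + np.tril(corr * np.nan)
print(upper)
```

To blank the upper triangle instead, the same trick would use np.triu.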

@bashtage
Contributor

2. In your implementation, it appears the fill value would be NaN, as opposed to numpy (triu/tril) where the fill value is 0. I am not sure what the preferred fill value should be, but in general we try to stay consistent with numpy conventions.

NumPy's tril/triu are standard matrix algebra ops. In the use case argued for this API change, the reason is to have simple ops on the correlation matrix to find certain features. Presumably you don't want to double count.

I don't really see why a masked triangular is the right approach to the stated problem. For example, if I wanted to find the variable with the highest correlation, I would be just as happy to have all elements except the diagonal. If I wanted to find the variable with the highest average correlation to the others, I would need the full matrix except the diagonal to make it easy to call sum().

tri is probably the wrong description. mask with choices {lower, upper, diag} seems like it would be more general.

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@bashtage It allows for simple one-liners like corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns to extract the names of single columns from highly correlated pairs as mentioned in #22840. This doesn't work with the full correlation matrix, or the matrix with only the diagonal removed.
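The one-liner can be reproduced against a strictly lower-triangular view. A sketch with made-up data (column b nearly duplicates a, c is independent; the 0.9 threshold is from the comment above):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
base = rng.standard_normal(50)
df = pd.DataFrame({
    "a": base,
    "b": base + 0.01 * rng.standard_normal(50),  # nearly duplicates "a"
    "c": rng.standard_normal(50),                # independent noise
})

corr = df.corr()
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)  # strictly lower
corr_mat = corr.where(mask)

# One column per highly correlated pair survives the dropna
to_drop = corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns
print(list(to_drop))
```

Because only the (b, a) cell of the lower triangle exceeds 0.9, a single column is flagged, which is the "one per correlated pair" behaviour dsaxton describes.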

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@TomAugspurger Regarding the docstring example, should it be something as in the most recent commit, or perhaps just the full lower triangular form itself?

@bashtage
Contributor

corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns is a strange operation since it is order dependent while a correlation matrix is not. In particular, this method favors dropping early columns, and so the utility of the method depends on users knowing that order matters. I would say this is why masking the diagonal is a more natural method.

np.random.seed(1)
x = np.random.standard_normal((20,5))+3*np.random.standard_normal((20,1))
df = pd.DataFrame(x,columns = ['c{0}'.format(i) for i in range(5)])
corr_mat = df.corr()
mask = ~np.triu(np.ones((5,5), dtype=bool))
c2 = corr_mat.where(mask)
c2
Out[40]: 
          c0        c1        c2        c3  c4
c0       NaN       NaN       NaN       NaN NaN
c1  0.828210       NaN       NaN       NaN NaN
c2  0.852960  0.938965       NaN       NaN NaN
c3  0.831917  0.919567  0.921551       NaN NaN
c4  0.917205  0.899102  0.916828  0.881752 NaN
c2.where(c2.abs() > 0.9).dropna(how="all", axis=1).columns 

shows

Out[41]: Index(['c0', 'c1', 'c2'], dtype='object')

But c4 has a correlation > 0.9, as does c3.

Masking the diagonal

mask = ~np.eye(5, dtype=bool)
c3 = corr_mat.where(mask)
c3.where(c3.abs() > 0.9).dropna(how="all", axis=1).columns 

returns the list of variables that have any correlation > 0.9 without inducing an order dependence

Out[45]: Index(['c0', 'c1', 'c2', 'c3', 'c4'], dtype='object')

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@bashtage The reason for corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns is to (arbitrarily, so we don't care if the choice is affected by the order of columns) select one column from each correlated pair (we don't want to select every column).

In your example we would select all but c3 and c4 because these do not have a correlation of at least 0.9 with each other. The point is to select a minimal set of columns we'd need to remove to get rid of the high correlations.

@bashtage
Contributor

Here is a better example of why this mask can be very misleading. Suppose the correlation matrix is

1.000 0.880 0.880 0.914
0.880 1.000 0.880 0.914
0.880 0.880 1.000 0.914
0.914 0.914 0.914 1.000

Looking at a triangular view (lower) would delete all but the last.

Pivoting this column to be in the middle makes the selection using either triangular method even more problematic.

My main point is that this seems to be a non-general mask that is not even well suited to removing variables that are highly correlated. One algorithm used in some LASSO estimators does the following: find the pair with the highest correlation, and then remove the element that has the highest correlation with the remaining variables -- this isn't easy to implement with a lower triangular masked view.
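The greedy procedure described in the last paragraph (find the most correlated pair, drop the member more correlated with the rest, repeat) might look like the following hypothetical helper; note that it needs the full off-diagonal matrix, which is the stated objection to a triangular masked view:

```python
import numpy as np
import pandas as pd

def greedy_drop(corr, threshold=0.9):
    """Iteratively drop one column from the most correlated pair.

    Hypothetical sketch of the LASSO-style procedure described above,
    not code from the PR; threshold is an illustrative choice.
    """
    vals = corr.abs().to_numpy().copy()
    np.fill_diagonal(vals, 0.0)              # ignore self-correlation
    corr = pd.DataFrame(vals, index=corr.index, columns=corr.columns)
    dropped = []
    while len(corr) > 1 and corr.to_numpy().max() > threshold:
        i, j = np.unravel_index(np.argmax(corr.to_numpy()), corr.shape)
        a, b = corr.index[i], corr.columns[j]
        # Drop whichever member of the pair is more correlated with the rest
        loser = b if corr.loc[b].sum() >= corr.loc[a].sum() else a
        dropped.append(loser)
        corr = corr.drop(index=loser, columns=loser)
    return dropped

# bashtage's example matrix: only the fourth variable exceeds 0.9
# with every other column, so the greedy pass removes just that one
corr = pd.DataFrame(
    [[1.000, 0.880, 0.880, 0.914],
     [0.880, 1.000, 0.880, 0.914],
     [0.880, 0.880, 1.000, 0.914],
     [0.914, 0.914, 0.914, 1.000]],
    index=list("abcd"), columns=list("abcd"))
print(greedy_drop(corr))   # drops only "d"
```

The key step is comparing row sums of the full matrix, which a triangular view cannot supply without reconstructing the missing half.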

@jreback
Contributor

jreback commented Sep 28, 2018

I am not really in love with this option. I would prefer this simply be a cookbook example.

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@jreback Ok, I will close the PR and do that instead.

@dsaxton dsaxton closed this Sep 28, 2018
@dsaxton dsaxton deleted the corr-tri branch September 28, 2018 12:33
Development

Successfully merging this pull request may close these issues.

ENH: Allow option to return lower triangular correlation matrix in DataFrame.corr
7 participants