Adds triangular option to DataFrame.corr, closes #22840 #22842


Closed
wants to merge 9 commits into from

Conversation

dsaxton
Member

@dsaxton dsaxton commented Sep 26, 2018

Adds the tri argument to DataFrame.corr.

@pep8speaks

Hello @dsaxton! Thanks for submitting the PR.

tri : boolean, default : False
Whether or not to return the lower triangular correlation
matrix

Member

Can you add a versionadded directive here?

@@ -138,6 +138,12 @@ def test_corr_invalid_method(self):
with tm.assert_raises_regex(ValueError, msg):
df.corr(method="____")

def test_corr_tri(self):
# GH PR
Member

This should reference this issue number (22840)

@mroeschke
Member

Some general comments:

  1. If we were to include a tri argument, I would prefer a tri=None(default)/'upper'/'lower' instead of tri=True/False for just the lower triangular so users can have the option of the upper triangular.
  2. In your implementation, it appears the fill value would be NaN, as opposed to numpy (triu/tril) where the fill value is 0. I am not sure what the preferred fill value should be, but in general we try to stay consistent with numpy conventions.
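The fill-value difference in point 2 can be seen side by side. A minimal sketch on a toy all-ones matrix (not the PR's code):

```python
import numpy as np
import pandas as pd

corr = pd.DataFrame(np.ones((3, 3)), index=list("abc"), columns=list("abc"))

# numpy's tril fills the masked (upper) entries with 0
print(np.tril(corr.values))

# the where-based approach in the PR fills them with NaN instead
mask = np.tril(np.ones(corr.shape, dtype=bool))
print(corr.where(mask))
```

Both keep the lower triangle (including the diagonal) intact; they differ only in what the discarded entries become.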

@dsaxton
Member Author

dsaxton commented Sep 26, 2018

@mroeschke Good points, I can change the argument to "upper", "lower" or None rather than True or False. It's interesting that numpy uses zeros for omitted entries, would be curious to hear how others feel.
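The revised API being discussed might look roughly like the following standalone helper. `corr_tri` is a hypothetical name, and the `k=-1`/`k=1` offsets (which exclude the diagonal) are inferred from the PR's diff hunks:

```python
import numpy as np
import pandas as pd

def corr_tri(df, tri=None):
    """Sketch of the proposed tri=None/'upper'/'lower' API.

    Hypothetical helper, not the actual pandas implementation.
    """
    correl = df.corr()
    if tri is None:
        return correl
    if tri not in ("upper", "lower"):
        raise ValueError("tri must be 'upper', 'lower' or None")
    if tri == "lower":
        # k=-1 keeps only entries strictly below the diagonal
        mask = np.tril(np.ones(correl.shape, dtype=bool), k=-1)
    else:
        # k=1 keeps only entries strictly above the diagonal
        mask = np.triu(np.ones(correl.shape, dtype=bool), k=1)
    return correl.where(mask)
```

With `tri=None` it degrades to a plain `df.corr()`, which matches the default-preserving behaviour mroeschke asked for.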

change tri options from boolean to string
modify tests
add versionadded tag
@codecov

codecov bot commented Sep 26, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@a03d953). Click here to learn what that means.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #22842   +/-   ##
=========================================
  Coverage          ?   90.61%           
=========================================
  Files             ?      169           
  Lines             ?    50835           
  Branches          ?        0           
=========================================
  Hits              ?    46062           
  Misses            ?     4773           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.61% <90%> (?)
Impacted Files Coverage Δ
pandas/core/frame.py 97.16% <90%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a03d953...9df2bcb. Read the comment docs.

@dsaxton dsaxton changed the title Adds lower triangular option to DataFrame.corr, closes #22840 Adds triangular option to DataFrame.corr, closes #22840 Sep 26, 2018
update whatsnew with argument type change
corr_mat = self._constructor(correl, index=idx, columns=cols)

if tri is not None:
mask = np.tril(np.ones_like(corr_mat,
Member

Is it possible to use correl instead of corr_mat here to avoid the conversion hit of DataFrame to numpy array when np.ones_like is called? (I think correl should be a numpy array here)

Member Author

Yes, nice catch. correl should already be a numpy array.
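The point of the review comment is that building the mask from the ndarray skips a round trip from DataFrame back to array. A small sketch, with a stand-in array for `correl`:

```python
import numpy as np
import pandas as pd

values = np.array([[1.0, 0.5], [0.5, 1.0]])   # stand-in for correl
corr_mat = pd.DataFrame(values, index=["a", "b"], columns=["a", "b"])

# ones_like on the ndarray avoids converting corr_mat back to an array
mask = np.tril(np.ones_like(values, dtype=bool), k=-1)
print(corr_mat.where(mask))
```

The resulting mask is identical either way; only the intermediate conversion is avoided.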

@@ -138,6 +138,18 @@ def test_corr_invalid_method(self):
with tm.assert_raises_regex(ValueError, msg):
df.corr(method="____")

def test_corr_tril(self):
Member

For these tests, it's better to build an expected DataFrame output and compare it to a resulting DataFrame. You can build a DataFrame that gives a trivial correlation matrix (like a np.ones DataFrame)

result = df.corr(tri=...)
expected = pd.DataFrame(...)
tm.assert_frame_equal(result, expected)

Member Author

@dsaxton dsaxton Sep 26, 2018

I updated the tests according to this idea. We're still using DataFrames inside ones_like, but I figured this is okay since they're small test cases.
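The expected-DataFrame test style suggested above might look like this sketch; since the `tri` argument was never merged, it emulates `df.corr(tri="lower")` with an explicit mask:

```python
import numpy as np
import pandas as pd

def test_corr_tril():
    # Perfectly collinear columns give a trivial all-ones correlation matrix
    df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})
    corr = df.corr()
    # Stand-in for the proposed df.corr(tri="lower")
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    result = corr.where(mask)
    expected = pd.DataFrame([[np.nan, np.nan], [1.0, np.nan]],
                            index=["a", "b"], columns=["a", "b"])
    pd.testing.assert_frame_equal(result, expected)
```

Comparing against a hand-built `expected` frame is exactly the pattern the reviewer asked for, and it pins down index, columns, and fill value in one assertion.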

Daniel Saxton added 4 commits September 26, 2018 17:54
use numpy array in ones_like
modify tests to specify expected output
fix errors in tests
@@ -6686,6 +6686,12 @@ def corr(self, method='pearson', min_periods=1):
to have a valid result. Currently only available for pearson
and spearman correlation

tri : {'upper', 'lower'} or None, default : None
Whether or not to return the upper / lower triangular
correlation matrix
Contributor

Could you add an explanation for why a user might want to do this? And an example below using upper or lower?

Member Author

Sure, you mean within the docstring itself right?

@TomAugspurger
Contributor

TomAugspurger commented Sep 27, 2018 via email

k=-1)

if tri == 'lower':
return corr_mat.where(mask)
Contributor

Not that speed is probably important, but corr + np.tril(corr * np.nan) is probably faster than indexing.
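The arithmetic trick suggested here works because NaN propagates through addition while np.tril zeroes everything above the diagonal. Note that this particular variant blanks the lower triangle (including the diagonal) and so returns the upper view; a small sketch:

```python
import numpy as np

corr = np.array([[1.0, 0.8],
                 [0.8, 1.0]])

# np.tril(corr * np.nan) is NaN on and below the diagonal, 0 above it,
# so adding it to corr blanks the lower triangle in one vectorised step
upper = corr + np.tril(corr * np.nan)
print(upper)
```

To blank the upper triangle instead, the same trick would use np.triu.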

@bashtage
Contributor

2. In your implementation, it appears the fill value would be NaN, as opposed to numpy (triu/tril) where the fill value is 0. I am not sure what the preferred fill value should be, but in general we try to stay consistent with numpy conventions.

NumPy's tril/triu are standard matrix algebra ops. In the use case argued for this API change, the reason is to have simple ops on the correlation matrix to find certain features. Presumably you don't want to double count.

I don't really see why a masked triangular is the right approach to the stated problem. For example, if I wanted to find the variable with the highest correlation, I would be just as happy to have all elements except the diagonal. If I wanted to find the variable with the highest average correlation to the others, I would need the full matrix except the diagonal to make it easy to call sum().

tri is probably the wrong description. mask with choices {lower, upper, diag} seems like it would be more general.

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@bashtage It allows for simple one-liners like corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns to extract the names of single columns from highly correlated pairs as mentioned in #22840. This doesn't work with the full correlation matrix, or the matrix with only the diagonal removed.
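The one-liner can be reproduced against a strictly lower-triangular view. A sketch with made-up data (column b nearly duplicates a, c is independent; the 0.9 threshold is from the comment above):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
base = rng.standard_normal(50)
df = pd.DataFrame({
    "a": base,
    "b": base + 0.01 * rng.standard_normal(50),  # nearly duplicates "a"
    "c": rng.standard_normal(50),                # independent noise
})

corr = df.corr()
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)  # strictly lower
corr_mat = corr.where(mask)

# One column per highly correlated pair survives the dropna
to_drop = corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns
print(list(to_drop))
```

Because only the (b, a) cell of the lower triangle exceeds 0.9, a single column is flagged, which is the "one per correlated pair" behaviour dsaxton describes.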

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@TomAugspurger Regarding the docstring example, should it be something as in the most recent commit, or perhaps just the full lower triangular form itself?

@bashtage
Contributor

corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns is a strange operation since it is order dependent while a correlation matrix is not. In particular, this method favors dropping early columns, and so the utility of the method depends on users knowing that order matters. I would say this is why masking the diagonal is a more natural method.

np.random.seed(1)
x = np.random.standard_normal((20,5))+3*np.random.standard_normal((20,1))
df = pd.DataFrame(x,columns = ['c{0}'.format(i) for i in range(5)])
corr_mat = df.corr()
mask = ~np.triu(np.ones((5,5), dtype=bool))
c2 = corr_mat.where(mask)
c2
Out[40]: 
          c0        c1        c2        c3  c4
c0       NaN       NaN       NaN       NaN NaN
c1  0.828210       NaN       NaN       NaN NaN
c2  0.852960  0.938965       NaN       NaN NaN
c3  0.831917  0.919567  0.921551       NaN NaN
c4  0.917205  0.899102  0.916828  0.881752 NaN
c2.where(c2.abs() > 0.9).dropna(how="all", axis=1).columns 

shows

Out[41]: Index(['c0', 'c1', 'c2'], dtype='object')

But c4 has a correlation > 0.9, as does c3.

Masking the diagonal

mask = ~np.eye(5, dtype=bool)
c3 = corr_mat.where(mask)
c3.where(c3.abs() > 0.9).dropna(how="all", axis=1).columns 

returns the list of variables that have any correlation > 0.9 without inducing an order dependence

Out[45]: Index(['c0', 'c1', 'c2', 'c3', 'c4'], dtype='object')

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@bashtage The reason for corr_mat.where(corr_mat.abs() > 0.9).dropna(how="all", axis=1).columns is to (arbitrarily, so we don't care if the choice is affected by the order of columns) select one column from each correlated pair (we don't want to select every column).

In your example we would select all but c3 and c4 because these do not have a correlation of at least 0.9 with each other. The point is to select a minimal set of columns we'd need to remove to get rid of the high correlations.

@bashtage
Contributor

Here is a better example of why this mask can be very misleading. Suppose the correlation matrix is

1.000 0.880 0.880 0.914
0.880 1.000 0.880 0.914
0.880 0.880 1.000 0.914
0.914 0.914 0.914 1.000

Looking at a triangular view (lower) would delete all but the last.

Pivoting this column to be in the middle makes the selection using either triangular method even more problematic.

My main point is that this seems to be a non-general mask that is not even well suited to removing variables that are highly correlated. One algorithm used in some LASSO estimators does the following: find the pair with the highest correlation, and then remove the element that has the highest correlation with the remaining variables -- this isn't easy to implement with a lower triangular masked view.
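The greedy procedure described in the last paragraph (find the most correlated pair, drop the member more correlated with the rest, repeat) might look like the following hypothetical helper; note that it needs the full off-diagonal matrix, which is the stated objection to a triangular masked view:

```python
import numpy as np
import pandas as pd

def greedy_drop(corr, threshold=0.9):
    """Iteratively drop one column from the most correlated pair.

    Hypothetical sketch of the LASSO-style procedure described above,
    not code from the PR; threshold is an illustrative choice.
    """
    vals = corr.abs().to_numpy().copy()
    np.fill_diagonal(vals, 0.0)              # ignore self-correlation
    corr = pd.DataFrame(vals, index=corr.index, columns=corr.columns)
    dropped = []
    while len(corr) > 1 and corr.to_numpy().max() > threshold:
        i, j = np.unravel_index(np.argmax(corr.to_numpy()), corr.shape)
        a, b = corr.index[i], corr.columns[j]
        # Drop whichever member of the pair is more correlated with the rest
        loser = b if corr.loc[b].sum() >= corr.loc[a].sum() else a
        dropped.append(loser)
        corr = corr.drop(index=loser, columns=loser)
    return dropped

# bashtage's example matrix: only the fourth variable exceeds 0.9
# with every other column, so the greedy pass removes just that one
corr = pd.DataFrame(
    [[1.000, 0.880, 0.880, 0.914],
     [0.880, 1.000, 0.880, 0.914],
     [0.880, 0.880, 1.000, 0.914],
     [0.914, 0.914, 0.914, 1.000]],
    index=list("abcd"), columns=list("abcd"))
print(greedy_drop(corr))   # drops only "d"
```

The key step is comparing row sums of the full matrix, which a triangular view cannot supply without reconstructing the missing half.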

@jreback
Contributor

jreback commented Sep 28, 2018

I am not really in love with this option. I would prefer this simply be a cookbook example.

@dsaxton
Member Author

dsaxton commented Sep 28, 2018

@jreback Ok, I will close the PR and do that instead.

@dsaxton dsaxton closed this Sep 28, 2018
@dsaxton dsaxton deleted the corr-tri branch September 28, 2018 12:33
Development

Successfully merging this pull request may close these issues.

ENH: Allow option to return lower triangular correlation matrix in DataFrame.corr
7 participants