
ENH: Allow pairwise calculation when comparing the column with itself … #43569


Closed
wants to merge 2 commits

Conversation

peterpanmj
Contributor

@peterpanmj peterpanmj commented Sep 14, 2021

…(#25781)

@peterpanmj peterpanmj force-pushed the pairwise branch 2 times, most recently from ab9be5e to bb1675f Compare September 16, 2021 15:18
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement labels Sep 18, 2021
@peterpanmj peterpanmj changed the title BUG: Allow pairwise calculation when comparing the column with itself … ENH: Allow pairwise calculation when comparing the column with itself … Sep 27, 2021
@@ -9416,6 +9417,10 @@ def corr(
Minimum number of observations required per pair of columns
to have a valid result. Currently only available for Pearson
and Spearman correlation.
calculate_diagonal : bool, optional
Contributor

this is descriptive, but is there precedent for this type of naming, e.g numpy?

Contributor Author

@peterpanmj peterpanmj Sep 30, 2021

> this is descriptive, but is there precedent for this type of naming, e.g numpy?

No. I came up with this myself. I'm not familiar with NumPy naming conventions. Any suggestions?

Contributor

@jreback jreback left a comment

can you add a whatsnew note in other enhancements for 1.4

@peterpanmj peterpanmj force-pushed the pairwise branch 2 times, most recently from 430afdd to e794c82 Compare October 9, 2021 09:50
@peterpanmj peterpanmj requested a review from jreback October 12, 2021 01:49
@jreback jreback added this to the 1.4 milestone Oct 21, 2021
@jreback
Contributor

jreback commented Oct 21, 2021

@pandas-dev/pandas-core if anyone has comments on this (naming in particular)

@bashtage
Contributor

This feels like a pretty exotic feature. Is there a clear use case for it becoming an option? As far as I can tell, the correlation will be 1 except in edge cases where it is something strange like NaN or Inf. Should this be left to users to adjust after the call to corr if they need this behavior?

@rhshadrach
Member

Why not just always do the calculation? What is the use case for overriding the correct value with 1?

@bashtage
Contributor

What is the correct correlation between a constant series and anything? The most common answer is 0. But this isn't what most software will return because they usually don't specialize the case where one or both values are constant.
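For illustration (this example is mine, not part of the thread), NumPy shows exactly the behavior described: the Pearson correlation of a constant series with anything comes out as NaN, not 0, because nothing special-cases the zero standard deviation:

```python
import warnings

import numpy as np

# The Pearson formula divides by the standard deviations; for a constant
# series that is a division by zero, and NumPy (like most software) does
# not special-case it, so the result is NaN rather than 0.
const = np.array([1.0, 1.0, 1.0, 1.0])
other = np.array([1.0, 2.0, 3.0, 4.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the divide-by-zero RuntimeWarning
    r = np.corrcoef(const, other)[0, 1]

print(np.isnan(r))  # True
```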

@jbrockmendel
Member

I agree with @bashtage here. If a situation is weird enough that a user wants to pass a kwarg for this, it is just as easy to patch the result directly; no need to bloat the API.

> Why not just always do the calculation? What is the use case for overriding the correct value with 1?

First thing that comes to mind is avoiding floating point error.
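A small sketch (my own, assuming ordinary float64 arithmetic) of that rounding concern: computing a column's correlation with itself the same way an off-diagonal entry is computed gives 1.0 only up to rounding in the last bits, which is why hard-coding 1 on the diagonal is attractive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# Pearson r of a column with itself, evaluated the way an off-diagonal
# entry would be. In exact arithmetic this is 1.0; in floating point each
# operation can round, so the result is only guaranteed to be 1.0 to
# within a few ulps.
xm = x - x.mean()
r = float(xm @ xm) / (np.sqrt(xm @ xm) * np.sqrt(xm @ xm))
print(abs(r - 1.0) < 1e-12)  # True (but r need not be exactly 1.0)
```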

@Dr-Irv
Contributor

Dr-Irv commented Oct 21, 2021

What corr() provides is computation of the correlation matrix of a DataFrame, where each pair of columns (Series) of the DataFrame is then used to compute a correlation value. For our defined correlation functions (‘pearson’, ‘kendall’, ‘spearman’), the correlation of two identical columns is defined as 1. We also allow user-specified correlation functions. In any case, if you have a DataFrame with m rows and n columns, the result is a DataFrame with n rows and n columns. In our docs, we say that for a user-supplied correlation function, the diagonal entries will be 1.

In the original issue, the user wanted to override the default value of 1. It seems to me there are a few ways of looking at this.

  1. Users want to use a "correlation" function that returns diagonal values not equal to 1. One option is that the diagonal values should all be the same, in which case a keyword argument of diagonal: float = 1.0 would make sense, and we stuff the same value on the diagonal.
  2. Users want to use a "correlation" function that returns diagonal values not equal to 1. A second option is that the diagonal values should be different, in which case a keyword argument of compute_diagonal: bool = False would make sense, and we let the supplied correlation function compute the diagonal (as in this PR).
  3. One could argue that the only proper definition of a correlation function is that the diagonal values are 1.0. Two identical Series are 100% correlated. So our default value of 1.0 for a user-supplied function makes sense. In that case, we let users override the values via something like:
    result = df.corr(lambda x, y: myfunc(x, y))
    for i in range(len(df.columns)):
        result.iloc[i, i] = myfunc(df.iloc[:, i], df.iloc[:, i])

The latter seems a bit awkward, so I would disagree with @jbrockmendel where he says " it is just as easy to patch the result directly"

I think it is reasonable to ask @fabianrost84 (OP on the original issue) and @peterpanmj whether:

  1. Do they want to supply one value for all the diagonals, or have different values for each pair?
  2. Why is the workaround listed above using a loop not sufficient?
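For concreteness, option 3's patch-the-diagonal workaround runs as follows (myfunc here is a hypothetical stand-in for a user-supplied function — a plain dot product, chosen only because its diagonal is visibly not 1):

```python
import numpy as np
import pandas as pd

# Hypothetical user-supplied "correlation": a dot product, standing in for
# the thread's myfunc, whose diagonal values are not 1.
def myfunc(x, y):
    return float(np.dot(x, y))

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# pandas fills the diagonal with 1 for callable methods, so patch it
# afterwards, as sketched in option 3 above.
result = df.corr(method=myfunc)
for i in range(len(df.columns)):
    result.iloc[i, i] = myfunc(df.iloc[:, i], df.iloc[:, i])
# result now has 32.0 off-diagonal and 14.0 / 77.0 on the diagonal.
```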

@attack68
Contributor

I think, as @Dr-Irv mentions, corr() is expected to produce a correlation matrix, and for all intents and purposes avoiding the floating-point rounding issues on the unit diagonal that @jbrockmendel highlights is going to be useful (and more efficient) for almost all users of this method, so I wouldn't bother with the API bloat.

In the rare case a user doesn't want a unit diagonal and wants to compute something else, I would almost interpret this as a more general user-defined distance function on their vector space. I.e. a df.norm(callable) method might be a more general place for this computation, where df.norm(pearson) would equate to the more specific df.corr(), excluding the rounding problems, for example. But perhaps that would also be a rarely used method.

@jreback
Contributor

jreback commented Oct 28, 2021

@peterpanmj ok i guess should just close this and instead update the doc-string of .corr with an example of how to change the diagonal values.

@jreback jreback removed this from the 1.4 milestone Oct 28, 2021
@peterpanmj
Contributor Author

peterpanmj commented Oct 29, 2021

> @peterpanmj ok i guess should just close this and instead update the doc-string of .corr with an example of how to change the diagonal values.

I am ok with that. @jreback Should I raise a new issue about updating the doc-string?

@peterpanmj peterpanmj closed this Oct 29, 2021
@peterpanmj
Contributor Author

I want to point out that this is not a rarely met use case. People are using df.corr with a callable for purposes other than just correlation. In my case, I want to calculate a pairwise distance matrix for all rows in a dataframe (which is quite common). Now, I can almost achieve this by using

   df.T.corr(distance.jaccard)

I know df.corr is not designed for this. Perhaps it is better to create a new method for this kind of job?
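As an aside (my own note, not part of the thread, and assuming SciPy is available): the row-wise distance-matrix use case is covered directly by scipy.spatial.distance, without pressing corr into service or fighting its 1-filled diagonal:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Boolean data: each row is an observation whose pairwise distances we want.
df = pd.DataFrame(
    [[1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=bool,
)

# Pairwise Jaccard distance between all rows. Unlike df.T.corr(callable),
# squareform(pdist(...)) puts the true self-distance (0.0) on the diagonal
# instead of filling it with 1.
dist = pd.DataFrame(
    squareform(pdist(df.values, metric="jaccard")),
    index=df.index,
    columns=df.index,
)
```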

@bashtage
Contributor

@peterpanmj It sounds like you are looking for an "N"-way apply. Basically something like df.apply(func, cols=["a","b","c","d","e"], group_size=3) which would apply func to all distinct sets of group_size columns in the list cols and would then return a MultiIndex DataFrame. In the case of group_size=2, a simple unstack should get you a square frame.

@attack68
Contributor

> @peterpanmj It sounds like you are looking for an "N"-way apply. Basically something like df.apply(func, cols=["a","b","c","d","e"], group_size=3) which would apply func to all distinct sets of group_size columns in the list cols and would then return a MultiIndex DataFrame. In the case of group_size=2, a simple unstack should get you a square frame.

Yes, this is exactly the more general description I was trying to get at. You made it sound like this existed already, but I think this is a great idea. +1
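The hypothetical N-way apply can be sketched in user space today (nway_apply, its signature, and the example function are my invention, mirroring the suggestion above; no such pandas API exists):

```python
from itertools import combinations

import pandas as pd

# Sketch of the "N-way apply" idea: apply func to every distinct set of
# group_size columns from cols and return a MultiIndex Series.
def nway_apply(df, func, cols, group_size=2):
    keys, values = [], []
    for combo in combinations(cols, group_size):
        keys.append(combo)
        values.append(func(*(df[c] for c in combo)))
    return pd.Series(values, index=pd.MultiIndex.from_tuples(keys))

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
out = nway_apply(df, lambda x, y: float((x * y).sum()), ["a", "b", "c"])
# One entry per unordered column pair: (a,b), (a,c), (b,c).
# With group_size=2, out.unstack() yields a (triangular) square frame.
```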

Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement
Development

Successfully merging this pull request may close these issues.

Using corr with callable gives 1 on diagonals where the result should be NaN
7 participants