
ENH: Allow pairwise calculation when comparing the column with itself … #43569


Closed
wants to merge 2 commits

Conversation

peterpanmj
Contributor

@peterpanmj peterpanmj commented Sep 14, 2021

…(#25781)

@peterpanmj peterpanmj force-pushed the pairwise branch 2 times, most recently from ab9be5e to bb1675f Compare September 16, 2021 15:18
@jreback jreback added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement labels Sep 18, 2021
@peterpanmj peterpanmj changed the title BUG: Allow pairwise calculation when comparing the column with itself … ENH: Allow pairwise calculation when comparing the column with itself … Sep 27, 2021
@@ -9416,6 +9417,10 @@ def corr(
Minimum number of observations required per pair of columns
to have a valid result. Currently only available for Pearson
and Spearman correlation.
calculate_diagonal : bool, optional
Contributor

this is descriptive, but is there precedent for this type of naming, e.g numpy?

Contributor Author

@peterpanmj peterpanmj Sep 30, 2021

> this is descriptive, but is there precedent for this type of naming, e.g numpy?

No. I came up with this myself. I'm not familiar with NumPy naming conventions. Any suggestions?

Contributor

@jreback jreback left a comment

can you add a whatsnew note in other enhancements for 1.4

@peterpanmj peterpanmj force-pushed the pairwise branch 2 times, most recently from 430afdd to e794c82 Compare October 9, 2021 09:50
@peterpanmj peterpanmj requested a review from jreback October 12, 2021 01:49
@jreback jreback added this to the 1.4 milestone Oct 21, 2021
@jreback
Contributor

jreback commented Oct 21, 2021

@pandas-dev/pandas-core if anyone has comments on this (naming in particular)

@bashtage
Contributor

This feels like a pretty exotic feature. Is there a clear use case for it becoming an option? As far as I can tell, the correlation will be 1 except in edge cases where it is something strange like NaN or Inf. Should this be left to users to adjust after the call to corr if they need this behavior?

@rhshadrach
Member

Why not just always do the calculation? What is the use case for overriding the correct value with 1?

@bashtage
Contributor

What is the correct correlation between a constant series and anything? The most common answer is 0. But this isn't what most software will return because they usually don't specialize the case where one or both values are constant.
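For illustration (this example is mine, not part of the thread), NumPy shows exactly the behavior described: the Pearson correlation of a constant series with anything comes out as NaN, not 0, because nothing special-cases the zero standard deviation:

```python
import warnings

import numpy as np

# The Pearson formula divides by the standard deviations; for a constant
# series that is a division by zero, and NumPy (like most software) does
# not special-case it, so the result is NaN rather than 0.
const = np.array([1.0, 1.0, 1.0, 1.0])
other = np.array([1.0, 2.0, 3.0, 4.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the divide-by-zero RuntimeWarning
    r = np.corrcoef(const, other)[0, 1]

print(np.isnan(r))  # True
```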

@jbrockmendel
Member

I agree with @bashtage here. If a situation is weird enough that a user wants to pass a kwarg for this, it is just as easy to patch the result directly; no need to bloat the API.

> Why not just always do the calculation? What is the use case for overriding the correct value with 1?

First thing that comes to mind is avoiding floating point error.
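A small sketch (my own, assuming ordinary float64 arithmetic) of that rounding concern: computing a column's correlation with itself the same way an off-diagonal entry is computed gives 1.0 only up to rounding in the last bits, which is why hard-coding 1 on the diagonal is attractive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# Pearson r of a column with itself, evaluated the way an off-diagonal
# entry would be. In exact arithmetic this is 1.0; in floating point each
# operation can round, so the result is only guaranteed to be 1.0 to
# within a few ulps.
xm = x - x.mean()
r = float(xm @ xm) / (np.sqrt(xm @ xm) * np.sqrt(xm @ xm))
print(abs(r - 1.0) < 1e-12)  # True (but r need not be exactly 1.0)
```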

@Dr-Irv
Contributor

Dr-Irv commented Oct 21, 2021

What corr() provides is computation of the correlation matrix of a DataFrame, where each pair of columns (Series) of the DataFrame is then used to compute a correlation value. For our defined correlation functions (‘pearson’, ‘kendall’, ‘spearman’), the correlation of two identical columns is defined as 1. We also allow user-specified correlation functions. In any case, if you have a DataFrame with m rows and n columns, the result is a DataFrame with n rows and n columns. In our docs, we say that for a user-supplied correlation function, the diagonal entries will be 1.

In the original issue, the user wanted to override the default value of 1. It seems to me there are a few ways of looking at this.

  1. Users want to use a "correlation" function that returns diagonal values not equal to 1. One option is that the diagonal values should all be the same, in which case a keyword argument of diagonal: float = 1.0 would make sense, and we stuff the same value on the diagonal.
  2. Users want to use a "correlation" function that returns diagonal values not equal to 1. A second option is that the diagonal values should be different, in which case a keyword argument of compute_diagonal: bool = False would make sense, and we let the supplied correlation function compute the diagonal (as in this PR).
  3. One could argue that the only proper definition of a correlation function is that the diagonal values are 1.0. Two identical Series are 100% correlated. So our default value of 1.0 for a user-supplied function makes sense. In that case, we let users override the values via something like:
    result = df.corr(lambda x, y: myfunc(x, y))
    for i in range(len(df.columns)):
        result.iloc[i, i] = myfunc(df.iloc[:, i], df.iloc[:, i])

The latter seems a bit awkward, so I would disagree with @jbrockmendel where he says " it is just as easy to patch the result directly"

I think it is reasonable to ask @fabianrost84 (OP on the original issue) and @peterpanmj whether:

  1. Do they want to supply one value for all the diagonals, or have different values for each pair?
  2. Why is the workaround listed above using a loop not sufficient?
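For concreteness, option 3's patch-the-diagonal workaround runs as follows (myfunc here is a hypothetical stand-in for a user-supplied function — a plain dot product, chosen only because its diagonal is visibly not 1):

```python
import numpy as np
import pandas as pd

# Hypothetical user-supplied "correlation": a dot product, standing in for
# the thread's myfunc, whose diagonal values are not 1.
def myfunc(x, y):
    return float(np.dot(x, y))

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# pandas fills the diagonal with 1 for callable methods, so patch it
# afterwards, as sketched in option 3 above.
result = df.corr(method=myfunc)
for i in range(len(df.columns)):
    result.iloc[i, i] = myfunc(df.iloc[:, i], df.iloc[:, i])
# result now has 32.0 off-diagonal and 14.0 / 77.0 on the diagonal.
```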

@attack68
Contributor

I think, as @Dr-Irv mentions, corr() is expected to produce a correlation matrix, and for all intents and purposes avoiding the floating-point rounding issues on the unit diagonal that @jbrockmendel highlights is going to be useful (and more efficient) for almost all users of this method, so I wouldn't bother with the API bloat.

In the rare case a user doesn't want a unit diagonal and wants to compute something else, I would almost interpret this as a more general user-defined distance function on their vector space. I.e. a df.norm(callable) method might be a more general place for this computation, where df.norm(pearson) would equate to the more specific df.corr(), excluding the rounding problems, for example. But perhaps that would also be a rarely used method.

@jreback
Contributor

jreback commented Oct 28, 2021

@peterpanmj ok i guess should just close this and instead update the doc-string of .corr with an example of how to change the diagonal values.

@jreback jreback removed this from the 1.4 milestone Oct 28, 2021
@peterpanmj
Contributor Author

peterpanmj commented Oct 29, 2021

> @peterpanmj ok i guess should just close this and instead update the doc-string of .corr with an example of how to change the diagonal values.

I am ok with that. @jreback Should I raise a new issue about updating the doc-string?

@peterpanmj peterpanmj closed this Oct 29, 2021
@peterpanmj
Contributor Author

I want to point out that this is not a rarely met use case. People are using df.corr with a callable for purposes other than just correlation. In my case, I want to calculate a pairwise distance matrix for all rows in a dataframe (which is quite common). Now, I can almost achieve this by using

   df.T.corr(distance.jaccard)

I know df.corr is not designed for this. Perhaps it is better to create a new method for this kind of job?
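As an aside (my own note, not part of the thread, and assuming SciPy is available): the row-wise distance-matrix use case is covered directly by scipy.spatial.distance, without pressing corr into service or fighting its 1-filled diagonal:

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Boolean data: each row is an observation whose pairwise distances we want.
df = pd.DataFrame(
    [[1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 0, 1, 0]],
    dtype=bool,
)

# Pairwise Jaccard distance between all rows. Unlike df.T.corr(callable),
# squareform(pdist(...)) puts the true self-distance (0.0) on the diagonal
# instead of filling it with 1.
dist = pd.DataFrame(
    squareform(pdist(df.values, metric="jaccard")),
    index=df.index,
    columns=df.index,
)
```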

@bashtage
Contributor

@peterpanmj It sounds like you are looking for an "N"-way apply. Basically something like df.apply(func, cols=["a","b","c","d","e"], group_size=3) which would apply func to all distinct sets of group_size columns in the list cols and would then return a MultiIndex DataFrame. In the case of group_size=2, a simple unstack should get you a square frame.

@attack68
Contributor

> @peterpanmj It sounds like you are looking for an "N"-way apply. Basically something like df.apply(func, cols=["a","b","c","d","e"], group_size=3) which would apply func to all distinct sets of group_size columns in the list cols and would then return a MultiIndex DataFrame. In the case of group_size=2, a simple unstack should get you a square frame.

Yes, this is exactly the more general description I was trying to get at. You made it sound like this existed already, but I think this is a great idea. +1
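The hypothetical N-way apply can be sketched in user space today (nway_apply, its signature, and the example function are my invention, mirroring the suggestion above; no such pandas API exists):

```python
from itertools import combinations

import pandas as pd

# Sketch of the "N-way apply" idea: apply func to every distinct set of
# group_size columns from cols and return a MultiIndex Series.
def nway_apply(df, func, cols, group_size=2):
    keys, values = [], []
    for combo in combinations(cols, group_size):
        keys.append(combo)
        values.append(func(*(df[c] for c in combo)))
    return pd.Series(values, index=pd.MultiIndex.from_tuples(keys))

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
out = nway_apply(df, lambda x, y: float((x * y).sum()), ["a", "b", "c"])
# One entry per unordered column pair: (a,b), (a,c), (b,c).
# With group_size=2, out.unstack() yields a (triangular) square frame.
```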

Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement
Development

Successfully merging this pull request may close these issues.

Using corr with callable gives 1 on diagonals where the result should be NaN
7 participants