Skip to content

BUG: Cythonized kendall correlation implementation is incorrect #43401

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 tasks done
zrait opened this issue Sep 4, 2021 · 0 comments · Fixed by #43403
Closed
3 tasks done

BUG: Cythonized kendall correlation implementation is incorrect #43401

zrait opened this issue Sep 4, 2021 · 0 comments · Fixed by #43403
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@zrait
Copy link
Contributor

zrait commented Sep 4, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Problem description

The Cythonized implementation of Kendall correlation implemented here has been completely broken in the case of tied values since its original commit.

Scipy (and most other stats pacakges) implement the b variant of Kendall's tau, not the original version. Rank based correlation algorithms absolutely need to handle ties correctly as ordinal data often includes repeated values in some columns.

Even if we wanted to implement the "a variant" of Kendall's tau (which would be a breaking change from the implementation prior to the Cythonization diff), the correct code would disregard ties entirely, whereas the current implementation is including them in its calculations.

It's non-trivial to implement a Cythonized version of the correct code with tie-handling here, and I think that the Cythonized version should be reverted.

Code Sample, a copy-pastable example

df = pd.DataFrame([[1, 2], [2, 1], [1, 3]])
df[0].corr(df[1],method='kendall') == df.corr(method='kendall')[0][1]
# output
False

Expected Output

True

Change to test case that catches this bug

zrait@94a0eea is a simple change to the correlation test case that ensures that one of the columns used has repeated values and fails with the current implementation but not if the Cythonization is reverted.

More generally, I think the float test data should be updated to include duplicates, as a danger of randomly generated float data is the fact that it won't naturally include any duplicates.

@zrait zrait added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 4, 2021
@simonjayhawkins simonjayhawkins added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 5, 2021
@simonjayhawkins simonjayhawkins added this to the 1.3.3 milestone Sep 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants