BUG: Cythonized kendall correlation implementation is incorrect #43401
Labels
Algos
Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
Bug
Performance
Memory or execution speed performance
Regression
Functionality that used to work in a prior pandas version
Milestone
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Problem description
The Cythonized implementation of Kendall correlation implemented here has been completely broken in the case of tied values since its original commit.
Scipy (and most other stats pacakges) implement the b variant of Kendall's tau, not the original version. Rank based correlation algorithms absolutely need to handle ties correctly as ordinal data often includes repeated values in some columns.
Even if we wanted to implement the "a variant" of Kendall's tau (which would be a breaking change from the implementation prior to the Cythonization diff), the correct code would disregard ties entirely, whereas the current implementation is including them in its calculations.
It's non-trivial to implement a Cythonized version of the correct code with tie-handling here, and I think that the Cythonized version should be reverted.
Code Sample, a copy-pastable example
Expected Output
True
Change to test case that catches this bug
zrait@94a0eea is a simple change to the correlation test case that ensures that one of the columns used has repeated values and fails with the current implementation but not if the Cythonization is reverted.
More generally, I think the float test data should be updated to include duplicates, as a danger of randomly generated float data is the fact that it won't naturally include any duplicates.
The text was updated successfully, but these errors were encountered: