Skip to content

REG: fix regression in df.corrwith on tied data when method is spearman #49032

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,7 @@ Fixed regressions
- Fixed regression in :meth:`DataFrame.apply` when passing non-zero ``axis`` via keyword argument (:issue:`48656`)
- Fixed regression in :meth:`Series.groupby` and :meth:`DataFrame.groupby` when the grouper is a nullable data type (e.g. :class:`Int64`) or a PyArrow-backed string array, contains null values, and ``dropna=False`` (:issue:`48794`)
- Fixed regression in :class:`ExcelWriter` where the ``book`` attribute could no longer be set; however setting this attribute is now deprecated and this ability will be removed in a future version of pandas (:issue:`48780`)
- Fixed regression in :meth:`DataFrame.corrwith` when computing correlation on tied data with ``method="spearman"`` (:issue:`48826`)

.. ---------------------------------------------------------------------------

Expand Down
13 changes: 9 additions & 4 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -10598,18 +10598,23 @@ def corrwith(
cols = self.columns
ndf = self.values.transpose()
k = other.values
k_mask = ~np.isnan(k)
if k.dtype == "bool":
k = k.astype("float")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these casts needed? If needed, can it be converted to int instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bool type won't be accepted by rank_1d's signature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we cast it to int, then nan in bool array will be -2147483648.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rank should accept uint8, so k = k.view(uint8)

What happens with nullable dtypes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeError: No matching signature found

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rank_1d is compiled for uint8_t

Copy link
Contributor Author

@GYHHAHA GYHHAHA Oct 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry. My code was tested on 1.4.3.. I find the signature is changed in 1.5.0. as numeric_object_t instead of iu_64_floating_t. Now it works. Thanks. @phofl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This worked on 1.4.4, so have to fix this too or revert as suggested in the issue

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason we can't use pd.isna?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change to pd.isna now

if method == "pearson":
for i, r in enumerate(ndf):
nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
nonnull_mask = ~np.isnan(r) & k_mask
corrs[cols[i]] = np.corrcoef(r[nonnull_mask], k[nonnull_mask])[
0, 1
]
else:
for i, r in enumerate(ndf):
nonnull_mask = ~np.isnan(r) & ~np.isnan(k)
nonnull_mask = ~np.isnan(r) & k_mask
if r.dtype == "bool":
r = r.astype("float")
corrs[cols[i]] = np.corrcoef(
r[nonnull_mask].argsort().argsort(),
k[nonnull_mask].argsort().argsort(),
libalgos.rank_1d(r[nonnull_mask]),
libalgos.rank_1d(k[nonnull_mask]),
)[0, 1]
return Series(corrs)
else:
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/frame/methods/test_cov_corr.py
Original file line number Diff line number Diff line change
Expand Up @@ -399,6 +399,15 @@ def test_corrwith_spearman(self):
expected = Series(np.ones(len(result)))
tm.assert_series_equal(result, expected)

@td.skip_if_no_scipy
def test_corrwith_spearman_with_tied_data(self):
# GH#48826
df = DataFrame({"A": [2, np.nan, 8, 9], "B": [0, 1, 1, 0]})
s = Series([0, 1, 1, 0])
result = df.corrwith(s, method="spearman")
expected = Series([0.0, 1.0], index=["A", "B"])
tm.assert_series_equal(result, expected)

@td.skip_if_no_scipy
def test_corrwith_kendall(self):
# GH#21925
Expand Down