
BUG: The behavior for DataFrame.corrwith is changed in pandas 1.5.0 ?? #48826

Closed
3 tasks done
itholic opened this issue Sep 28, 2022 · 5 comments · Fixed by #49140
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Regression Functionality that used to work in a prior pandas version

itholic commented Sep 28, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {"A": [1, np.nan, 7, 8], "B": [False, True, True, False], "C": [10, 4, 9, 3]}
)
df2 = df1[["B", "C"]]
(df1 + 1).corrwith(df2.B, method="spearman")

Issue Description

It seems that the behavior of pandas.DataFrame.corrwith changed in pandas 1.5.0:

The result from the reproducible example in pandas 1.4.0:

>>> (df1 + 1).corrwith(df2.B, method="spearman")
A    0.0
B    1.0
C    0.0
dtype: float64

However, the result from the reproducible example in pandas 1.5.0:

>>> (df1 + 1).corrwith(df2.B, method="spearman")
A    0.5
B    1.0
C   -0.2
dtype: float64

I see that #46174 improved the performance, but maybe the behavior changed as well?

Expected Behavior

The expected behavior is the result that earlier pandas versions (such as 1.4.0) produced:

>>> (df1 + 1).corrwith(df2.B, method="spearman")
A    0.0
B    1.0
C    0.0
dtype: float64
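The expected values can be checked without calling `corrwith` at all. The following is a minimal sketch, assuming the usual definition of Spearman correlation (drop incomplete pairs, assign average ranks to ties, then take the Pearson correlation of the ranks). The helper name `spearman_reference` is hypothetical and is not part of pandas.

```python
import numpy as np
import pandas as pd


def spearman_reference(x: pd.Series, y: pd.Series) -> float:
    """Hypothetical helper: Spearman correlation computed from first principles."""
    # Keep only the rows where both values are present (pairwise NaN handling).
    pair = pd.concat(
        [x.astype(float), y.astype(float)], axis=1, keys=["x", "y"]
    ).dropna()
    # Tied values share the average of the ranks they span.
    rank_x = pair["x"].rank(method="average")
    rank_y = pair["y"].rank(method="average")
    # Spearman is the Pearson correlation of the ranks.
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


df1 = pd.DataFrame(
    {"A": [1, np.nan, 7, 8], "B": [False, True, True, False], "C": [10, 4, 9, 3]}
)
left = df1 + 1
reference = pd.Series(
    {col: spearman_reference(left[col], df1["B"]) for col in left.columns}
)
print(reference)  # A and C come out as 0.0 and B as 1.0 (up to float rounding)
```

Under these assumptions the sketch agrees with the pandas 1.4.0 output rather than the 1.5.0 output.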

Installed Versions

INSTALLED VERSIONS

commit : 87cfe4e
python : 3.9.11.final.0
python-bits : 64
OS : Darwin
OS-release : 21.6.0
Version : Darwin Kernel Version 21.6.0: Wed Aug 10 14:25:27 PDT 2022; root:xnu-8020.141.5~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.5.0
numpy : 1.22.3
pytz : 2021.3
dateutil : 2.8.2
setuptools : 61.2.0
pip : 22.2.2
Cython : 0.29.24
pytest : 6.2.5
hypothesis : None
sphinx : 3.0.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.4
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.2
snappy : None
sqlalchemy : 1.4.26
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None

itholic added the Bug and Needs Triage labels Sep 28, 2022
@MarcoGorelli
Member

Hey, thanks for the report

> I see that #46174 improved the performance, but maybe the behavior changed as well?

Yup, git bisect confirms it.

@MarcoGorelli
Member

cc @fractionalhare (no blame, just in case you know how to fix it)

MarcoGorelli added the Regression and Algos labels and removed the Bug and Needs Triage labels Sep 28, 2022
MarcoGorelli added this to the 1.5.1 milestone Sep 28, 2022
@fractionalhare
Contributor

Thanks for flagging. I can look later today after work. This is strange; my guess is that it is an edge case the tests did not cover before. I think it involves the NaN not being ranked by argsort for the Spearman calculation. Will confirm later.
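As an illustration of how argsort-based ranking can diverge from the ranks Spearman correlation normally uses, here is a small sketch. It looks at tie handling, which is related to but not the same as the NaN hypothesis above, and the thread does not confirm either as the root cause. The helper names `ordinal_ranks` and `average_ranks` are hypothetical.

```python
import numpy as np
import pandas as pd

# Column C of (df1 + 1) and column B from the reproducible example.
c = np.array([11, 5, 10, 4])
b = np.array([False, True, True, False])


def ordinal_ranks(values: np.ndarray) -> np.ndarray:
    """Rank by double argsort: ties get distinct ranks in order of appearance."""
    order = np.argsort(values, kind="stable")
    return np.argsort(order, kind="stable")


def average_ranks(values: np.ndarray) -> np.ndarray:
    """Rank with ties sharing the average rank, as Spearman usually assumes."""
    return pd.Series(np.asarray(values, dtype=float)).rank(method="average").to_numpy()


ordinal = np.corrcoef(ordinal_ranks(c), ordinal_ranks(b))[0, 1]
averaged = np.corrcoef(average_ranks(c), average_ranks(b))[0, 1]
print(ordinal)   # approximately -0.2
print(averaged)  # 0.0
```

With ordinal ranks the result is about -0.2, which happens to match the pandas 1.5.0 output reported for column C, while average ranks give 0.0 as in pandas 1.4.0.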

@phofl
Member

phofl commented Sep 29, 2022

Reverting would be an option if we cannot figure out what caused this.

HyukjinKwon pushed a commit to apache/spark that referenced this issue Nov 1, 2022
…fixing in future pandas

### What changes were proposed in this pull request?

This PR proposes to switch the manual tests for `DataFrame.corrwith` back to the formal approach when the pandas version is not 1.5.0.

### Why are the changes needed?

pandas 1.5.0 introduced a regression (pandas-dev/pandas#48826), and it seems to be resolved now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The fixed test should pass the CI.

Closes #38455 from itholic/SPARK-40827.

Authored-by: itholic <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
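The version gate described in this commit message could look roughly like the sketch below. This is not the code from the Spark PR; the helper name `is_affected_pandas` is hypothetical, and the check assumes that only the exact 1.5.0 release string is affected.

```python
import pandas as pd


def is_affected_pandas(version: str) -> bool:
    """Hypothetical check: True only for the pandas 1.5.0 release."""
    # Ignore any local build suffix such as "+g1234abc".
    return version.split("+")[0] == "1.5.0"


if is_affected_pandas(pd.__version__):
    print("pandas 1.5.0 detected: compare corrwith against hand-written expected values")
else:
    print("compare corrwith against pandas directly")
```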
@MarcoGorelli
Member

Hi @itholic

Thanks for reporting this. There's now a release candidate for pandas 2.0.0, which you can install with

mamba install -c conda-forge/label/pandas_rc pandas==2.0.0rc0

or

python -m pip install --upgrade --pre pandas==2.0.0rc0

If you try it out and report any bugs, we can fix them before the final 2.0.0 release.
