BUG: Spearman correlation is broken (dtype mismatch) on 32-bit platforms #43588

Closed
2 of 3 tasks
musicinmybrain opened this issue Sep 15, 2021 · 7 comments · Fixed by #43608
Labels
32bit 32-bit systems Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@musicinmybrain
Contributor

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
d = pd.DataFrame([1.0, 2.0])
d.corr(method='spearman')

Issue Description

Calling the corr method of a DataFrame with method='spearman' produces a ValueError due to a buffer dtype mismatch on 32-bit platforms.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.10/site-packages/pandas/core/frame.py", line 9376, in corr
    correl = libalgos.nancorr_spearman(mat, minp=min_periods)
  File "pandas/_libs/algos.pyx", line 415, in pandas._libs.algos.nancorr_spearman
  File "pandas/_libs/algos.pyx", line 938, in pandas._libs.algos.rank_1d
ValueError: Buffer dtype mismatch, expected 'const intp_t' but got 'long long'

If I have some time, I’ll look into this further and try to offer a PR. The problem was discovered due to a failing test in pingouin (raphaelvallat/pingouin#197).

I have reproduced this on both 32-bit x86 and 32-bit ARM. While my “installed versions” are those currently in Fedora Rawhide, including Pandas 1.3.0, I also built an RPM for Pandas 1.3.3 and reproduced the problem with that too.

Expected Behavior

     0
0  1.0

Installed Versions

INSTALLED VERSIONS

commit : f00ed8f
python : 3.10.0.candidate.2
python-bits : 32
OS : Linux
OS-release : 5.13.14-200.fc34.x86_64
Version : #1 SMP Fri Sep 3 15:33:01 UTC 2021
machine : armv7l
processor : armv7l
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.1
pip : None
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.5.0b1
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@musicinmybrain musicinmybrain added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 15, 2021
@mzeitlin11
Member

Thanks for reporting this @musicinmybrain! There were some PRs targeting rank_1d and nancorr_spearman in 1.3, along with a movement to go from int64 to intp in our code. My best guess for the cause of this regression is #40635, which changed rank_1d to take intp labels, but nancorr_spearman still passed int64 labels, hence the 32-bit failure.

If you have access to a 32-bit machine and some time, would you mind testing on master? I believe this should be fixed because nancorr_spearman no longer passes labels to rank_1d.

Unfortunately, since a bunch of changes have occurred around this code, not sure how to easily backport this fix.

cc @jreback is it ok to make a pr which targets only 1.3.x, but not master? Would be a really small change (I think) relative to 1.3.x branch, just changing one int64 to an intp
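The int64/intp distinction discussed above can be checked from plain NumPy. This is only an illustrative sketch, not the pandas code: the variable names are made up for the example, and it assumes a common platform where `np.intp` has the same width as a pointer. On a 64-bit build the two dtypes coincide, which is why the mismatch only surfaces on 32-bit builds.

```python
import struct

import numpy as np

# Width of a C pointer on this interpreter (32 or 64 on common platforms).
pointer_bits = struct.calcsize("P") * 8

# np.intp is the platform index type; np.int64 is always 64 bits wide.
intp_bits = np.dtype(np.intp).itemsize * 8
int64_bits = np.dtype(np.int64).itemsize * 8
print(f"pointer: {pointer_bits}-bit, intp: {intp_bits}-bit, int64: {int64_bits}-bit")

# Hypothetical labels array created as int64, as in the failing code path.
labels_int64 = np.zeros(2, dtype=np.int64)

# Casting to intp is a no-op on 64-bit builds and a real conversion on 32-bit ones.
labels_intp = labels_int64.astype(np.intp, copy=False)

same_dtype = labels_int64.dtype == labels_intp.dtype
print("int64 and intp are the same dtype here:", same_dtype)
```

On a 32-bit interpreter the last line would print `False`, matching the "expected 'const intp_t' but got 'long long'" message in the traceback.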

@mzeitlin11 mzeitlin11 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 15, 2021
@mzeitlin11 mzeitlin11 added this to the 1.3.4 milestone Sep 15, 2021
@mzeitlin11 mzeitlin11 added the 32bit 32-bit systems label Sep 15, 2021
@jreback
Contributor

jreback commented Sep 15, 2021

does the problem not exist in master?

@musicinmybrain
Contributor Author

Yes, I was about to follow up after spending a little more time reading through the source code and recent changes. I also noticed that the call site where the exception happened has changed since 1.3.3.

I’ll try to build the current master and see if it works as expected.

Meanwhile, I think your analysis of the problem in 1.3.x is correct.

@mzeitlin11
Member

I don't think so, but I can't easily confirm without a 32-bit machine

@musicinmybrain
Contributor Author

Current master (5872bfe) works as expected on 32-bit x86.

@mzeitlin11
Member

Thanks for checking! I'll try to put up a quick pr in next couple days with a regression test for master and the fix which can be backported to 1.3.x
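A regression check along these lines could be as small as the reproducer from the report plus the expected result. The sketch below is an assumption about what such a check might look like, not the test that was actually added to pandas; the function name is hypothetical. It only exercises the bug when run on a 32-bit build.

```python
import pandas as pd


def check_spearman_single_column():
    """Run the reproducer from the report and compare against the expected frame."""
    df = pd.DataFrame([1.0, 2.0])

    # On the affected 32-bit builds this call raised
    # "ValueError: Buffer dtype mismatch" instead of returning a result.
    result = df.corr(method="spearman")

    # A single column is perfectly rank-correlated with itself.
    expected = pd.DataFrame([[1.0]])
    pd.testing.assert_frame_equal(result, expected)
    return result


print(check_spearman_single_column())
```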

@musicinmybrain
Contributor Author

Thanks! I appreciate your efforts.

Once a fix is available, I’ll work with the maintainers of the python-pandas package in Fedora Linux to try to make sure it is part of the upcoming Fedora 35 release. The current Fedora 34 release has pandas 1.2.5, which (I’ve verified) predates the regression.

3 participants