BUG: REGRESSION: DataFrame.corr() floating point inaccuracy #45640

Closed
2 of 3 tasks
tritemio opened this issue Jan 26, 2022 · 4 comments · Fixed by #45646
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Regression Functionality that used to work in a prior pandas version

@tritemio

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 5)))

print(1 - np.diag(df.corr().abs()).min())
# pandas 1.4.0
# -4.440892098500626e-16
# pandas 1.3.5
# 0.0

Issue Description

With pandas 1.4.0, df.corr() returns a matrix whose diagonal is not exactly 1; the entries are off by floating point rounding error.

In pandas 1.3.5 the diagonal of df.corr() was exactly 1.

The example above shows the difference.

This causes problems when using dist = 1 - df.corr().abs() as a distance matrix for clustering. In particular, scipy.spatial.distance.squareform(dist) raises an error with pandas 1.4.0 because the diagonal of dist is not exactly 0.
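To make the failure mode concrete, here is a minimal sketch that does not depend on the pandas version. It assumes a hand-built 3x3 matrix (the off-diagonal values are made up for illustration) whose diagonal carries the rounding error printed in the example above. By default squareform validates its input and rejects a diagonal that is not exactly zero; passing checks=False skips that validation, and only the off-diagonal entries are used for the condensed form.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Rounding error taken from the output reported for pandas 1.4.0 above.
eps = -4.440892098500626e-16

# Hypothetical symmetric "distance" matrix with a slightly non-zero diagonal.
dist = np.array([
    [eps, 0.3, 0.7],
    [0.3, eps, 0.5],
    [0.7, 0.5, eps],
])

# Default behaviour: the input is validated and a non-zero diagonal is rejected.
try:
    squareform(dist)
    raised = False
except ValueError as exc:
    raised = True
    print("squareform rejected the matrix:", exc)

# With checks=False the validation is skipped and the diagonal is ignored.
condensed = squareform(dist, checks=False)
print(condensed)
```

Skipping the checks also disables the symmetry test, so it only makes sense when the matrix is known to be a valid distance matrix apart from rounding error.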

Expected Behavior

The diagonal of df.corr() should be exactly 1, as it was in pandas 1.3.5.

Installed Versions

Replace this line with the output of pd.show_versions()

@tritemio tritemio added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 26, 2022
@tritemio tritemio changed the title BUG: REGRESSION: DataFrame.corr() floating point error BUG: REGRESSION: DataFrame.corr() floating point inaccuracy Jan 26, 2022
@phofl
Member

phofl commented Jan 26, 2022

This was introduced again by #42761. Not sure if we can fix this as it is now; if not, we should revert.

cc @mzeitlin11

@phofl phofl added this to the 1.4.1 milestone Jan 26, 2022
@phofl added the Numeric Operations, Regression, and Algos labels and removed the Needs Triage and Numeric Operations labels Jan 26, 2022
@tritemio
Author

I haven't looked at the code, but would it be possible to skip the computation for the diagonal (by definition the correlation of a vector with itself) and set it to 1 by default? Or overwrite the result and set the diagonal to 1?
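The "overwrite the diagonal" idea can also be applied on the user side. The following is a sketch of such a workaround, not the change made in pandas itself: it copies the correlation matrix into a writable NumPy array, forces the diagonal to 1 with np.fill_diagonal, and rebuilds the distance matrix from it. The names corr and dist are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 5)))

# Work on a writable copy so the DataFrame returned by corr() is not modified.
corr = df.corr().to_numpy(copy=True)

# The correlation of a column with itself is 1 by definition, so set it exactly.
np.fill_diagonal(corr, 1.0)

# Distance matrix with an exactly-zero diagonal, labelled like the original columns.
dist = pd.DataFrame(1 - np.abs(corr), index=df.columns, columns=df.columns)

print(np.diag(dist.to_numpy()))
```

This only repairs the diagonal; any rounding differences in the off-diagonal entries are left as computed.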

@mzeitlin11
Member

Sorry about this. +1 for reverting; I think the precision loss is not limited to the diagonal. I can't think of a way to keep the speedup while avoiding this issue. I will put up a PR later this week that reverts the change and adds a test.

@phofl
Member

phofl commented Jan 26, 2022

Opened a pr

3 participants