BUG: DataFrame.corr assume 1s are always on the diagonal line #43494

Closed
2 of 3 tasks
peterpanmj opened this issue Sep 10, 2021 · 2 comments
Labels
Duplicate Report (Duplicate issue or pull request), Numeric Operations (Arithmetic, Comparison, and Logical operations)

Comments

@peterpanmj
Contributor

peterpanmj commented Sep 10, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd
from scipy.spatial import distance

# Three identical all-ones columns, with a distance function passed as the corr method.
result = pd.DataFrame(np.ones([5, 3])).corr(distance.jaccard)
print(result)

	0	1	2
0	1.0	0.0	0.0
1	0.0	1.0	0.0
2	0.0	0.0	1.0

Issue Description

pd.DataFrame.corr assumes the diagonal should always be filled with 1 (a vector compared with itself). However, for many distance measures, comparing two identical vectors gives 0. I think we should let users specify whether to use 1 as the default value on the diagonal, or let the underlying custom function decide.

Expected Behavior

[[0. 0. 0.]
[0. 0. 0.]
[0. 0. 0.]]
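
Until something like this is supported, one possible workaround is to build the matrix by hand so that the custom function also decides the diagonal. The sketch below is only an illustration: `pairwise` and `jaccard_distance` are hypothetical helper names, not pandas or SciPy APIs, and `jaccard_distance` is a hand-rolled stand-in for `scipy.spatial.distance.jaccard` on 0/1 data. Unlike `DataFrame.corr`, it does no NaN filtering.

```python
import numpy as np
import pandas as pd


def pairwise(df, func):
    """Apply func to every pair of columns, including a column with itself."""
    cols = df.columns
    out = np.empty((len(cols), len(cols)), dtype=float)
    for i, a in enumerate(cols):
        for j, b in enumerate(cols):
            out[i, j] = func(df[a].to_numpy(), df[b].to_numpy())
    return pd.DataFrame(out, index=cols, columns=cols)


def jaccard_distance(u, v):
    """Hypothetical stand-in for scipy.spatial.distance.jaccard (0/1 inputs)."""
    u, v = u.astype(bool), v.astype(bool)
    either = np.logical_or(u, v).sum()
    if either == 0:
        # Assumption: two all-zero vectors are treated as identical.
        return 0.0
    return float(np.logical_xor(u, v).sum() / either)


df = pd.DataFrame(np.ones([5, 3]))
result = pairwise(df, jaccard_distance)
print(result)
```

With the all-ones frame from the example, every entry, including the diagonal, comes from the function itself.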

Installed Versions

INSTALLED VERSIONS
------------------
commit : 5f648bf
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.3.2
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.2
pip : 21.0.1
setuptools : 52.0.0.post20210125
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.0.2
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : 0.9.3
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.26.0
pandas_datareader: 0.9.0
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.2
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

peterpanmj added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Sep 10, 2021
peterpanmj changed the title from "BUG: d.DataFrame.corr assume 1s are always on the diagonal line" to "BUG: DataFrame.corr assume 1s are always on the diagonal line" Sep 10, 2021
@peterpanmj
Contributor Author

peterpanmj commented Sep 10, 2021

In this example, the distance between a column and itself is 0:

print(distance.jaccard(np.ones(5), np.ones(5)))

0.0
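
Since the function already returns the right value for a vector compared with itself, another possible workaround is to keep calling `corr` and then overwrite the diagonal with what the function returns for each column. This is a rough sketch only: `corr_with_diagonal` and `mean_abs_diff` are hypothetical names introduced for illustration, and `mean_abs_diff` is just a simple distance used so the example does not depend on SciPy.

```python
import numpy as np
import pandas as pd


def mean_abs_diff(u, v):
    """Simple illustrative distance: 0 for identical vectors."""
    return float(np.mean(np.abs(u - v)))


def corr_with_diagonal(df, func):
    """Run corr with a callable, then let the callable fill the diagonal."""
    result = df.corr(method=func)
    values = result.to_numpy().copy()
    # Evaluate the function on each column against itself.
    diag = [func(df[c].to_numpy(), df[c].to_numpy()) for c in result.columns]
    np.fill_diagonal(values, diag)
    return pd.DataFrame(values, index=result.index, columns=result.columns)


df = pd.DataFrame({"a": [0.0, 1.0, 2.0], "b": [0.0, 1.0, 2.0], "c": [1.0, 2.0, 3.0]})
result = corr_with_diagonal(df, mean_abs_diff)
print(result)
```

This keeps the NaN handling and symmetry of `corr` for the off-diagonal entries and replaces only the diagonal.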

@mzeitlin11
Member

Thanks for the report @peterpanmj. Going to close in favor of #25781

mzeitlin11 added the Duplicate Report and Numeric Operations labels and removed the Bug and Needs Triage labels Sep 10, 2021
2 participants