Numerical issue with rolling cov and corr #24019

Open
xiaochuanzhao opened this issue Nov 30, 2018 · 3 comments
Labels
Bug cov/corr Reduction Operations sum, mean, min, max, etc. Window rolling, ewma, expanding

Comments

@xiaochuanzhao
Code Sample, a copy-pastable example if possible

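# Compare rolling-window results against the whole-series results
import pandas as pd

a = pd.Series([1e5, 0, 0, 0, 0])
b = pd.Series([9.45] * 5)          # constant series, so its variance is 0
c1 = a.rolling(5).corr(b).iloc[4]  # rolling correlation over the full window
c2 = a.corr(b)                     # plain correlation
v1 = a.rolling(5).cov(b).iloc[4]   # rolling covariance over the full window
v2 = a.cov(b)                      # plain covariance
# NaN never compares equal to itself, so treat two NaNs as a match
assert c1 == c2 or (pd.isna(c1) and pd.isna(c2))
assert v1 == v2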

Problem description

I came across a strange behavior in pandas rolling correlation. In the code snippet above, I'd expect v1 == v2 to be true, but it is not. This causes inf in the rolling correlation (c1 vs. c2: c2 is fine, but c1 is "wrong" in my opinion). Since the standard deviation of a constant sequence is 0, the correlation between it and any other sequence is 0/0. Returning NaN, as the plain corr does, is fine, but returning inf is annoying and misleading.
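
To make the expected behavior concrete, here is a minimal pure-Python sketch of a correlation that returns NaN whenever either input is constant, instead of dividing by a near-zero standard deviation. The helper name `safe_corr` is hypothetical and the code is not pandas' implementation; it only illustrates the 0/0 reasoning above.

```python
import math


def safe_corr(x, y):
    """Pearson correlation; NaN if either input is constant (hypothetical helper)."""
    n = len(x)
    # A constant sequence has zero variance, so the correlation is 0/0.
    if n < 2 or min(x) == max(x) or min(y) == max(y):
        return math.nan
    mean_x = math.fsum(x) / n
    mean_y = math.fsum(y) / n
    dx = [xi - mean_x for xi in x]
    dy = [yi - mean_y for yi in y]
    cov = math.fsum(p * q for p, q in zip(dx, dy))
    var_x = math.fsum(p * p for p in dx)
    var_y = math.fsum(q * q for q in dy)
    return cov / math.sqrt(var_x * var_y)


a = [1e5, 0.0, 0.0, 0.0, 0.0]
b = [9.45] * 5
print(safe_corr(a, b))  # b is constant, so this takes the NaN branch
```

Checking for a constant input up front avoids relying on the computed variance being exactly zero, which floating-point rounding does not guarantee.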

Expected Output

assertions pass

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: None
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@mroeschke mroeschke added Bug Numeric Operations Arithmetic, Comparison, and Logical operations Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Window rolling, ewma, expanding labels Dec 2, 2018
@mroeschke mroeschke removed the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label May 8, 2020
@navicor90
navicor90 commented Oct 12, 2020

Hi, I was reading the implementation of the cov function, and its final step is:
(mean(X * Y) - mean(X) * mean(Y))
where X=a and Y=b in your case.
Note that inside the cov function both series are cast to float64.
Because floating-point arithmetic has limited precision, the subtraction mean(X*Y) - mean(X)*mean(Y) does not come out exactly zero.

I thought this alternative formulation would avoid the problem:
(summarize(X * Y) - (n * mean(X) * mean(Y))) * bias_adj
I tested it on this example and it returned zero.
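
For anyone who wants to try the two formulas side by side, here is a rough pure-Python sketch. The function names are made up for illustration, `summarize` is read as a plain sum, and the scaling factors are my assumption: `n / (n - ddof)` for the mean-based form and `1 / (n - ddof)` for the sum-based form, so that both estimate the same sample covariance. This is not the pandas source.

```python
def cov_mean_form(x, y, ddof=1):
    """(mean(X*Y) - mean(X)*mean(Y)) * bias_adj, with bias_adj assumed n/(n-ddof)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    mean_xy = sum(p * q for p, q in zip(x, y)) / n
    return (mean_xy - mean_x * mean_y) * (n / (n - ddof))


def cov_sum_form(x, y, ddof=1):
    """(sum(X*Y) - n*mean(X)*mean(Y)) * bias_adj, with bias_adj assumed 1/(n-ddof)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum(p * q for p, q in zip(x, y))
    return (sum_xy - n * mean_x * mean_y) * (1.0 / (n - ddof))


a = [1e5, 0.0, 0.0, 0.0, 0.0]
b = [9.45] * 5  # constant, so the true covariance is 0

# Either result may carry a tiny rounding residue instead of an exact 0.0.
print("mean form:", cov_mean_form(a, b))
print("sum form: ", cov_sum_form(a, b))
```

Both forms subtract two large, nearly equal numbers, so any rounding in the products or means can survive the subtraction.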

@jreback
Contributor

jreback commented Oct 12, 2020

yep would take a patch for this

@navicor90
navicor90 commented Oct 14, 2020

I tried to implement this, but I found many other cases where (summarize(X * Y) - (n * mean(X) * mean(Y))) * bias_adj still does not return exactly the expected value.

Why does this happen? Because the formula still contains fractions (mean(X) and mean(Y)). I looked for other ways to do the math, but at least one fraction always remains, so the imprecision eventually appears.

As a workaround, the user can apply round, floor or ceil to the result.
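
A possible way to apply that workaround on the user side is sketched below. The choice of 6 decimal places is an arbitrary assumption and depends on the scale of your data; replacing inf with NaN is an extra step I added to match the result the original report expected, not something pandas does for you.

```python
import numpy as np
import pandas as pd

a = pd.Series([1e5, 0, 0, 0, 0])
b = pd.Series([9.45] * 5)

# Round the rolling covariance so tiny floating-point residue collapses to 0.
# The number of decimals (6) is an assumption; pick one suited to your data.
cov_rounded = a.rolling(5).cov(b).round(6)

# Treat an infinite rolling correlation as undefined, like the plain corr.
corr_clean = a.rolling(5).corr(b).replace([np.inf, -np.inf], np.nan)

print(cov_rounded.iloc[4])
print(corr_clean.iloc[4])
```

Rounding hides the symptom rather than fixing the computation, so it is only reasonable when the chosen precision is well below the values you care about.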


5 participants