Description
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd
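
# Built-in rolling correlation.
def f(x, y, window):
    return x.rolling(window).corr(y)


# Same correlation, computed by applying Series.corr to each window.
def g(x, y, window):
    return x.rolling(window).apply(lambda x: x.corr(y), raw=False)
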
if __name__ == "__main__":
    N = 20
    n = 10

    # Integer index, integer window: f and g agree.
    x = pd.Series(np.random.randn(N))
    y = 1.0 * x
    x[0:n] = 0.
    window = 4
    print(f(x, y, window)[n + window - 1 :])
    # 13    1.0
    # 14    1.0
    # 15    1.0
    # 16    1.0
    # 17    1.0
    # 18    1.0
    # 19    1.0
    # dtype: float64
    print(g(x, y, window)[n + window - 1 :])
    # 13    1.0
    # 14    1.0
    # 15    1.0
    # 16    1.0
    # 17    1.0
    # 18    1.0
    # 19    1.0

    # DatetimeIndex, integer window: f and g still agree.
    index = pd.date_range("2001-01-01", freq="D", periods=N)
    x = pd.Series(np.random.randn(N), index=index)
    y = 2.0 * x
    x[0:n] = 0.
    print(f(x, y, window)[n + window - 1 :])
    # 2001-01-14    1.0
    # 2001-01-15    1.0
    # 2001-01-16    1.0
    # 2001-01-17    1.0
    # 2001-01-18    1.0
    # 2001-01-19    1.0
    # 2001-01-20    1.0
    # Freq: D, dtype: float64
    print(g(x, y, window)[n + window - 1 :])
    # 2001-01-14    1.0
    # 2001-01-15    1.0
    # 2001-01-16    1.0
    # 2001-01-17    1.0
    # 2001-01-18    1.0
    # 2001-01-19    1.0
    # 2001-01-20    1.0
    # Freq: D, dtype: float64

    # DatetimeIndex, Timedelta window: f no longer returns 1.0, g does.
    dt_window = pd.to_timedelta("4D")
    print(f(x, y, dt_window)[n + window - 1 :])
    # 2001-01-14    0.354308
    # 2001-01-15    0.373106
    # 2001-01-16    0.372752
    # 2001-01-17    0.380531
    # 2001-01-18    0.380298
    # 2001-01-19    0.386142
    # 2001-01-20    0.410147
    # Freq: D, dtype: float64
    print(g(x, y, dt_window)[n + window - 1 :])
    # 2001-01-14    1.0
    # 2001-01-15    1.0
    # 2001-01-16    1.0
    # 2001-01-17    1.0
    # 2001-01-18    1.0
    # 2001-01-19    1.0
    # 2001-01-20    1.0
    # Freq: D, dtype: float64
Problem description
Both functions f and g should return the same values for entries 13 - 19 of the resulting series.
Currently, the result of f with window = Timedelta(days=4) is not the correlation between the values of x and y, which should be equal to 1.0 for entries 13 - 19 of the result.
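For repeatability, the comparison can be pinned to a fixed seed. The following is a minimal sketch and not part of the original report: the seed, the helper name tail_deviation, and the use of the "4D" offset string in place of pd.to_timedelta("4D") are assumptions made for illustration. It prints how far each approach deviates from 1.0 over entries 13 - 19; the first number depends on the pandas version in use.

```python
import numpy as np
import pandas as pd


def tail_deviation(result, start):
    """Largest absolute deviation from 1.0 among entries from `start` onward."""
    return float((result.iloc[start:] - 1.0).abs().max())


rng = np.random.RandomState(0)  # fixed seed: an assumption, for repeatability only
N, n, window = 20, 10, 4
index = pd.date_range("2001-01-01", freq="D", periods=N)
x = pd.Series(rng.randn(N), index=index)
y = 2.0 * x          # y is an exact multiple of the original x
x.iloc[:n] = 0.0     # zero the first n entries of x, as in the report

# Built-in rolling correlation with a time-based window.
builtin = x.rolling("4D").corr(y)
# Reference: Series.corr applied to each window (aligned with y on the index).
via_apply = x.rolling("4D").apply(lambda w: w.corr(y), raw=False)

start = n + window - 1  # first entry whose window lies fully in the unmodified data
print("built-in corr:      ", tail_deviation(builtin, start))
print("apply + Series.corr:", tail_deviation(via_apply, start))
```

A deviation close to zero on the second line and a clearly non-zero one on the first would reproduce the behaviour described above.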
Values computed on a DataFrame are also affected, i.e.
df = pd.DataFrame({"x": x, "y": y})
df.rolling(dt_window).corr()
also yields unexpected values for the cross-correlation.
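Since the pairwise result of df.rolling(...).corr() is indexed by (timestamp, column), the x/y cross-correlation has to be pulled out before it can be compared with the Series result. A small sketch of one way to do that, assuming seeded data and the "4D" offset string (both are illustrative choices, not taken from the report):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed: an assumption, for repeatability only
N, n = 20, 10
index = pd.date_range("2001-01-01", freq="D", periods=N)
x = pd.Series(rng.randn(N), index=index)
y = 2.0 * x
x.iloc[:n] = 0.0
df = pd.DataFrame({"x": x, "y": y})

# Pairwise rolling correlation: rows are (timestamp, column), columns are x and y.
pairwise = df.rolling("4D").corr()

# Select the correlation of column "x" with column "y" at every timestamp.
cross = pairwise["x"].xs("y", level=1)

# The same quantity computed through the Series API.
series_path = df["x"].rolling("4D").corr(df["y"])

print(pd.DataFrame({"from_frame": cross, "from_series": series_path}).tail(7))
```

Both columns should show the same numbers, which suggests the DataFrame and Series code paths share the same computation.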
If .corr is replaced with .cov in f and g, both functions return identical results, so the discrepancy is likely caused by a difference between the normalisation applied in the correlation computation of f and that of g.
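One way to probe the normalisation hypothesis is to assemble the correlation by hand from the rolling covariance and the rolling standard deviations, all taken over the same time-based window. This is only a sketch under stated assumptions: the helper name rolling_corr_from_parts, the seed, and the "4D" offset string are illustrative, and the formula is valid as written only when neither series contains missing values.

```python
import numpy as np
import pandas as pd


def rolling_corr_from_parts(x, y, window):
    """Rolling Pearson correlation assembled as cov / (std_x * std_y).

    All three terms use the same window and the default ddof=1,
    so the degrees-of-freedom factors cancel.
    """
    cov = x.rolling(window).cov(y)
    std_x = x.rolling(window).std()
    std_y = y.rolling(window).std()
    return cov / (std_x * std_y)


rng = np.random.RandomState(0)  # fixed seed: an assumption, for repeatability only
N, n, window = 20, 10, 4
index = pd.date_range("2001-01-01", freq="D", periods=N)
x = pd.Series(rng.randn(N), index=index)
y = 2.0 * x
x.iloc[:n] = 0.0

manual = rolling_corr_from_parts(x, y, "4D")
start = n + window - 1
print(manual.iloc[start:])
```

If this hand-built series is 1.0 for entries 13 - 19 while x.rolling(dt_window).corr(y) is not, the numerator is fine and the denominator used internally by .corr is the likely culprit.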
Expected Output
f(x, y, dt_window) should return 1.0 for entries 13 - 19, matching g(x, y, dt_window).
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit : None
pandas : 0.25.3
numpy : 1.15.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : 0.29.6
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.9.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 2.2.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.1.0
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None