BUG: df.groupby().rolling().cov() does not group properly and the cartesian product of all groups is returned instead #42915
Comments
Thanks @Xnot for the report
1.2.5 gives the same result as 1.0.5. First bad commit: [b2671cc] PERF: Rolling/Expanding.cov/corr (#39591). cc @mroeschke
Changing milestone to 1.3.5.
From test_rolling_corr_cov, it looks like the current behavior is tested. But it does appear undesirable to me, originating from this block: pandas/pandas/core/window/rolling.py, Lines 750 to 753 in 1cbe011

As @Xnot points out, when reindexing here we are taking, for each group, the entirety of `result`. But `result` contains the results for all groups, so I'm not sure why we'd want to do that. Maybe this entire block can be removed?
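To make the concern concrete, here is a small, self-contained illustration of the effect being described (a toy sketch with made-up data and names, not the actual pandas internals): if the full pairwise `result` is taken once per group and the pieces are concatenated, the output grows to the cartesian product of the groups.

```python
import pandas as pd

# Stand-in for the pairwise result: it already holds one block of rows per group.
result = pd.DataFrame(
    {"cov": [0.1, 0.2, 0.3, 0.4]},
    index=pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["idx1", "idx2"]),
)

group_labels = [1, 2]  # hypothetical group keys

# Taking the *entire* result once per group and concatenating multiplies the
# rows: len(group_labels) * len(result) instead of len(result).
blown_up = pd.concat([result for _ in group_labels], keys=group_labels)
print(len(result), len(blown_up))  # 4 vs 8: the cartesian-product effect
```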
One call on master gives the same result as another on 1.2.5. I assume these two should be equivalent and that the 1.2.5 behavior is therefore correct.
I find this convincing; however, I'm still confused by the following.
Now these two give the same result!
In the above, the values in group 1 at indexes 5-9 must always be NaN because this is where idx1 takes on the value 2; similar remarks apply to group 2 at indexes 0-4. This again looks like incorrect output.

Edit: Removing the block I highlighted in #42915 (comment) gives what I think is the expected result:
cc @mroeschke for any thoughts.
@rhshadrach I believe the result might be correct for this specific example, though. If we return to the example in the OP and apply the operation to each group individually:

```python
import pandas as pd

df_a = pd.DataFrame(
    {"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}
).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")
test = df_a.groupby(level=0)
for idx, group in test:
    print(group.rolling(2).cov(df_b))
```

In this case, each group is the same size as df_b.
However, in your example you are doing the covariance with the whole of df_a:

```python
for idx, group in test:
    print(group.rolling(2).cov(df_a))
```

In this case, the group in each iteration is expanded to the size of df_a.
I believe this is the correct behavior, and it is consistent with other operations:

```python
print("short results:")
for idx, group in test:
    print(group + df_b)

print("expanded results:")
for idx, group in test:
    print(group + df_a)
```
Following up on the discussion we had at the pandas dev meeting on 12/8/2021, I think that the results prior to pandas 1.3 were correct in terms of the semantic meaning of df.groupby().rolling().cov().
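As a sketch of those per-group semantics (assuming the df_a, df_b, and test objects from the comment above, and that the grouped call in question is df_a.groupby(level=0).rolling(2).cov(df_b); this is an illustration, not the pandas implementation):

```python
import pandas as pd

# Grouped call as discussed in this issue.
grouped = df_a.groupby(level=0).rolling(2).cov(df_b)

# Under the per-group reading, this should line up with computing the rolling
# covariance inside each group on its own and stacking the pieces.
per_group = pd.concat({name: group.rolling(2).cov(df_b) for name, group in test})

# Under the regression reported here, the grouped result instead carries the
# cartesian product of the groups, so the two shapes differ.
print(grouped.shape, per_group.shape)
```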
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
On pandas 1.3.1:
On pandas 1.0.5:
1.0.5 has the correct, expected output. In 1.3.1, everything under index 1/2/x and 2/1/x is useless and will always be NaN.
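A minimal sketch of the reproducer being discussed (assuming the df_a/df_b frames quoted in the comments; the exact snippet and outputs from the original report may differ):

```python
import pandas as pd

df_a = pd.DataFrame(
    {"value": range(10), "idx1": [1] * 5 + [2] * 5, "idx2": [1, 2, 3, 4, 5] * 2}
).set_index(["idx1", "idx2"])
df_b = pd.DataFrame({"value": range(5), "idx2": [1, 2, 3, 4, 5]}).set_index("idx2")

# On 1.0.5 this produces one block of rows per idx1 group; on 1.3.1 the
# reported behavior adds crossed, always-NaN groups (the cartesian product).
print(df_a.groupby(level=0).rolling(2).cov(df_b))
```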
Output of `pd.show_versions()`:

```
INSTALLED VERSIONS
commit : c7f7443
python : 3.9.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.2
setuptools : 54.1.2
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : 2.9.1 (dt dec pq3 ext lo64)
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.22
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None
```