API: Preferred MultiIndex result for groupby().rolling() for an object with a MultiIndex #38787
Comments
I made a notebook exploring a few different cases (both numeric and non-numeric groupby keys, as a column or as the index), looking at the rolling results across recent versions and comparing them to apply/transform: https://nbviewer.jupyter.org/gist/jorisvandenbossche/38bf6bb776092bc6fda2b2967b85b43d (I still need to summarize it and distill my thoughts for the actual API question, but that will be for later)
I've labelled as blocker and milestoned as 1.2.1 as the backport #38784 is currently blocked by this discussion.
#38784 merged. The blocker tag remains, as the first patch release on 1.2.x could be anytime, dependent on the severity of regressions.
The blocker here is deciding if the backport currently on 1.2.x is going to stay.
moving to 1.2.3. @jorisvandenbossche we have a meeting Wednesday? maybe another quick discussion on this.
This change bit me on some code that I had with version 1.0.5 of pandas. But here's an example where it is really useful to preserve the second index level.
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.3.0.dev0+1182.g3c3589b6f9'
In [3]: df=pd.read_csv("head30.csv.gz", index_col=[0,1])
In [4]: df
Out[4]:
newcases
state date
Alabama 2020-03-13 0.0
2020-03-14 6.0
2020-03-15 11.0
2020-03-16 6.0
2020-03-17 10.0
... ...
Wyoming 2020-04-05 11.0
2020-04-06 13.0
2020-04-07 8.0
2020-04-08 9.0
2020-04-09 73.0
[1650 rows x 1 columns]
In [5]: df.groupby("state").rolling(7, min_periods=1).mean()
Out[5]:
newcases
state
Alabama 0.000000
Alabama 3.000000
Alabama 5.666667
Alabama 5.750000
Alabama 6.600000
... ...
Wyoming 16.000000
Wyoming 16.714286
Wyoming 14.285714
Wyoming 13.142857
Wyoming 21.285714
[1650 rows x 1 columns]
Here, the data is by state and by date. I want the rolling average by state, knowing what the rolling average is on each date. With the current version (1.2.3) and the master version, I lose the dates. So you have to do a set_index to get them back:
In [6]: df.groupby("state").rolling(7, min_periods=1).mean().set_index(df.index)
Out[6]:
newcases
state date
Alabama 2020-03-13 0.000000
2020-03-14 3.000000
2020-03-15 5.666667
2020-03-16 5.750000
2020-03-17 6.600000
... ...
Wyoming 2020-04-05 16.000000
2020-04-06 16.714286
2020-04-07 14.285714
2020-04-08 13.142857
2020-04-09 21.285714
[1650 rows x 1 columns]
And here is what happened with pandas 1.0.5 (which is not great, because it duplicates the state level):
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '1.0.5'
In [3]: df=pd.read_csv("head30.csv.gz", index_col=[0,1])
In [4]: df
Out[4]:
newcases
state date
Alabama 2020-03-13 0.0
2020-03-14 6.0
2020-03-15 11.0
2020-03-16 6.0
2020-03-17 10.0
... ...
Wyoming 2020-04-05 11.0
2020-04-06 13.0
2020-04-07 8.0
2020-04-08 9.0
2020-04-09 73.0
[1650 rows x 1 columns]
In [5]: df.groupby("state").rolling(7, min_periods=1).mean()
Out[5]:
newcases
state state date
Alabama Alabama 2020-03-13 0.000000
2020-03-14 3.000000
2020-03-15 5.666667
2020-03-16 5.750000
2020-03-17 6.600000
... ...
Wyoming Wyoming 2020-04-05 16.000000
2020-04-06 16.714286
2020-04-07 14.285714
2020-04-08 13.142857
2020-04-09 21.285714
[1650 rows x 1 columns]
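A possible way to keep the dates without the set_index step, sketched here under the assumption that only the rolling mean of a single column is needed, is to run the rolling window inside groupby().transform(), which returns a result aligned with the original index. The small frame below is a made-up stand-in for head30.csv.gz: the Alabama values come from the output above, while the Wyoming rows are only illustrative because their earlier dates are not shown.

```python
import pandas as pd

# Toy stand-in for head30.csv.gz: a (state, date) MultiIndex with one column.
index = pd.MultiIndex.from_product(
    [["Alabama", "Wyoming"], pd.date_range("2020-03-13", periods=5)],
    names=["state", "date"],
)
df = pd.DataFrame(
    {"newcases": [0.0, 6.0, 11.0, 6.0, 10.0, 11.0, 13.0, 8.0, 9.0, 73.0]},
    index=index,
)

# Rolling mean per state, computed inside transform() so that the result
# keeps the original (state, date) index without a set_index() call.
rolling_mean = df.groupby(level="state")["newcases"].transform(
    lambda s: s.rolling(7, min_periods=1).mean()
)

print(rolling_mean)
```

This calls a Python function once per group, so it is likely slower than the built-in groupby().rolling() path; it is a trade-off rather than a replacement.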
groupby().rolling() in master currently constructs the resulting MultiIndex manually by inserting the groupby keys as the first level(s) and then the original object's Index as the second level(s).

However, groupby().rolling() behaves similarly to groupby().transform() (i.e. maintains the original shape), so should the resulting index align with results of groupby().transform()?

As shown, when the original object has a MultiIndex, there is consistency of the resulting MultiIndex for the groupby().rolling() result, but it can lead to redundancy. There is a lack of consistency of the resulting MultiIndex for the groupby().transform() result, but it looks more convenient.

IMO I prefer the consistent result we have today in groupby().rolling() but open to thoughts.
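To make the comparison concrete, here is a minimal sketch that prints the index level names produced by both paths on a small object with a MultiIndex. The names key, pos and x are hypothetical and chosen only for illustration. The level names printed for the rolling result depend on the installed pandas version, which is the point under discussion, so no expected output is given for that line.

```python
import pandas as pd

# Small Series with a two-level MultiIndex; grouping is done on the first level.
index = pd.MultiIndex.from_product(
    [["a", "b"], [0, 1, 2]], names=["key", "pos"]
)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=index, name="x")

grouped = s.groupby(level="key")

# Built-in path: the group keys are inserted as leading index level(s).
rolled = grouped.rolling(2, min_periods=1).sum()

# transform() path: the result is aligned with the original index.
transformed = grouped.transform(lambda g: g.rolling(2, min_periods=1).sum())

print("original: ", list(s.index.names))
print("rolling:  ", list(rolled.index.names))
print("transform:", list(transformed.index.names))

# The input is already sorted by group, so the values agree positionally
# and only the index differs between the two results.
same_values = (rolled.to_numpy() == transformed.to_numpy()).all()
print("same values:", same_values)
```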