-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: Should groupby.rolling be treated as a transform #51751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
A related discussion can be found at #38787. |
This statement from the rolling
(I think Personally, I find it nice that the result index has the corresponding labels associated with the values instead of the original index. If
|
Not quite following here, "they would have to" - I'm guessing "they" is the users, but I'm not sure why they would have to do this. In other words, they would have to do this in order to accomplish what? |
Why not implement a Currently, inserting a
Compared to the
Ideally, something like this would work
|
Sorry, yeah "they" is a user who would want an index equivalent to the current behavior if in the future the index is reset.
I don't think this warrants a separate method entirely |
I see, thanks. I think what I'm missing here is an understanding of a use case where the current index is desirable, but I think that is just due to my lack of experience with using rolling in a groupby. |
Having used I don't immediately see a use case for the current behaviour, but my experience with time series is only limited at this point. |
I guess the use case is the convenience of a "final" result - having a descriptive index pairing a rolling window value with the associated groupby and on label. If it is inconvenient for further manipulation and isn't consistent with other groupby results I would be open to changing. And if the resulting index was changed it would be good to describe the users "do X to get the old result" |
I still do not understand what is the pandas way of getting a rolling average column by group. I also can only see a way of doing something similar to @jpweytjens and relying on indices. This is dangerous. A simple |
My current hack: import pandas as pd
test_df = pd.concat([
pd.DataFrame({
'group': 'A',
'date': pd.date_range('2023-01-01', '2023-01-11'),
'value':
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
}),
pd.DataFrame({
'group': 'B',
'date': pd.date_range('2023-01-01', '2023-01-11'),
'value':
[11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
}),
])
test_df.reset_index(drop=True, inplace=True) # turn off to break it
assert test_df.index.is_unique
test_df.sort_values(by=['group', 'date'], ascending=True, inplace=True) # use ascending=False to break it
test_df['value_ra_3'] = test_df\
.groupby('group', as_index=True)[['group', 'date', 'value']]\
.rolling(window=3, min_periods=3, on='date')['value']\
.agg('mean').reset_index(drop=True) |
What I think should happen instead is: test_df['value_ra_3'] = test_df\
.groupby('group', as_index=True)[['group', 'date', 'value']]\
.rolling(window=3, min_periods=3, on='date')['value']\
.transform('mean') and then we can add it back safely. Which is then also consistent with any regular e.g. test_df['value_mean_group'] = test_df.groupby('group')['value'].transform('mean') which works with no surprises and no implicit index reliance |
p.s. having burned myself enough with those indices on more complicated dataframes than in the example above, I recommend using a join: value_ra_df = test_df\
.groupby('group', as_index=True)[['group', 'date', 'value']]\
.rolling(window=3, min_periods=3, on='date')['value']\
.agg('mean').reset_index(drop=False).rename(columns={'value': 'value_ra_3'})
test_df = test_df.merge(value_ra_df, on=['group', 'date']) at least this will always work |
If we go with the proposal of not including the groups by default, to get the groupings on the result of Somewhat related, in #49543 I proposed essentially having
In #51751 (comment) you can make the very end be |
If it makes things consistent with groupby transforms and pure rolling operations, I would be OK now to not include the groups or on axis in the index of the result. |
so, would just not including the groupings allow for df['rolled_sum'] = df.groupby(['a', 'b']).rolling(on='c').agg('sum') without surprises? |
My preferred approach would be to enable adding the groups to the index/column/neither across all operations but defaulting to the current behavior (aggregations would add to the index, transforms/filters would do neither). Then we could change the default behavior here while still allowing users to easily maintain current behavior, and we wouldn't have a groupby argument (as_index) that gets ignored for certain operations. |
I don't have a lot of familiarity of window ops, but thought this was an interesting question from our Reddit AMA:
When doing
DataFrame.rolling
with a reduction, the operation is a transform. With this, shouldDataFrameGroupBy.rolling
then also be treated like a transform; i.e. not include the groupers oron
column in the result, but rather return the index of the original frame? Like other transforms, users can then use the original object if they want to recover the correspondingon
values or grouping data.cc @mroeschke
The text was updated successfully, but these errors were encountered: