API: Should groupby.rolling be treated as a transform #51751

rhshadrach · 2023-03-03T02:54:13Z

I don't have a lot of familiarity of window ops, but thought this was an interesting question from our Reddit AMA:

# example DataFrame with 3 columns: date, id and a random value
dates = list(pd.date_range(start="2019-01-01", end="2019-12-01", freq="MS"))
length = len(dates)
n = 2
ids = sorted(list(range(n)) * length)
values = np.random.randint(low=0, high=10, size=length).tolist()
df = pd.DataFrame({"date": dates * n, "id": ids, "value": values * n})

print(df.head())
#         date  id  value
# 0 2019-01-01   0      8
# 1 2019-02-01   0      8
# 2 2019-03-01   0      7
# 3 2019-04-01   0      2
# 4 2019-05-01   0      1

print(df.rolling(3, on="date")["value"].max().head())
# 0    NaN
# 1    NaN
# 2    8.0
# 3    8.0
# 4    7.0
# Name: value, dtype: float64

print(df.groupby("id").rolling(3, on="date")["value"].max().head())
# id  date      
# 0   2019-01-01    NaN
#     2019-02-01    NaN
#     2019-03-01    8.0
#     2019-04-01    8.0
#     2019-05-01    7.0
# Name: value, dtype: float64

When doing DataFrame.rolling with a reduction, the operation is a transform. With this, should DataFrameGroupBy.rolling then also be treated like a transform; i.e. not include the groupers or on column in the result, but rather return the index of the original frame? Like other transforms, users can then use the original object if they want to recover the corresponding on values or grouping data.

cc @mroeschke

The text was updated successfully, but these errors were encountered:

jpweytjens · 2023-03-06T09:17:12Z

A related discussion can be found at #38787.

mroeschke · 2023-03-06T09:29:02Z

This statement from the rolling on parameter docs might have a typo, but I would assume applies to groupby rolling too

Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window.

(I think integer column should be integer index?)

Personally, I find it nice that the result index has the corresponding labels associated with the values instead of the original index. If groupby rolling didn't include the groupers and on columns, they would have to possibly sort the grouping labels and do a take on the on column. (This is how we validate on is monotonic)?

    def _validate_datetimelike_monotonic(self):
        """
        Validate that each group in self._on is monotonic
        """
        # GH 46061
        if self._on.hasnans:
            self._raise_monotonic_error("values must not have NaT")
        for group_indices in self._grouper.indices.values():
            group_on = self._on.take(group_indices)
            if not (
                group_on.is_monotonic_increasing or group_on.is_monotonic_decreasing
            ):
                on = "index" if self.on is None else self.on
                raise ValueError(
                    f"Each group within {on} must be monotonic. "
                    f"Sort the values in {on} first."
                )

rhshadrach · 2023-03-06T21:45:41Z

If groupby rolling didn't include the groupers and on columns, they would have to possibly sort the grouping labels and do a take on the on column. (This is how we validate on is monotonic)?

Not quite following here, "they would have to" - I'm guessing "they" is the users, but I'm not sure why they would have to do this. In other words, they would have to do this in order to accomplish what?

jpweytjens · 2023-03-06T22:40:29Z

Why not implement a .transform() method on the groupby().rolling() object that returns a Series with the same index as the oringinal df? This would provide a workflow very similar to the one that .groupby().transform() already provides.

Currently, inserting a df.groupby(...).rolling(...) back into df is a bit cumbersome at the moment. I only know of these 2 methods.

df["rolling_max_per_id_v1"] = df.set_index("date").groupby("id", as_index=False)["value"].rolling(3).max()["value"]
df["rolling_max_per_id_v2"] = df.groupby("id").rolling(3, on="date")["value"].max().set_axis(df.index)

Compared to the groupby().transform()

df["max_per_id"] = df.groupby("id")["value"].transform("max")

Ideally, something like this would work

df["rolling_max_per_id_ideal"] = df.groupby("id)["value"].rolling(3, on="date").transform("max")

mroeschke · 2023-03-07T23:24:41Z

Not quite following here, "they would have to" - I'm guessing "they" is the users, but I'm not sure why they would have to do this. In other words, they would have to do this in order to accomplish what?

Sorry, yeah "they" is a user who would want an index equivalent to the current behavior if in the future the index is reset.

Why not implement a .transform() method on the groupby().rolling() object that returns a Series with the same index as the oringinal df?

I don't think this warrants a separate method entirely

rhshadrach · 2023-03-09T22:15:42Z

"they" is a user who would want an index equivalent to the current behavior if in the future the index is reset.

I see, thanks. I think what I'm missing here is an understanding of a use case where the current index is desirable, but I think that is just due to my lack of experience with using rolling in a groupby.

jpweytjens · 2023-03-09T22:58:11Z

Having used groupby without rolling, I was suprised to see a different index than when using groupby with a rolling. It's unclear to me why rolling would change this behaviour?

I don't immediately see a use case for the current behaviour, but my experience with time series is only limited at this point.

mroeschke · 2023-03-09T23:49:45Z

I guess the use case is the convenience of a "final" result - having a descriptive index pairing a rolling window value with the associated groupby and on label.

If it is inconvenient for further manipulation and isn't consistent with other groupby results I would be open to changing. And if the resulting index was changed it would be good to describe the users "do X to get the old result"

Demetrio92 · 2023-12-20T19:58:53Z

I still do not understand what is the pandas way of getting a rolling average column by group. I also can only see a way of doing something similar to @jpweytjens and relying on indices.

This is dangerous. A simple .sort_values(ascending=False) breaks it. So does having a non-unique index.

Demetrio92 · 2023-12-20T20:00:31Z

My current hack:

import pandas as pd
test_df = pd.concat([
    pd.DataFrame({
        'group': 'A',
        'date': pd.date_range('2023-01-01', '2023-01-11'),
        'value':
            [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    }),
    pd.DataFrame({
        'group': 'B',
        'date': pd.date_range('2023-01-01', '2023-01-11'),
        'value':
            [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    }),
])

test_df.reset_index(drop=True, inplace=True)  # turn off to break it
assert test_df.index.is_unique  
test_df.sort_values(by=['group', 'date'], ascending=True, inplace=True)  # use ascending=False to break it
test_df['value_ra_3'] = test_df\
    .groupby('group', as_index=True)[['group', 'date', 'value']]\
    .rolling(window=3, min_periods=3, on='date')['value']\
    .agg('mean').reset_index(drop=True)

Demetrio92 · 2023-12-20T20:04:52Z

What I think should happen instead is:

test_df['value_ra_3'] = test_df\
    .groupby('group', as_index=True)[['group', 'date', 'value']]\
    .rolling(window=3, min_periods=3, on='date')['value']\
    .transform('mean')

and then we can add it back safely. Which is then also consistent with any regular groupby().transform()

e.g.

test_df['value_mean_group'] = test_df.groupby('group')['value'].transform('mean')

which works with no surprises and no implicit index reliance

Demetrio92 · 2023-12-20T20:27:59Z

p.s. having burned myself enough with those indices on more complicated dataframes than in the example above, I recommend using a join:

value_ra_df = test_df\
    .groupby('group', as_index=True)[['group', 'date', 'value']]\
    .rolling(window=3, min_periods=3, on='date')['value']\
    .agg('mean').reset_index(drop=False).rename(columns={'value': 'value_ra_3'})

test_df = test_df.merge(value_ra_df, on=['group', 'date'])

at least this will always work

rhshadrach · 2023-12-20T21:58:09Z

@mroeschke

If it is inconvenient for further manipulation and isn't consistent with other groupby results I would be open to changing. And if the resulting index was changed it would be good to describe the users "do X to get the old result"

If we go with the proposal of not including the groups by default, to get the groupings on the result of result = df.groupby(['a', 'b']).rolling(on='c').sum(), you should just do result[['a', 'b', 'c']] = df[['a', 'b', 'c']]. For the user who wants the result as a separate object, this is a bit inconvenient. But for the user who wants to add the result back to the input df, it is more convenient. df['rolled_sum'] = df.groupby(['a', 'b']).rolling(on='c').sum().

Somewhat related, in #49543 I proposed essentially having as_index=[True|False] be adhered to for transforms rather than ignored.

@Demetrio92

having burned myself enough with those indices on more complicated dataframes than in the example above, I recommend using a join:

In #51751 (comment) you can make the very end be .reset_index(level=0, drop=True) but this would only work for a single grouping (e.g. df.groupby('a') and not df.groupby(['a', 'b']).

mroeschke · 2023-12-20T23:42:33Z

If it makes things consistent with groupby transforms and pure rolling operations, I would be OK now to not include the groups or on axis in the index of the result.

Demetrio92 · 2023-12-21T17:06:37Z

so, would just not including the groupings allow for

df['rolled_sum'] = df.groupby(['a', 'b']).rolling(on='c').agg('sum')

without surprises?

rhshadrach · 2023-12-21T23:40:35Z

My preferred approach would be to enable adding the groups to the index/column/neither across all operations but defaulting to the current behavior (aggregations would add to the index, transforms/filters would do neither). Then we could change the default behavior here while still allowing users to easily maintain current behavior, and we wouldn't have a groupby argument (as_index) that gets ignored for certain operations.

rhshadrach added Groupby Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding API - Consistency Internal Consistency of API/Behavior labels Mar 3, 2023

rhshadrach changed the title ~~ENH: Should groupby.rolling be treated as a transform~~ API: Should groupby.rolling be treated as a transform Mar 3, 2023

rhshadrach mentioned this issue Mar 1, 2024

BUG: Groupby followed by calculations like rolling, ewm, expanding should not duplicate the group keys index #57690

Closed

3 tasks

rhshadrach mentioned this issue Apr 26, 2024

BUG: GroupByRolling duplicates MultiIndex levels, which prevents assignment on parent #58444

Closed

3 tasks

rhshadrach mentioned this issue Sep 24, 2024

BUG: groupby results in index with duplicate levels even when group_keys=False #59881

Closed

3 tasks

rhshadrach mentioned this issue Dec 9, 2024

BUG: inconsistent indices in GroupByRolling when selecting or not selecting subset of columns #59567

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Should groupby.rolling be treated as a transform #51751

API: Should groupby.rolling be treated as a transform #51751

rhshadrach commented Mar 3, 2023

jpweytjens commented Mar 6, 2023

mroeschke commented Mar 6, 2023

rhshadrach commented Mar 6, 2023

jpweytjens commented Mar 6, 2023

mroeschke commented Mar 7, 2023

rhshadrach commented Mar 9, 2023

jpweytjens commented Mar 9, 2023 •

edited

Loading

mroeschke commented Mar 9, 2023

Demetrio92 commented Dec 20, 2023

Demetrio92 commented Dec 20, 2023

Demetrio92 commented Dec 20, 2023

Demetrio92 commented Dec 20, 2023

rhshadrach commented Dec 20, 2023

mroeschke commented Dec 20, 2023

Demetrio92 commented Dec 21, 2023 •

edited

Loading

rhshadrach commented Dec 21, 2023

API: Should groupby.rolling be treated as a transform #51751

API: Should groupby.rolling be treated as a transform #51751

Comments

rhshadrach commented Mar 3, 2023

jpweytjens commented Mar 6, 2023

mroeschke commented Mar 6, 2023

rhshadrach commented Mar 6, 2023

jpweytjens commented Mar 6, 2023

mroeschke commented Mar 7, 2023

rhshadrach commented Mar 9, 2023

jpweytjens commented Mar 9, 2023 • edited Loading

mroeschke commented Mar 9, 2023

Demetrio92 commented Dec 20, 2023

Demetrio92 commented Dec 20, 2023

Demetrio92 commented Dec 20, 2023

Demetrio92 commented Dec 20, 2023

rhshadrach commented Dec 20, 2023

mroeschke commented Dec 20, 2023

Demetrio92 commented Dec 21, 2023 • edited Loading

rhshadrach commented Dec 21, 2023

jpweytjens commented Mar 9, 2023 •

edited

Loading

Demetrio92 commented Dec 21, 2023 •

edited

Loading