Skip to content

API: Should groupby.rolling be treated as a transform #51751

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rhshadrach opened this issue Mar 3, 2023 · 16 comments
Open

API: Should groupby.rolling be treated as a transform #51751

rhshadrach opened this issue Mar 3, 2023 · 16 comments
Labels
API - Consistency Internal Consistency of API/Behavior Groupby Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding

Comments

@rhshadrach
Copy link
Member

I don't have a lot of familiarity of window ops, but thought this was an interesting question from our Reddit AMA:

# example DataFrame with 3 columns: date, id and a random value
dates = list(pd.date_range(start="2019-01-01", end="2019-12-01", freq="MS"))
length = len(dates)
n = 2
ids = sorted(list(range(n)) * length)
values = np.random.randint(low=0, high=10, size=length).tolist()
df = pd.DataFrame({"date": dates * n, "id": ids, "value": values * n})

print(df.head())
#         date  id  value
# 0 2019-01-01   0      8
# 1 2019-02-01   0      8
# 2 2019-03-01   0      7
# 3 2019-04-01   0      2
# 4 2019-05-01   0      1

print(df.rolling(3, on="date")["value"].max().head())
# 0    NaN
# 1    NaN
# 2    8.0
# 3    8.0
# 4    7.0
# Name: value, dtype: float64

print(df.groupby("id").rolling(3, on="date")["value"].max().head())
# id  date      
# 0   2019-01-01    NaN
#     2019-02-01    NaN
#     2019-03-01    8.0
#     2019-04-01    8.0
#     2019-05-01    7.0
# Name: value, dtype: float64

When doing DataFrame.rolling with a reduction, the operation is a transform. With this, should DataFrameGroupBy.rolling then also be treated like a transform; i.e. not include the groupers or on column in the result, but rather return the index of the original frame? Like other transforms, users can then use the original object if they want to recover the corresponding on values or grouping data.

cc @mroeschke

@rhshadrach rhshadrach added Groupby Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding API - Consistency Internal Consistency of API/Behavior labels Mar 3, 2023
@rhshadrach rhshadrach changed the title ENH: Should groupby.rolling be treated as a transform API: Should groupby.rolling be treated as a transform Mar 3, 2023
@jpweytjens
Copy link

A related discussion can be found at #38787.

@mroeschke
Copy link
Member

This statement from the rolling on parameter docs might have a typo, but I would assume applies to groupby rolling too

Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window.

(I think integer column should be integer index?)

Personally, I find it nice that the result index has the corresponding labels associated with the values instead of the original index. If groupby rolling didn't include the groupers and on columns, they would have to possibly sort the grouping labels and do a take on the on column. (This is how we validate on is monotonic)?

    def _validate_datetimelike_monotonic(self):
        """
        Validate that each group in self._on is monotonic
        """
        # GH 46061
        if self._on.hasnans:
            self._raise_monotonic_error("values must not have NaT")
        for group_indices in self._grouper.indices.values():
            group_on = self._on.take(group_indices)
            if not (
                group_on.is_monotonic_increasing or group_on.is_monotonic_decreasing
            ):
                on = "index" if self.on is None else self.on
                raise ValueError(
                    f"Each group within {on} must be monotonic. "
                    f"Sort the values in {on} first."
                )

@rhshadrach
Copy link
Member Author

If groupby rolling didn't include the groupers and on columns, they would have to possibly sort the grouping labels and do a take on the on column. (This is how we validate on is monotonic)?

Not quite following here, "they would have to" - I'm guessing "they" is the users, but I'm not sure why they would have to do this. In other words, they would have to do this in order to accomplish what?

@jpweytjens
Copy link

Why not implement a .transform() method on the groupby().rolling() object that returns a Series with the same index as the oringinal df? This would provide a workflow very similar to the one that .groupby().transform() already provides.

Currently, inserting a df.groupby(...).rolling(...) back into df is a bit cumbersome at the moment. I only know of these 2 methods.

df["rolling_max_per_id_v1"] = df.set_index("date").groupby("id", as_index=False)["value"].rolling(3).max()["value"]
df["rolling_max_per_id_v2"] = df.groupby("id").rolling(3, on="date")["value"].max().set_axis(df.index)

Compared to the groupby().transform()

df["max_per_id"] = df.groupby("id")["value"].transform("max")

Ideally, something like this would work

df["rolling_max_per_id_ideal"] = df.groupby("id)["value"].rolling(3, on="date").transform("max")

@mroeschke
Copy link
Member

Not quite following here, "they would have to" - I'm guessing "they" is the users, but I'm not sure why they would have to do this. In other words, they would have to do this in order to accomplish what?

Sorry, yeah "they" is a user who would want an index equivalent to the current behavior if in the future the index is reset.

Why not implement a .transform() method on the groupby().rolling() object that returns a Series with the same index as the oringinal df?

I don't think this warrants a separate method entirely

@rhshadrach
Copy link
Member Author

"they" is a user who would want an index equivalent to the current behavior if in the future the index is reset.

I see, thanks. I think what I'm missing here is an understanding of a use case where the current index is desirable, but I think that is just due to my lack of experience with using rolling in a groupby.

@jpweytjens
Copy link

jpweytjens commented Mar 9, 2023

Having used groupby without rolling, I was suprised to see a different index than when using groupby with a rolling. It's unclear to me why rolling would change this behaviour?

I don't immediately see a use case for the current behaviour, but my experience with time series is only limited at this point.

@mroeschke
Copy link
Member

I guess the use case is the convenience of a "final" result - having a descriptive index pairing a rolling window value with the associated groupby and on label.

If it is inconvenient for further manipulation and isn't consistent with other groupby results I would be open to changing. And if the resulting index was changed it would be good to describe the users "do X to get the old result"

@Demetrio92
Copy link

I still do not understand what is the pandas way of getting a rolling average column by group. I also can only see a way of doing something similar to @jpweytjens and relying on indices.

This is dangerous. A simple .sort_values(ascending=False) breaks it. So does having a non-unique index.

@Demetrio92
Copy link

My current hack:

import pandas as pd
test_df = pd.concat([
    pd.DataFrame({
        'group': 'A',
        'date': pd.date_range('2023-01-01', '2023-01-11'),
        'value':
            [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    }),
    pd.DataFrame({
        'group': 'B',
        'date': pd.date_range('2023-01-01', '2023-01-11'),
        'value':
            [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
    }),
])

test_df.reset_index(drop=True, inplace=True)  # turn off to break it
assert test_df.index.is_unique  
test_df.sort_values(by=['group', 'date'], ascending=True, inplace=True)  # use ascending=False to break it
test_df['value_ra_3'] = test_df\
    .groupby('group', as_index=True)[['group', 'date', 'value']]\
    .rolling(window=3, min_periods=3, on='date')['value']\
    .agg('mean').reset_index(drop=True)

@Demetrio92
Copy link

What I think should happen instead is:

test_df['value_ra_3'] = test_df\
    .groupby('group', as_index=True)[['group', 'date', 'value']]\
    .rolling(window=3, min_periods=3, on='date')['value']\
    .transform('mean')

and then we can add it back safely. Which is then also consistent with any regular groupby().transform()

e.g.

test_df['value_mean_group'] = test_df.groupby('group')['value'].transform('mean')

which works with no surprises and no implicit index reliance

@Demetrio92
Copy link

p.s. having burned myself enough with those indices on more complicated dataframes than in the example above, I recommend using a join:

value_ra_df = test_df\
    .groupby('group', as_index=True)[['group', 'date', 'value']]\
    .rolling(window=3, min_periods=3, on='date')['value']\
    .agg('mean').reset_index(drop=False).rename(columns={'value': 'value_ra_3'})

test_df = test_df.merge(value_ra_df, on=['group', 'date'])

at least this will always work

@rhshadrach
Copy link
Member Author

@mroeschke

If it is inconvenient for further manipulation and isn't consistent with other groupby results I would be open to changing. And if the resulting index was changed it would be good to describe the users "do X to get the old result"

If we go with the proposal of not including the groups by default, to get the groupings on the result of result = df.groupby(['a', 'b']).rolling(on='c').sum(), you should just do result[['a', 'b', 'c']] = df[['a', 'b', 'c']]. For the user who wants the result as a separate object, this is a bit inconvenient. But for the user who wants to add the result back to the input df, it is more convenient. df['rolled_sum'] = df.groupby(['a', 'b']).rolling(on='c').sum().

Somewhat related, in #49543 I proposed essentially having as_index=[True|False] be adhered to for transforms rather than ignored.

@Demetrio92

having burned myself enough with those indices on more complicated dataframes than in the example above, I recommend using a join:

In #51751 (comment) you can make the very end be .reset_index(level=0, drop=True) but this would only work for a single grouping (e.g. df.groupby('a') and not df.groupby(['a', 'b']).

@mroeschke
Copy link
Member

If it makes things consistent with groupby transforms and pure rolling operations, I would be OK now to not include the groups or on axis in the index of the result.

@Demetrio92
Copy link

Demetrio92 commented Dec 21, 2023

so, would just not including the groupings allow for

df['rolled_sum'] = df.groupby(['a', 'b']).rolling(on='c').agg('sum')

without surprises?

@rhshadrach
Copy link
Member Author

My preferred approach would be to enable adding the groups to the index/column/neither across all operations but defaulting to the current behavior (aggregations would add to the index, transforms/filters would do neither). Then we could change the default behavior here while still allowing users to easily maintain current behavior, and we wouldn't have a groupby argument (as_index) that gets ignored for certain operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Groupby Needs Discussion Requires discussion from core team before further action Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

4 participants