Skip to content

API: Preferred MultiIndex result for groupby().rolling() for an object with a MultiIndex #38787

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mroeschke opened this issue Dec 29, 2020 · 8 comments · Fixed by #40701
Closed
Labels
API - Consistency Internal Consistency of API/Behavior Groupby Window rolling, ewma, expanding
Milestone

Comments

@mroeschke
Copy link
Member

groupby().rolling() in master currently constructs the resulting MultiIndex manually by inserting groupby keys as the first level(s) and then the original object's Index as the second level(s).

However, groupby().rolling() behaves similarly to groupby().transform() (i.e. maintains the original shape), so should the resulting index align with results of groupby().transform()?

In [4]: pd.__version__
Out[4]: '1.3.0.dev0+238.g1d196295c'

In [6]: df = pd.DataFrame({'a': [1], 'b': [2]})

In [7]: df
Out[7]:
   a  b
0  1  2

# Example of the original DataFrame having a regular Index
In [8]: df.groupby('a').rolling(1).sum()
Out[8]:
       a    b
a
1 0  1.0  2.0

In [16]: df.groupby('a').transform(lambda x: np.sum(x))
Out[16]:
   b
0  2

In [9]: df = pd.DataFrame({'a': [1], 'b': [2]}, index=pd.MultiIndex.from_tuples([('idx1', 'idx2')], names=['label1', 'label2']))

In [9]: df
Out[9]:
               a  b
label1 label2
idx1   idx2    1  2

# Examples of the original DataFrame having a MultiIndex
In [10]: df.groupby('label1').rolling(1).sum()
Out[10]:
                        a    b
label1 label1 label2
idx1   idx1   idx2    1.0  2.0

In [20]: df.groupby('label1').transform(lambda x: np.sum(x))
Out[20]:
               a  b
label1 label2
idx1   idx2    1  2

In [11]: df.groupby('a').rolling(1).sum()
Out[11]:
                   a    b
a label1 label2
1 idx1   idx2    1.0  2.0

In [21]: df.groupby('a').transform(lambda x: np.sum(x))
Out[21]:
               b
label1 label2
idx1   idx2    2

In [12]: df.groupby(['label1', 'a']).rolling(1).sum()
Out[12]:
                          b
label1 a label1 label2
idx1   1 idx1   idx2    2.0

In [22]: df.groupby(['label1', 'a']).transform(lambda x: np.sum(x))
Out[22]:
               b
label1 label2
idx1   idx2    2

As shown, when the original object as a MultiIndex, there is consistency of the resulting MultiIndex for the groupby().rolling() result but can lead to redundancy. There is lack of consistency of the resulting MultiIndex for the groupby().transform() result but looks more convenient.

IMO I prefer the consistent result we have today in groupby().rolling() but open to thoughts.

@jorisvandenbossche
Copy link
Member

I made a notebook exploring a few different cases (both numeric as non-numeric groupby key, as column or as index), looking at the rolling results across recent versions, and comparing to apply/transform: https://nbviewer.jupyter.org/gist/jorisvandenbossche/38bf6bb776092bc6fda2b2967b85b43d

(still need to make a summary of it / distill my thoughts for the actual API question, but that will be for later)

@simonjayhawkins simonjayhawkins added the Blocker Blocking issue or pull request for an upcoming release label Dec 31, 2020
@simonjayhawkins simonjayhawkins added this to the 1.2.1 milestone Dec 31, 2020
@simonjayhawkins
Copy link
Member

I've labelled as blocker and milestoned as 1.2.1 as the backport #38784 is currently blocked by this discussion.

@simonjayhawkins
Copy link
Member

#38784 merged. blocker tag remains as 1st patch release on 1.2.x could be anytime dependant on severity of regressions.

@jorisvandenbossche
Copy link
Member

In [16]: df = pd.DataFrame({"key": ['val1', 'val1', 'val2'], "values": [1, 2, 3]})

In [17]: df2 = df.set_index("key")

In [18]: 
    ...: 
    ...: df2.groupby("key").rolling(2, min_periods=1).sum()
    ...: 
    ...: 
Out[18]: 
           values
key  key         
val1 val1     1.0
     val1     3.0
val2 val2     3.0

In [19]: df2.groupby("key").transform(lambda x: x.rolling(2, min_periods=1).sum())
Out[19]: 
      values
key         
val1     1.0
val1     3.0
val2     3.0

@TomAugspurger
Copy link
Contributor

The blocker here is deciding if the backport currently on 1.2.x is going to stay.

@jorisvandenbossche
Copy link
Member

In [20]: df = pd.DataFrame({"key": ['val1', 'val2', 'val1'], "values": [1, 2, 3]})

In [21]: df = df.set_index("key")

In [22]: df.groupby("key").rolling(2, min_periods=1).sum()
Out[22]: 
           values
key  key         
val1 val1     1.0
     val1     4.0
val2 val2     2.0

In [23]: df.groupby("key").transform(lambda x: x.rolling(2, min_periods=1).sum())
Out[23]: 
      values
key         
val1     1.0
val2     2.0
val1     4.0

In [26]: df.groupby("key", group_keys=True).apply(lambda x: x.rolling(2, min_periods=1).sum())
Out[26]: 
           values
key  key         
val1 val1     1.0
     val1     4.0
val2 val2     2.0

In [27]: df.groupby("key", group_keys=False).apply(lambda x: x.rolling(2, min_periods=1).sum())
Out[27]: 
      values
key         
val1     1.0
val1     4.0
val2     2.0

@jreback jreback removed the Blocker Blocking issue or pull request for an upcoming release label Jan 16, 2021
@jreback jreback modified the milestones: 1.2.1, 1.2.2 Jan 16, 2021
@simonjayhawkins
Copy link
Member

moving to 1.2.3.

@jorisvandenbossche we have a meeting Wednesday? maybe another quick discussion on this.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Mar 29, 2021

This change bit me on some code that I had with version 1.0.5 of pandas. But here's an example where it is really useful to preserve the second index.

head30.csv.gz

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.3.0.dev0+1182.g3c3589b6f9'

In [3]: df=pd.read_csv("head30.csv.gz", index_col=[0,1])

In [4]: df
Out[4]:
                    newcases
state   date
Alabama 2020-03-13       0.0
        2020-03-14       6.0
        2020-03-15      11.0
        2020-03-16       6.0
        2020-03-17      10.0
...                      ...
Wyoming 2020-04-05      11.0
        2020-04-06      13.0
        2020-04-07       8.0
        2020-04-08       9.0
        2020-04-09      73.0

[1650 rows x 1 columns]

In [5]: df.groupby("state").rolling(7, min_periods=1).mean()
Out[5]:
          newcases
state
Alabama   0.000000
Alabama   3.000000
Alabama   5.666667
Alabama   5.750000
Alabama   6.600000
...            ...
Wyoming  16.000000
Wyoming  16.714286
Wyoming  14.285714
Wyoming  13.142857
Wyoming  21.285714

[1650 rows x 1 columns]

Here, the data is by state and by date. I want the rolling average by state, knowing what the rolling average is on each date. With the current version (1.2.3) and the master version, I lose the dates. So you have to do a set_index() to get the index correct, which was non-intuitive.

In [6]:  df.groupby("state").rolling(7, min_periods=1).mean().set_index(df.index)
Out[6]:
                     newcases
state   date
Alabama 2020-03-13   0.000000
        2020-03-14   3.000000
        2020-03-15   5.666667
        2020-03-16   5.750000
        2020-03-17   6.600000
...                       ...
Wyoming 2020-04-05  16.000000
        2020-04-06  16.714286
        2020-04-07  14.285714
        2020-04-08  13.142857
        2020-04-09  21.285714

[1650 rows x 1 columns]

And here is what happened with pandas 1.0.5 (which is not great, because it duplicates the state, but at least the date index is preserved):

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.0.5'

In [3]: df=pd.read_csv("head30.csv.gz", index_col=[0,1])

In [4]: df
Out[4]:
                    newcases
state   date
Alabama 2020-03-13       0.0
        2020-03-14       6.0
        2020-03-15      11.0
        2020-03-16       6.0
        2020-03-17      10.0
...                      ...
Wyoming 2020-04-05      11.0
        2020-04-06      13.0
        2020-04-07       8.0
        2020-04-08       9.0
        2020-04-09      73.0

[1650 rows x 1 columns]

In [5]: df.groupby("state").rolling(7, min_periods=1).mean()
Out[5]:
                             newcases
state   state   date
Alabama Alabama 2020-03-13   0.000000
                2020-03-14   3.000000
                2020-03-15   5.666667
                2020-03-16   5.750000
                2020-03-17   6.600000
...                               ...
Wyoming Wyoming 2020-04-05  16.000000
                2020-04-06  16.714286
                2020-04-07  14.285714
                2020-04-08  13.142857
                2020-04-09  21.285714

[1650 rows x 1 columns]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Groupby Window rolling, ewma, expanding
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants