ENH: allow .rolling / .expanding as groupby methods #12743

Closed
wants to merge 1 commit into from

jreback
Contributor

@jreback jreback commented Mar 30, 2016

closes #12738
closes #12486
closes #12363

  • more tests (other methods)

- [ ] doc section in groupby will do later

In [3]: pd.options.display.max_rows=10

In [4]:    df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                      'B': np.arange(40)})

In [5]: df
Out[5]: 
    A   B
0   1   0
1   1   1
2   1   2
3   1   3
4   1   4
.. ..  ..
35  3  35
36  3  36
37  3  37
38  3  38
39  3  39

[40 rows x 2 columns]

In [6]:    df.groupby('A').apply(lambda x: x.rolling(4).B.mean())
Out[6]: 
A    
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
         ... 
3  35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64

In [7]:    df.groupby('A').rolling(4).B.mean()
Out[7]: 
A    
1  0      NaN
   1      NaN
   2      NaN
   3      1.5
   4      2.5
         ... 
3  35    33.5
   36    34.5
   37    35.5
   38    36.5
   39    37.5
Name: B, dtype: float64
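
As a quick sanity check of the equivalence shown above, here is a minimal sketch (not taken from the PR's test suite) that compares the chained syntax against a hand-written per-group loop, for both `.rolling` and `.expanding`. Selecting column `B` before windowing is an assumption made here so the snippet does not depend on how a given pandas version treats the grouping column; the comparison is on values only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})

# New syntax: rolling / expanding chained directly off the groupby.
rolled = df.groupby('A')['B'].rolling(4).mean()
expanded = df.groupby('A')['B'].expanding().mean()

# Reference: window each group by hand, then stack the pieces back together.
manual_rolled = pd.concat(
    {key: grp['B'].rolling(4).mean() for key, grp in df.groupby('A')})
manual_expanded = pd.concat(
    {key: grp['B'].expanding().mean() for key, grp in df.groupby('A')})

# Compare values only; NaNs at the start of each rolling window must line up.
same_rolled = np.allclose(rolled.to_numpy(), manual_rolled.to_numpy(),
                          equal_nan=True)
same_expanded = np.allclose(expanded.to_numpy(), manual_expanded.to_numpy())
print(same_rolled, same_expanded)
```
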
In [9]:    df.index = pd.date_range('20130101',freq='s',periods=40)

In [10]:    df
Out[10]: 
                     A   B
2013-01-01 00:00:00  1   0
2013-01-01 00:00:01  1   1
2013-01-01 00:00:02  1   2
2013-01-01 00:00:03  1   3
2013-01-01 00:00:04  1   4
...                 ..  ..
2013-01-01 00:00:35  3  35
2013-01-01 00:00:36  3  36
2013-01-01 00:00:37  3  37
2013-01-01 00:00:38  3  38
2013-01-01 00:00:39  3  39

[40 rows x 2 columns]

In [11]:    df.groupby('A').apply(lambda x: x.resample('4s').mean())
Out[11]: 
                         A     B
A                               
1 2013-01-01 00:00:00  1.0   1.5
  2013-01-01 00:00:04  1.0   5.5
  2013-01-01 00:00:08  1.0   9.5
  2013-01-01 00:00:12  1.0  13.5
  2013-01-01 00:00:16  1.0  17.5
2 2013-01-01 00:00:20  2.0  21.5
  2013-01-01 00:00:24  2.0  25.5
  2013-01-01 00:00:28  2.0  29.5
3 2013-01-01 00:00:32  3.0  33.5
  2013-01-01 00:00:36  3.0  37.5

In [12]:    df.groupby('A').resample('4s').mean()
Out[12]: 
                         A     B
A                               
1 2013-01-01 00:00:00  1.0   1.5
  2013-01-01 00:00:04  1.0   5.5
  2013-01-01 00:00:08  1.0   9.5
  2013-01-01 00:00:12  1.0  13.5
  2013-01-01 00:00:16  1.0  17.5
2 2013-01-01 00:00:20  2.0  21.5
  2013-01-01 00:00:24  2.0  25.5
  2013-01-01 00:00:28  2.0  29.5
3 2013-01-01 00:00:32  3.0  33.5
  2013-01-01 00:00:36  3.0  37.5
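
The resample case can be reproduced in a self-contained way along the same lines. This is only a sketch: it selects column `B` up front (an assumption, to sidestep version-specific handling of the grouping column `A`), so the result is a Series rather than the two-column frame shown above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)},
                  index=pd.date_range('20130101', freq='s', periods=40))

# Resample each group independently into 4-second bins.
resampled = df.groupby('A')['B'].resample('4s').mean()

# The outer index level is the group key, the inner one the bin start.
print(resampled.index.names)
print(resampled.loc[2])
```
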

@jreback jreback added Enhancement Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 30, 2016
@jreback jreback added this to the 0.18.1 milestone Mar 30, 2016
@jreback
Contributor Author

jreback commented Mar 30, 2016

cc @lminer

@@ -794,6 +794,7 @@ def _concat_objects(self, keys, values, not_indexed_same=False):

if isinstance(result, Series):
result = result.reindex(ax)
result.name = self.name
Contributor Author

@sinhrks this fixed a couple of tests in groupby below, but I don't know if we have a related issue. any idea?

related (but did not fix) #12363

Member

Not sure, I can find no open issue.

@jreback jreback changed the title [WIP] ENH: allow .rolling / .expanding as groupby methods ENH: allow .rolling / .expanding as groupby methods Mar 31, 2016
@jreback
Contributor Author

jreback commented Mar 31, 2016

ok, this is ready. comments?

@shoyer @jorisvandenbossche @TomAugspurger @sinhrks

@jorisvandenbossche
Member

Is the example at the top still up to date?
I am wondering a bit if the grouper should be in the result or not?

@jorisvandenbossche
Member

Also, what does the result look like if you have a non-sorted grouper?
Currently with apply it sorts the groups (but also includes the grouper as the index)
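
To make the question concrete, here is a small sketch with a deliberately interleaved grouper. The tiny frame is made up for illustration. With the default `sort=True`, my understanding is that the chained form returns the groups in sorted key order, with rows inside each group kept in their original order; treat that as an expectation to verify on your own build rather than a guarantee from this thread.

```python
import pandas as pd

# Grouper values deliberately interleaved rather than sorted.
df = pd.DataFrame({'A': [2, 1, 2, 1], 'B': [0, 1, 2, 3]})

result = df.groupby('A')['B'].rolling(2).sum()

# Outer level: group key; inner level: original row label.
print(result)
```
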

@jreback
Contributor Author

jreback commented Apr 2, 2016

@jorisvandenbossche all updated. Had to work thru some issues. But much more fully tested now. This basically replicates what .apply is doing with the new syntax by actually using apply. In theory we could have a more efficient impl, but will leave that for later.

Also this PR cleans up a bunch of cases where the name is not returned for groupbys. So much more consistency now.

@@ -339,16 +342,23 @@ def __init__(self, obj, keys=None, axis=0, level=None,
self.sort = sort
self.group_keys = group_keys
self.squeeze = squeeze
self.mutated = kwargs.pop('mutated', False)
Contributor Author

This is a massive hack. Well, hack is the wrong word: it's a way of informing the groupby that we want to force the MultiIndex construction path, which is normally taken only when things are mutated. It's not an externally visible kw.

@jreback
Contributor Author

jreback commented Apr 10, 2016

@jorisvandenbossche if you'd have a look

@jorisvandenbossche
Member

Will take a look later today!

'B': np.arange(40)})
df

You can now use ``.rolling(..)`` and ``.expanding(..)`` as methods on groupbys. These return another object where you operate.
Member

"These return another object where you operate" -> this is not really a clear sentence. What exactly do you want to say?

Contributor Author

updated

@TomAugspurger
Contributor

TomAugspurger commented Apr 19, 2016

@jreback one more lost index name, when you group a Series by another (using the original dataframe)

In [27]: df['B'].groupby(df.A).rolling(4).mean()
Out[27]:
A  idx
1  0       NaN
   1       NaN
   2       NaN
   3       1.5
   4       2.5
          ...
3  35     33.5
   36     34.5
   37     35.5
   38     36.5
   39     37.5
dtype: float64

Let me know if you want to fix that here, otherwise I'll open an issue.
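
For anyone wanting to check this case on their own build, a minimal sketch follows. The index name `idx` is an assumption inferred from the output above, and the frame is the one from the PR description. On a build that includes the name fix, I would expect both the Series name and the index level names to be retained.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 20 + [2] * 12 + [3] * 8,
                   'B': np.arange(40)})
df.index.name = 'idx'  # assumed, to match the "A  idx" header above

# Group one Series by another Series taken from the same frame.
result = df['B'].groupby(df['A']).rolling(4).mean()

print(result.name)
print(list(result.index.names))
```
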

@TomAugspurger
Contributor

And I'm (correctly) seeing a MultiIndex for the df.groupby('A').resample('3s').mean()

@jorisvandenbossche
Member

Will try again (I just fetched the head of this PR, didn't rebase myself as I thought you already did that)

@jorisvandenbossche
Member

@TomAugspurger It's with the rolling case I do not see a MultiIndex (df.groupby('A').rolling(3).mean()), resample did work for me as well.

@jreback Tested it again (fetched this PR, rebased on latest master myself, rebuilt pandas to be certain), and I am still getting the same result as I showed in #12743 (comment) (using Python 2.7, numpy 1.10.1, Windows)

@jreback
Contributor Author

jreback commented Apr 19, 2016

@jorisvandenbossche hmm, ok I did on 2.7/1.10.4 and it doesn't look right. odd. (on windows)

@jreback
Contributor Author

jreback commented Apr 19, 2016

ok, this has been failing on windows on my branch : https://ci.appveyor.com/project/jreback/pandas/build/1.0.2077/job/qfrsv7p46f6ns6xg

@jreback
Contributor Author

jreback commented Apr 19, 2016

ok, this is not a 2.7/numpy issue, but something specific to windows.

@jreback
Contributor Author

jreback commented Apr 19, 2016

ok, pushed a commit to fix. This comes back to some of the 'figuring out what the shape of the return is' logic. IOW it's a heuristic depending on whether something is mutated (this is the standard case). But when we are doing a chained window/resample we need to force this.

@jreback
Contributor Author

jreback commented Apr 19, 2016

pushed a new version, had some flake/formatting issues on windows.

@jreback jreback force-pushed the expand branch 3 times, most recently from 209c013 to 190ecd0 Compare April 19, 2016 16:18
@jreback
Contributor Author

jreback commented Apr 19, 2016

@TomAugspurger I picked up that last name fix (was a bug in the concat step).

closes pandas-dev#12738

BUG: allow df.groupby(...).resample(...) to return a Resampler groupby object

closes pandas-dev#12486

BUG: consistency of name of returned groupby

closes pandas-dev#12363
@jreback
Contributor Author

jreback commented Apr 22, 2016

@TomAugspurger @jorisvandenbossche any more comments?

@TomAugspurger
Contributor

Nothing else from me 👍

@jreback
Contributor Author

jreback commented Apr 25, 2016

ok will merge shortly unless @jorisvandenbossche has any further comments.

@jreback jreback closed this in 6994240 Apr 26, 2016
@jorisvandenbossche
Member

Tested on master, and now indeed the inconsistency is solved! Thanks

@ibigquant

ibigquant commented Aug 11, 2017

@jreback, thanks.

I want to do a groupby, then rolling, then corr on two columns. The code below takes 18s+ on 1M rows:

df.groupby('name', as_index=False, sort=False, group_keys=False).apply(
        lambda x: x['_a'].rolling(d).corr(other=x['_b'], pairwise=True))

Is there a faster way to do this? I tried the code below, but it does not work (it raises exceptions):

_g = df.groupby('name', as_index=False, sort=False, group_keys=False)
_g['_a'].rolling(5).corr(other=_g['_b'], pairwise=False)

Another question: how can I drop the group key (and keep just the original index) for the code below:

_g['_a'].rolling(d).min()
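
One possible way to approach the second question, sketched under assumptions: the frame below is made-up data using the column names from the snippet above, and the original index is assumed to be unique. The idea is to drop the outer (group key) level of the result and then realign to the original row order. This does not address the performance question about `corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({'name': ['x', 'y', 'x', 'y', 'x', 'y'],
                   '_a': [3.0, 1.0, 2.0, 5.0, 4.0, 0.0]})
d = 2

# Grouped rolling minimum: the result is indexed by (name, original index).
grouped_min = df.groupby('name')['_a'].rolling(d).min()

# Drop the group-key level, then restore the original row order.
flat = grouped_min.droplevel(0).reindex(df.index)
print(flat)
```

The `reindex` step matters because the grouped result comes back ordered by group, not by the original row positions.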
