
API: Add pipe method to GroupBy objects #10353


Closed
ghl3 opened this issue Jun 14, 2015 · 6 comments · Fixed by #17871
Labels
API Design Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@ghl3

ghl3 commented Jun 14, 2015

Extend the new "pipe" protocol to GroupBy objects to allow for piping of a wider class of functions. Currently, one can only create pipes that chain together objects inheriting from NDFrame. But the concept of piping is general and could be extended to other pandas objects, specifically anything inheriting from GroupBy.

The use case is to write pipes that allow one to freely transform back and forth between NDFrames and GroupBy objects. Example:

from pandas import DataFrame

df = DataFrame({'A': [...], 'B': [...]})

def f(dfgb):
    return dfgb['B'].value_counts()

def g(srs):
    return srs * 2

grouped = df.groupby('A')

grouped.pipe(f).pipe(g)

Note that these transformations are

  • GroupBy -> Series
  • Series -> Series

and the chain seamlessly switches from a GroupBy.pipe to an NDFrame.pipe.

There are a few ways to implement this. A simple way is to break out the core functionality of "pipe" into a pure function and then to call that function in any method implementation of pipe. Another way is to think of piping as a mix-in trait, put it as a method in a base class, and then mix that base class into any class that wants to implement pipe-ability. I have no strong preference between these options, and I'm open to other implementations that may be more in line with pandas' design goals or the long-term vision of the "pipe" concept.

A strawman implementation of the first suggestion can be found here:
master...ghl3:groupby-pipe

CC
@TomAugspurger
@shoyer

@shoyer
Member

shoyer commented Jun 14, 2015

Yep, this looks like a good idea to me.

I don't have strong feelings about how it's implemented, though I suspect the mixin approach is the way to go -- that could make it easier to add documentation specific to each class without a lot of duplicated text. (On the other hand, I suspect few people look at the docstrings on groupby methods.) Either way it's pretty straightforward.

@jorisvandenbossche
Member

How is this different than the already existing df.groupby(..).apply(..)? (apart from the ability to pass a tuple in pipe)

@ghl3
Author

ghl3 commented Jun 14, 2015

It's similar. df.groupby(..).apply(..) applies the function to the underlying DataFrame in each group, so the function should take a DataFrame. This pipe implementation would act on the groupby itself, so you pass it functions whose argument is a DataFrameGroupBy.

So, with this, you could do:

def f(dfgb):
    return dfgb.get_group('A')

dfgb.pipe(f)

But if you tried with apply, it would fail:

dfgb.apply(f)
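The distinction can be illustrated without pandas. The sketch below uses a hypothetical `ToyGroupBy` class (an assumption for illustration, not the pandas implementation) whose `apply` calls the function once per group and whose `pipe` calls it once with the whole grouped object. A function that relies on `get_group` therefore works through `pipe` but not through `apply`.

```python
class ToyGroupBy:
    """Hypothetical stand-in for a GroupBy object, not pandas code."""

    def __init__(self, rows, key):
        self.groups = {}
        for row in rows:
            self.groups.setdefault(row[key], []).append(row)

    def get_group(self, name):
        return self.groups[name]

    def apply(self, func):
        # func sees only the rows of one group at a time.
        return {name: func(rows) for name, rows in self.groups.items()}

    def pipe(self, func):
        # func sees the grouped object itself.
        return func(self)


def f(gb):
    return gb.get_group("x")


gb = ToyGroupBy([{"A": "x", "B": 1}, {"A": "x", "B": 2}, {"A": "y", "B": 3}], "A")

picked = gb.pipe(f)  # works: f receives the ToyGroupBy

try:
    gb.apply(f)  # each group is a plain list, which has no get_group
    apply_failed = False
except AttributeError:
    apply_failed = True

print(picked, apply_failed)
```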

@jorisvandenbossche
Member

Ah sorry, I misread the fact that it would act on the whole GroupBy object instead of on the DataFrames/groups.

Do you have an example of a real use case for this? (apart from the dummy example above, just curious)

@ghl3
Author

ghl3 commented Jun 14, 2015

Sure. One thing I find myself doing a lot with pandas is working on classification problems (as in machine-learning like problems), and in particular using pandas plotting as a means of exploring and diagnosing classification problems. I've found that a nice way to build helper functions for transforming and plotting data in this domain is to work with a DataFrameGroupBy, where the grouped variable is the class associated with the classification. It's simply a convenient interface to build functions around, as the class is implicit in the grouped column, and it supports multiple classes or nested classes, etc. I have many such functions whose first argument is a DataFrameGroupBy.

So, a common pattern is to start with an initial DataFrame, do a lot of transformations on it, and then feed it into a function that takes a DataFrameGroupBy for the purpose of plotting or reporting. This means that the last function call in my chain of transformations is one that takes a DataFrameGroupBy. With the new piping functions, it would be nice to do something like:

df = df.pipe(f).pipe(g).pipe(h).groupby('group').pipe(generate_report)

Of course, all this is possible without adding a pipe function: one can always create a temporary variable or do this in a number of other ways. But since we already have the pipe function on the DataFrame, I think adding it to the GroupBy creates a nice symmetry.

@jreback jreback added API Design Needs Discussion Requires discussion from core team before further action labels Jun 17, 2015
@jreback jreback added this to the 0.17.0 milestone Jun 30, 2015
@jankatins
Contributor

If I read the R data wrangling cheatsheet right, then in R-land df %>% group_by(...) %>% mutate(...) (or ... %>% summarise(…)) is basically df.groupby(...).apply(...) (i.e. apply the function to each group). I'm not sure pipe should then map to different semantics (e.g. piping in the complete gb object).

ghl3 added a commit to ghl3/pandas that referenced this issue Sep 13, 2015
@jreback jreback modified the milestones: 0.17.0, 0.17.1 Sep 25, 2015
@jreback jreback modified the milestones: Next Major Release, 0.17.1 Nov 13, 2015
@jreback jreback modified the milestones: Next Major Release, 0.21.0 Oct 14, 2017
5 participants