Skip to content

ENH: add GroupBy.pipe method #17871

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 18, 2017
Merged

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Oct 14, 2017

This PR adds a .pipe method to GroupBy objects like for DataFrame.pipe and Series .pipe.

This PR is basically #10466 written by @ghl3 with some very minor updates, because that PR somehow got stalled and subsequently was closed.

This PR says it's new in 0.21, but I'll change that if it's too late to add this.

@topper-123 topper-123 changed the title add GroupBy.pipe method ENH: add GroupBy.pipe method Oct 14, 2017
allow for a cleaner, more readable syntax. To read about ``.pipe`` in general terms,
see :ref:`here <basics.pipe>`.

For a concrete example on combining ``.groupby`` and ``.pipe`` , imagine have a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have -> having


.. ipython:: python

from numpy.random import choice, random
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't import like like, rather import numpy as np and write things out (e.g. np.random.choice

(base_df.pipe(lambda x: x[x.A>3])
.groupby(['Store', 'Product'])
.pipe(rapport_func)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit abstract

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've redone it with the previous df. Alternatively, I could remove this example.

element of the tuple.

func : callable or tuple of (callable, string)
Function to apply to this GroupBy or, alternatively, a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove the GroupBy from here and make this more generic

kwargs[target] = obj
return func(*args, **kwargs)
else:
return func(obj, *args, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a return at the end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@ghl3
Copy link

ghl3 commented Oct 14, 2017

@topper-123 Thanks a lot for reviving this feature.

@topper-123 topper-123 force-pushed the groupby.pipe branch 3 times, most recently from a88a50a to 956483f Compare October 14, 2017 23:04
@codecov
Copy link

codecov bot commented Oct 14, 2017

Codecov Report

Merging #17871 into master will increase coverage by <.01%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17871      +/-   ##
==========================================
+ Coverage   91.23%   91.24%   +<.01%     
==========================================
  Files         163      163              
  Lines       50105    50110       +5     
==========================================
+ Hits        45715    45723       +8     
+ Misses       4390     4387       -3
Flag Coverage Δ
#multiple 89.05% <50%> (+0.02%) ⬆️
#single 40.31% <16.66%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 92.53% <100%> (+0.33%) ⬆️
pandas/core/groupby.py 91.99% <100%> (ø) ⬆️
pandas/core/common.py 91.18% <33.33%> (-1.63%) ⬇️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.75% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.41% <0%> (-0.1%) ⬇️
pandas/core/dtypes/dtypes.py 95.14% <0%> (ø) ⬆️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bf7f9a...608a0e4. Read the comment docs.

@topper-123
Copy link
Contributor Author

@jreback , I've made changes to the PR. Travis was green and newest forcepush is just for some linting issues.

@ghl3, well, thank you for writing most of this:-)

@@ -235,6 +235,9 @@ Other Enhancements
- Improved the import time of pandas by about 2.25x. (:issue:`16764`)
- :func:`read_json` and :func:`to_json` now accept a ``compression`` argument which allows them to transparently handle compressed files. (:issue:`17798`)
- :func:`Series.reindex`, :func:`DataFrame.reindex`, :func:`Index.get_indexer` now support list-like argument for ``tolerance``. (:issue:`17367`)
- ``GroupBy`` objects now have a ``pipe`` method, similar to the one on ``DataFrame`` and ``Series``
that allow for functions that take a ``GroupBy`` to be composed in a clean, readable syntax.
See the :ref:`documentation <groupby.pipe>` for more. (:issue:`17871`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can move this to highlites is ok

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll move this to highlights and add an example in the whatsnew (if I understand you correctly?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@jreback jreback added this to the 0.21.0 milestone Oct 14, 2017
@jreback
Copy link
Contributor

jreback commented Oct 14, 2017

small change

@TomAugspurger @jorisvandenbossche @shoyer

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 15, 2017

Had some git issue, so squashed all the commits.

I'm off to sleep, I'll look into eventual comments tomorrow.


(df.groupby(['Store', 'Product']).pipe(rapport_func)

where ``rapport_func`` take an arbitrary GroupBy object and creates a rapport
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 'report' is the correct English word? (also above in the code example)
(no native speaker, but in my mother tongue 'rapport' exists, but the english translation is 'report' :-))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's rapport in my language too. I'm changed that


.. versionadded:: 0.21.0

Similar to the functionality provided by ``DataFrames`` and ``Series``, functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataFrames -> DataFrame (or otherwise do not put it as a code object)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some minor comments

@@ -3508,7 +3510,7 @@ def sample(self, n=None, frac=None, replace=False, weights=None,
-----

Use ``.pipe`` when chaining together functions that expect
on Series or DataFrames. Instead of writing
Series, DataFrames or GroupBys. Instead of writing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this change is needed? (as the docstring of GroupBy.pipe has a separate one?)

Copy link
Contributor Author

@topper-123 topper-123 Oct 17, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "on" was gramatically incorrect, So I changed the line. I think adding "GroupBy objects" add clarity that GroupBy objrcts can be used with piping. I didnt intend for this to only refer to series/DataFrames, but is too easy to misunderstand?

Parameters
----------
func : callable or tuple of (callable, string)
Function to apply to this GroupBy or, alternatively, a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The start of "Function ..." does not need to be aligned with the "callable .. " above, but just have a single identation of 4 spaces (numpy docstring specifics ..)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do "this GroupBy" -> "this GroupBy object"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, changing.

@topper-123 topper-123 force-pushed the groupby.pipe branch 2 times, most recently from 680473c to adfb81c Compare October 17, 2017 01:10
@topper-123
Copy link
Contributor Author

The comments from @jorisvandenbossche have been addressed. Thanks for reviewing.

@@ -14,6 +14,8 @@ Highlights include:
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>`
- Compatibility fixes for pypy, see :ref:`here <whatsnew_0210.pypy>`.
- ``GroupBy`` objects now have a ``pipe`` method, similar to the one on ``DataFrame`` and ``Series``,
that allows for functions that take a ``GroupBy`` to be composed in a clean, readable syntax.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a ref to the subsection

def pipe(self, func, *args, **kwargs):
""" Apply a function with arguments to this GroupBy object

.. versionadded:: 0.21.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of mostly repeated doc strings in both places, you can template _pipe and use and Appender/Substitution

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose to not do this here. The dosctrings are sufficiently different to just make it a lot harder to develop when trying to merge them in a single one with a lot of substituted parts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok thats fine

expected = pd.Series([-79.5160891089, -78.4839108911, None],
index=index)

assert_series_equal(expected, result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a tests on SeriesGroupBy

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

--------
pandas.Series.pipe
pandas.DataFrame.pipe
pandas.GroupBy.apply
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a short note on the difference here? (and also add a See also in apply to pipe, again with a short note on the difference)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried, but admittedly found it very hard to do in so few lines, see result. Do you have suggestions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like: "Apply function to each group instead of to the full GroupBy object" ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(IIUC) perhaps

``apply`` applies a function to each group. ``pipe`` applies a function to a ``GroupBy`` object.

would work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've uploaded new ones.

I've built the documentation, and pipe doesn't show up in the API. Do I need to add something somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, In groupby.rst line 1178 I have a link ":ref:here <basics.pipe>".

This link doesn't show up in the docs, can anyone see why? It seems normal

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 17, 2017

Hmm, I can see the newest code changes on my github repository, but they don't show up in the PR. Any suggestions on how to proceed?

for reference, the newest code her: topper-123@8614c32

EDIT: never mind, it works now.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2017

I think you need to add in api.rst as well. if docs build ok, ping on green.

@jreback jreback merged commit 5687f9e into pandas-dev:master Oct 18, 2017
@jreback
Copy link
Contributor

jreback commented Oct 18, 2017

thanks @ghl3 and @topper-123 for the revival

@ghl3
Copy link

ghl3 commented Oct 18, 2017

Great work, @jreback and @topper-123

yeemey pushed a commit to yeemey/pandas that referenced this pull request Oct 20, 2017
@topper-123 topper-123 deleted the groupby.pipe branch November 6, 2017 21:46
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API: Add pipe method to GroupBy objects
5 participants