
Commit f927c50

topper-123 authored and alanbato committed
Add GroupBy.pipe method (pandas-dev#17871)
1 parent 05c4c2d commit f927c50

File tree

8 files changed

+267
-28
lines changed


doc/source/api.rst

+1
Original file line number | Diff line number | Diff line change
@@ -1973,6 +1973,7 @@ Function application
19731973
GroupBy.apply
19741974
GroupBy.aggregate
19751975
GroupBy.transform
1976+
GroupBy.pipe
19761977

19771978
Computations / Descriptive Stats
19781979
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

doc/source/groupby.rst

+49
@@ -1165,6 +1165,55 @@ See the :ref:`visualization documentation<visualization.box>` for more.
11651165
to ``df.boxplot(by="g")``. See :ref:`here<visualization.box.return>` for
11661166
an explanation.
11671167

1168+
.. _groupby.pipe:
1169+
1170+
Piping function calls
1171+
~~~~~~~~~~~~~~~~~~~~~
1172+
1173+
.. versionadded:: 0.21.0
1174+
1175+
Similar to the functionality provided by ``DataFrame`` and ``Series``, functions
1176+
that take ``GroupBy`` objects can be chained together using a ``pipe`` method to
1177+
allow for a cleaner, more readable syntax. To read about ``.pipe`` in general terms,
1178+
see :ref:`here <basics.pipe>`.
1179+
1180+
Combining ``.groupby`` and ``.pipe`` is often useful when you need to reuse
1181+
GroupBy objects.
1182+
1183+
For an example, imagine having a DataFrame with columns for stores, products,
1184+
revenue and sold quantity. We'd like to do a groupwise calculation of *prices*
1185+
(i.e. revenue/quantity) per store and per product. We could do this in a
1186+
multi-step operation, but expressing it in terms of piping can make the
1187+
code more readable. First we set up the data:
1188+
1189+
.. ipython:: python
1190+
1191+
import numpy as np
1192+
n = 1000
1193+
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
1194+
'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
1195+
'Revenue': (np.random.random(n)*50+10).round(2),
1196+
'Quantity': np.random.randint(1, 10, size=n)})
1197+
df.head(2)
1198+
1199+
Now, to find prices per store/product, we can simply do:
1200+
1201+
.. ipython:: python
1202+
1203+
(df.groupby(['Store', 'Product'])
1204+
.pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
1205+
.unstack().round(2))
1206+
1207+
Piping can also be expressive when you want to deliver a grouped object to some
1208+
arbitrary function, for example:
1209+
1210+
.. code-block:: python
1211+
1212+
df.groupby(['Store', 'Product']).pipe(report_func)
1213+
1214+
where ``report_func`` takes a GroupBy object and creates a report
1215+
from that.
1216+
11681217
Examples
11691218
--------
11701219
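
The claim in the new documentation section that the piped form matches a multi-step calculation can be checked with a small sketch. The five-row frame below is a deterministic stand-in for the random data used in the docs, and the names `grouped`, `prices_multi_step` and `prices_piped` are introduced here for illustration only.

```python
import pandas as pd

# Small deterministic stand-in for the random data used in the docs.
df = pd.DataFrame({
    "Store": ["Store_1", "Store_1", "Store_2", "Store_2", "Store_1"],
    "Product": ["Product_1", "Product_2", "Product_1", "Product_2", "Product_1"],
    "Revenue": [20.0, 30.0, 40.0, 18.0, 10.0],
    "Quantity": [2, 3, 4, 6, 3],
})

# Multi-step version: the GroupBy object needs a name so it can be reused.
grouped = df.groupby(["Store", "Product"])
prices_multi_step = (
    (grouped.Revenue.sum() / grouped.Quantity.sum()).unstack().round(2)
)

# Piped version: the GroupBy object is handed to the lambda once and
# reused inside it, so no intermediate name is required.
prices_piped = (
    df.groupby(["Store", "Product"])
    .pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
    .unstack()
    .round(2)
)

print(prices_piped)
```

Both versions should produce the same store-by-product table; the piped one simply keeps the whole calculation in a single chain.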

doc/source/whatsnew/v0.21.0.txt

+39
@@ -14,6 +14,8 @@ Highlights include:
1414
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
1515
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>`
1616
- Compatibility fixes for pypy, see :ref:`here <whatsnew_0210.pypy>`.
17+
- ``GroupBy`` objects now have a ``pipe`` method, similar to the one on ``DataFrame`` and ``Series``.
18+
This allows for functions that take a ``GroupBy`` to be composed in a clean, readable syntax, see :ref:`here <whatsnew_0210.enhancements.GroupBy_pipe>`.
1719

1820
Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.
1921

@@ -202,6 +204,43 @@ still the string ``'category'``. We'll take this moment to remind users that the
202204

203205
See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
204206

207+
.. _whatsnew_0210.enhancements.GroupBy_pipe:
208+
209+
``GroupBy`` objects now have a ``pipe`` method
210+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
211+
212+
``GroupBy`` objects now have a ``pipe`` method, similar to the one on
213+
``DataFrame`` and ``Series``, that allows functions that take a
214+
``GroupBy`` to be composed in a clean, readable syntax. (:issue:`17871`)
215+
216+
For a concrete example of combining ``.groupby`` and ``.pipe``, imagine having a
217+
DataFrame with columns for stores, products, revenue and sold quantity. We'd like to
218+
do a groupwise calculation of *prices* (i.e. revenue/quantity) per store and per product.
219+
We could do this in a multi-step operation, but expressing it in terms of piping can make the
220+
code more readable.
221+
222+
First we set up the data:
223+
224+
.. ipython:: python
225+
226+
import numpy as np
227+
n = 1000
228+
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
229+
'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
230+
'Revenue': (np.random.random(n)*50+10).round(2),
231+
'Quantity': np.random.randint(1, 10, size=n)})
232+
df.head(2)
233+
234+
Now, to find prices per store/product, we can simply do:
235+
236+
.. ipython:: python
237+
238+
(df.groupby(['Store', 'Product'])
239+
.pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
240+
.unstack().round(2))
241+
242+
See the :ref:`documentation <groupby.pipe>` for more.
243+
205244
.. _whatsnew_0210.enhancements.other:
206245

207246
Other Enhancements

pandas/core/common.py

+35
@@ -664,3 +664,38 @@ def _get_distinct_objs(objs):
664664
ids.add(id(obj))
665665
res.append(obj)
666666
return res
667+
668+
669+
def _pipe(obj, func, *args, **kwargs):
670+
"""
671+
Apply a function ``func`` to object ``obj`` either by passing obj as the
672+
first argument to the function or, in the case that the func is a tuple,
673+
interpret the first element of the tuple as a function and pass the obj to
674+
that function as a keyword argument whose key is the value of the second
675+
element of the tuple.
676+
677+
Parameters
678+
----------
679+
func : callable or tuple of (callable, string)
680+
Function to apply to this object or, alternatively, a
681+
``(callable, data_keyword)`` tuple where ``data_keyword`` is a
682+
string indicating the keyword of ``callable`` that expects the
683+
object.
684+
args : iterable, optional
685+
positional arguments passed into ``func``.
686+
kwargs : dict, optional
687+
a dictionary of keyword arguments passed into ``func``.
688+
689+
Returns
690+
-------
691+
object : the return type of ``func``.
692+
"""
693+
if isinstance(func, tuple):
694+
func, target = func
695+
if target in kwargs:
696+
msg = '%s is both the pipe target and a keyword argument' % target
697+
raise ValueError(msg)
698+
kwargs[target] = obj
699+
return func(*args, **kwargs)
700+
else:
701+
return func(obj, *args, **kwargs)
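
To make the two dispatch paths of the helper concrete, here is a stand-alone sketch that copies its logic into a function named `pipe_sketch` (a name chosen here, not part of pandas) and exercises it with plain lists. The `scale` helper is hypothetical and exists only to show a callable whose data argument is not the first parameter.

```python
def pipe_sketch(obj, func, *args, **kwargs):
    """Stand-alone copy of the dispatch logic, for illustration only."""
    if isinstance(func, tuple):
        func, target = func
        if target in kwargs:
            raise ValueError(
                "%s is both the pipe target and a keyword argument" % target
            )
        kwargs[target] = obj
        return func(*args, **kwargs)
    return func(obj, *args, **kwargs)


def scale(factor, data):
    # Hypothetical helper whose data argument is NOT the first parameter.
    return [factor * x for x in data]


# Plain callable: obj becomes the first positional argument.
total = pipe_sketch([1, 2, 3], sum)

# (callable, keyword) tuple: obj is passed as the keyword "data".
scaled = pipe_sketch([1, 2, 3], (scale, "data"), 10)

# Supplying the target keyword explicitly as well is rejected.
try:
    pipe_sketch([1, 2, 3], (scale, "data"), 10, data=[0])
except ValueError as exc:
    error_message = str(exc)

print(total, scaled, error_message)
```

The tuple form is what lets a function be piped even when the object it operates on is not its first parameter.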

pandas/core/generic.py

+6 -12
@@ -3497,8 +3497,10 @@ def sample(self, n=None, frac=None, replace=False, weights=None,
34973497
Alternatively a ``(callable, data_keyword)`` tuple where
34983498
``data_keyword`` is a string indicating the keyword of
34993499
``callable`` that expects the %(klass)s.
3500-
args : positional arguments passed into ``func``.
3501-
kwargs : a dictionary of keyword arguments passed into ``func``.
3500+
args : iterable, optional
3501+
positional arguments passed into ``func``.
3502+
kwargs : mapping, optional
3503+
a dictionary of keyword arguments passed into ``func``.
35023504
35033505
Returns
35043506
-------
@@ -3508,7 +3510,7 @@ def sample(self, n=None, frac=None, replace=False, weights=None,
35083510
-----
35093511
35103512
Use ``.pipe`` when chaining together functions that expect
3511-
on Series or DataFrames. Instead of writing
3513+
Series, DataFrames or GroupBy objects. Instead of writing
35123514
35133515
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
35143516
@@ -3537,15 +3539,7 @@ def sample(self, n=None, frac=None, replace=False, weights=None,
35373539

35383540
@Appender(_shared_docs['pipe'] % _shared_doc_kwargs)
35393541
def pipe(self, func, *args, **kwargs):
3540-
if isinstance(func, tuple):
3541-
func, target = func
3542-
if target in kwargs:
3543-
raise ValueError('%s is both the pipe target and a keyword '
3544-
'argument' % target)
3545-
kwargs[target] = self
3546-
return func(*args, **kwargs)
3547-
else:
3548-
return func(self, *args, **kwargs)
3542+
return com._pipe(self, func, *args, **kwargs)
35493543

35503544
_shared_docs['aggregate'] = ("""
35513545
Aggregate using callable, string, dict, or list of string/callables
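
The equivalence described in the shared docstring's notes (nested calls versus a chain of `.pipe` calls) can be sketched as follows. The functions `f`, `g` and `h` and their arithmetic are placeholders invented for this illustration; only the call shapes mirror the docstring.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})


def h(frame):
    # Innermost call in the nested form, so first in the chain.
    return frame * 2


def g(frame, arg1):
    return frame + arg1


def f(frame, arg2, arg3):
    # Outermost call in the nested form, so last in the chain.
    return (frame - arg2) / arg3


# Nested form: reads inside-out.
nested = f(g(h(df), arg1=1), arg2=3, arg3=2)

# Chained form: reads top-to-bottom in execution order.
chained = df.pipe(h).pipe(g, arg1=1).pipe(f, arg2=3, arg3=2)

print(chained)
```

Note that the innermost function of the nested form is the first one in the chain, which is the main readability gain.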

pandas/core/groupby.py

+57 -5
@@ -40,7 +40,7 @@
4040

4141
from pandas.core.common import (_values_from_object, AbstractMethodError,
4242
_default_index, _not_none, _get_callable_name,
43-
_asarray_tuplesafe)
43+
_asarray_tuplesafe, _pipe)
4444

4545
from pandas.core.base import (PandasObject, SelectionMixin, GroupByError,
4646
DataError, SpecificationError)
@@ -656,9 +656,10 @@ def __iter__(self):
656656
@Substitution(name='groupby')
657657
def apply(self, func, *args, **kwargs):
658658
"""
659-
Apply function and combine results together in an intelligent way. The
660-
split-apply-combine combination rules attempt to be as common sense
661-
based as possible. For example:
659+
Apply function and combine results together in an intelligent way.
660+
661+
The split-apply-combine combination rules attempt to be as common
662+
sense based as possible. For example:
662663
663664
case 1:
664665
group DataFrame
@@ -692,7 +693,10 @@ def apply(self, func, *args, **kwargs):
692693
693694
See also
694695
--------
695-
aggregate, transform"""
696+
pipe : Apply function to the full GroupBy object instead of to each
697+
group.
698+
aggregate, transform
699+
"""
696700

697701
func = self._is_builtin_func(func)
698702

@@ -1691,6 +1695,54 @@ def tail(self, n=5):
16911695
mask = self._cumcount_array(ascending=False) < n
16921696
return self._selected_obj[mask]
16931697

1698+
def pipe(self, func, *args, **kwargs):
1699+
""" Apply a function with arguments to this GroupBy object.
1700+
1701+
.. versionadded:: 0.21.0
1702+
1703+
Parameters
1704+
----------
1705+
func : callable or tuple of (callable, string)
1706+
Function to apply to this GroupBy object or, alternatively, a
1707+
``(callable, data_keyword)`` tuple where ``data_keyword`` is a
1708+
string indicating the keyword of ``callable`` that expects the
1709+
GroupBy object.
1710+
args : iterable, optional
1711+
positional arguments passed into ``func``.
1712+
kwargs : dict, optional
1713+
a dictionary of keyword arguments passed into ``func``.
1714+
1715+
Returns
1716+
-------
1717+
object : the return type of ``func``.
1718+
1719+
Notes
1720+
-----
1721+
Use ``.pipe`` when chaining together functions that expect
1722+
Series, DataFrames or GroupBy objects. Instead of writing
1723+
1724+
>>> f(g(h(df.groupby('group')), arg1=a), arg2=b, arg3=c)
1725+
1726+
You can write
1727+
1728+
>>> (df
1729+
... .groupby('group')
1730+
... .pipe(h)
1731+
... .pipe(g, arg1=a)
1732+
... .pipe(f, arg2=b, arg3=c))
1733+
1734+
See more `here
1735+
<http://pandas.pydata.org/pandas-docs/stable/groupby.html#pipe>`_
1736+
1737+
See Also
1738+
--------
1739+
pandas.Series.pipe : Apply a function with arguments to a series
1740+
pandas.DataFrame.pipe : Apply a function with arguments to a dataframe
1741+
apply : Apply function to each group instead of to the
1742+
full GroupBy object.
1743+
"""
1744+
return _pipe(self, func, *args, **kwargs)
1745+
16941746

16951747
GroupBy._add_numeric_operations()
16961748
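
The "See also" entries above contrast `pipe` (the function receives the full GroupBy object) with `apply` (the function is called per group). A minimal sketch of that difference, using a made-up frame and hypothetical helpers `spread` and `spread_of`:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "x": [1, 4, 2, 7, 3],
})
grouped = df.groupby("group")


def spread(gb):
    # Called exactly once, with the whole GroupBy object.
    return gb["x"].max() - gb["x"].min()


via_pipe = grouped.pipe(spread)

# apply, by contrast, calls the function once per group with that
# group's data (here a Series of x values).
via_apply = grouped["x"].apply(lambda s: s.max() - s.min())


def spread_of(column, gb):
    # The GroupBy object is not the first parameter here.
    return gb[column].max() - gb[column].min()


# Tuple form: the GroupBy object is passed as the keyword "gb".
via_tuple = grouped.pipe((spread_of, "gb"), "x")

print(via_pipe)
```

All three results should agree here; `pipe` is the natural choice when the function needs several aggregations of the same GroupBy object.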

pandas/tests/groupby/test_groupby.py

+69
@@ -3762,6 +3762,75 @@ def test_gb_key_len_equal_axis_len(self):
37623762
assert df.loc[('foo', 'bar', 'B')] == 2
37633763
assert df.loc[('foo', 'baz', 'C')] == 1
37643764

3765+
def test_pipe(self):
3766+
# Test the pipe method of DataFrameGroupBy.
3767+
# Issue #17871
3768+
3769+
random_state = np.random.RandomState(1234567890)
3770+
3771+
df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
3772+
'foo', 'bar', 'foo', 'foo'],
3773+
'B': random_state.randn(8),
3774+
'C': random_state.randn(8)})
3775+
3776+
def f(dfgb):
3777+
return dfgb.B.max() - dfgb.C.min().min()
3778+
3779+
def square(srs):
3780+
return srs ** 2
3781+
3782+
# Note that the transformations are
3783+
# GroupBy -> Series
3784+
# Series -> Series
3785+
# This then chains the GroupBy.pipe and the
3786+
# NDFrame.pipe methods
3787+
result = df.groupby('A').pipe(f).pipe(square)
3788+
3789+
index = Index([u'bar', u'foo'], dtype='object', name=u'A')
3790+
expected = pd.Series([8.99110003361, 8.17516964785], name='B',
3791+
index=index)
3792+
3793+
assert_series_equal(expected, result)
3794+
3795+
def test_pipe_args(self):
3796+
# Test passing args to the pipe method of DataFrameGroupBy.
3797+
# Issue #17871
3798+
3799+
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C'],
3800+
'x': [1.0, 2.0, 3.0, 2.0, 5.0],
3801+
'y': [10.0, 100.0, 1000.0, -100.0, -1000.0]})
3802+
3803+
def f(dfgb, arg1):
3804+
return (dfgb.filter(lambda grp: grp.y.mean() > arg1, dropna=False)
3805+
.groupby(dfgb.grouper))
3806+
3807+
def g(dfgb, arg2):
3808+
return dfgb.sum() / dfgb.sum().sum() + arg2
3809+
3810+
def h(df, arg3):
3811+
return df.x + df.y - arg3
3812+
3813+
result = (df
3814+
.groupby('group')
3815+
.pipe(f, 0)
3816+
.pipe(g, 10)
3817+
.pipe(h, 100))
3818+
3819+
# Assert the results here
3820+
index = pd.Index(['A', 'B', 'C'], name='group')
3821+
expected = pd.Series([-79.5160891089, -78.4839108911, None],
3822+
index=index)
3823+
3824+
assert_series_equal(expected, result)
3825+
3826+
# test SeriesGroupby.pipe
3827+
ser = pd.Series([1, 1, 2, 2, 3, 3])
3828+
result = ser.groupby(ser).pipe(lambda grp: grp.sum() * grp.count())
3829+
3830+
expected = pd.Series([4, 8, 12], index=pd.Int64Index([1, 2, 3]))
3831+
3832+
assert_series_equal(result, expected)
3833+
37653834

37663835
def _check_groupby(df, result, keys, field, f=lambda x: x.sum()):
37673836
tups = lmap(tuple, df[keys].values)
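
For readers who want to run the SeriesGroupBy check from the tests above outside the pandas test suite, here is a stand-alone sketch. It assumes a recent pandas and therefore uses `pd.Index` and `pd.testing.assert_series_equal` rather than `pd.Int64Index` and the test module's own import, since `Int64Index` is no longer available in pandas 2.x.

```python
import pandas as pd

ser = pd.Series([1, 1, 2, 2, 3, 3])

# Group the Series by its own values and combine two aggregations of
# the same SeriesGroupBy object inside a single pipe call.
result = ser.groupby(ser).pipe(lambda grp: grp.sum() * grp.count())

# Each value v appears twice, so sum * count is (2 * v) * 2.
expected = pd.Series([4, 8, 12], index=pd.Index([1, 2, 3]))

pd.testing.assert_series_equal(result, expected)
print(result)
```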

pandas/tests/groupby/test_whitelist.py

+11 -11
@@ -239,17 +239,17 @@ def test_groupby_blacklist(df_letters):
239239
def test_tab_completion(mframe):
240240
grp = mframe.groupby(level='second')
241241
results = set([v for v in dir(grp) if not v.startswith('_')])
242-
expected = set(
243-
['A', 'B', 'C', 'agg', 'aggregate', 'apply', 'boxplot', 'filter',
244-
'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max',
245-
'mean', 'median', 'min', 'ngroups', 'nth', 'ohlc', 'plot',
246-
'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count',
247-
'nunique', 'head', 'describe', 'cummax', 'quantile',
248-
'rank', 'cumprod', 'tail', 'resample', 'cummin', 'fillna',
249-
'cumsum', 'cumcount', 'ngroup', 'all', 'shift', 'skew',
250-
'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith',
251-
'cov', 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin',
252-
'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding'])
242+
expected = {
243+
'A', 'B', 'C', 'agg', 'aggregate', 'apply', 'boxplot', 'filter',
244+
'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max',
245+
'mean', 'median', 'min', 'ngroups', 'nth', 'ohlc', 'plot',
246+
'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count',
247+
'nunique', 'head', 'describe', 'cummax', 'quantile',
248+
'rank', 'cumprod', 'tail', 'resample', 'cummin', 'fillna',
249+
'cumsum', 'cumcount', 'ngroup', 'all', 'shift', 'skew',
250+
'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith',
251+
'cov', 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin',
252+
'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding', 'pipe'}
253253
assert results == expected
254254

255255
