Commit 4f3e569

author
tp
committed
Add GroupBy.pipe method
1 parent a85bfdc commit 4f3e569

7 files changed

+248
-24
lines changed

doc/source/groupby.rst

+48
@@ -1165,6 +1165,54 @@ See the :ref:`visualization documentation<visualization.box>` for more.
11651165
to ``df.boxplot(by="g")``. See :ref:`here<visualization.box.return>` for
11661166
an explanation.
11671167

1168+
.. _groupby.pipe:
1169+
1170+
Piping function calls
1171+
~~~~~~~~~~~~~~~~~~~~~
1172+
1173+
.. versionadded:: 0.21.0
1174+
1175+
Similar to the functionality provided by ``DataFrame`` and ``Series``, functions
1176+
that take ``GroupBy`` objects can be chained together using a ``pipe`` method to
1177+
allow for a cleaner, more readable syntax. To read about ``.pipe`` in general terms,
1178+
see :ref:`here <basics.pipe>`.
1179+
1180+
For a concrete example of combining ``.groupby`` and ``.pipe``, imagine having a
1181+
DataFrame with columns for stores, products, revenue and sold quantity. We'd like to
1182+
do a groupwise calculation of *prices* (i.e. revenue/quantity per store and per product).
1183+
We could do this in a multi-step operation, but expressing it in terms of piping can make the
1184+
code more readable.
1185+
1186+
First we set up the data:
1187+
1188+
.. ipython:: python
1189+
1190+
import numpy as np
1191+
n = 1000
1192+
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
1193+
'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
1194+
'Revenue': (np.random.random(n)*50+10).round(2),
1195+
'Quantity': np.random.randint(1, 10, size=n)})
1196+
df.head(2)
1197+
1198+
Now, to find prices per store/product, we can simply do:
1199+
1200+
.. ipython:: python
1201+
1202+
(df.groupby(['Store', 'Product'])
1203+
.pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
1204+
.unstack().round(2))
1205+
1206+
Piping is also useful when you want to pass a grouped object to some
1207+
arbitrary function, for example:
1208+
1209+
.. code-block:: python
1210+
1211+
df.groupby(['Store', 'Product']).pipe(report_func)
1212+
1213+
where ``report_func`` takes an arbitrary GroupBy object and creates a report
1214+
from that.
1215+
11681216
Examples
11691217
--------
11701218
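
The report-function pattern described in the documentation hunk above can be sketched as follows. This is a minimal illustration, not part of the commit: ``price_report`` and its ``decimals`` parameter are hypothetical names, and the small hand-written frame replaces the random data so the result is reproducible.

```python
import pandas as pd

# Small deterministic frame (an assumption; the docs use random data).
df = pd.DataFrame({
    "Store": ["Store_1", "Store_1", "Store_2", "Store_2"],
    "Product": ["Product_1", "Product_1", "Product_1", "Product_2"],
    "Revenue": [10.0, 30.0, 20.0, 45.0],
    "Quantity": [1, 3, 4, 9],
})


def price_report(grouped, decimals=2):
    """Hypothetical report function: price per group from a GroupBy object."""
    prices = grouped["Revenue"].sum() / grouped["Quantity"].sum()
    return prices.round(decimals)


# The GroupBy object is passed as the first argument of price_report;
# extra keyword arguments are forwarded unchanged.
prices = df.groupby(["Store", "Product"]).pipe(price_report, decimals=1)
```

Because ``price_report`` receives the whole GroupBy object, it can aggregate several columns in one step, which a per-group ``apply`` would need to do once per group.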

doc/source/whatsnew/v0.21.0.txt

+39
@@ -14,6 +14,8 @@ Highlights include:
1414
categoricals independent of the data, see :ref:`here <whatsnew_0210.enhancements.categorical_dtype>`.
1515
- The behavior of ``sum`` and ``prod`` on all-NaN Series/DataFrames is now consistent and no longer depends on whether `bottleneck <http://berkeleyanalytics.com/bottleneck>`__ is installed, see :ref:`here <whatsnew_0210.api_breaking.bottleneck>`
1616
- Compatibility fixes for pypy, see :ref:`here <whatsnew_0210.pypy>`.
17+
- ``GroupBy`` objects now have a ``pipe`` method, similar to the one on ``DataFrame`` and ``Series``,
18+
that allows functions that take a ``GroupBy`` to be composed in a clean, readable syntax.
1719

1820
Check the :ref:`API Changes <whatsnew_0210.api_breaking>` and :ref:`deprecations <whatsnew_0210.deprecations>` before updating.
1921

@@ -202,6 +204,43 @@ still the string ``'category'``. We'll take this moment to remind users that the
202204

203205
See the :ref:`CategoricalDtype docs <categorical.categoricaldtype>` for more.
204206

207+
.. _whatsnew_0210.enhancements.GroupBy.pipe:
208+
209+
``GroupBy`` objects now have a ``pipe`` method
210+
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
211+
212+
``GroupBy`` objects now have a ``pipe`` method, similar to the one on
213+
``DataFrame`` and ``Series``, that allows functions that take a
214+
``GroupBy`` to be composed in a clean, readable syntax. (:issue:`17871`)
215+
216+
For a concrete example of combining ``.groupby`` and ``.pipe``, imagine having a
217+
DataFrame with columns for stores, products, revenue and sold quantity. We'd like to
218+
do a groupwise calculation of *prices* (i.e. revenue/quantity per store and per product).
219+
We could do this in a multi-step operation, but expressing it in terms of piping can make the
220+
code more readable.
221+
222+
First we set up the data:
223+
224+
.. ipython:: python
225+
226+
import numpy as np
227+
n = 1000
228+
df = pd.DataFrame({'Store': np.random.choice(['Store_1', 'Store_2'], n),
229+
'Product': np.random.choice(['Product_1', 'Product_2', 'Product_3'], n),
230+
'Revenue': (np.random.random(n)*50+10).round(2),
231+
'Quantity': np.random.randint(1, 10, size=n)})
232+
df.head(2)
233+
234+
Now, to find prices per store/product, we can simply do:
235+
236+
.. ipython:: python
237+
238+
(df.groupby(['Store', 'Product'])
239+
.pipe(lambda grp: grp.Revenue.sum()/grp.Quantity.sum())
240+
.unstack().round(2))
241+
242+
See the :ref:`documentation <groupby.pipe>` for more.
243+
205244
.. _whatsnew_0210.enhancements.other:
206245

207246
Other Enhancements

pandas/core/common.py

+35
@@ -664,3 +664,38 @@ def _get_distinct_objs(objs):
664664
ids.add(id(obj))
665665
res.append(obj)
666666
return res
667+
668+
669+
def _pipe(obj, func, *args, **kwargs):
670+
"""
671+
Apply a function ``func`` to object ``obj`` either by passing obj as the
672+
first argument to the function or, in the case that the func is a tuple,
673+
interpret the first element of the tuple as a function and pass the obj to
674+
that function as a keyword argument whose key is the value of the second
675+
element of the tuple.
676+
677+
Parameters
678+
----------
679+
func : callable or tuple of (callable, string)
680+
Function to apply to this object or, alternatively, a
681+
``(callable, data_keyword)`` tuple where ``data_keyword`` is a
682+
string indicating the keyword of ``callable`` that expects the
683+
object.
684+
args : iterable, optional
685+
positional arguments passed into ``func``.
686+
kwargs : dict, optional
687+
a dictionary of keyword arguments passed into ``func``.
688+
689+
Returns
690+
-------
691+
object : the return type of ``func``.
692+
"""
693+
if isinstance(func, tuple):
694+
func, target = func
695+
if target in kwargs:
696+
msg = '%s is both the pipe target and a keyword argument' % target
697+
raise ValueError(msg)
698+
kwargs[target] = obj
699+
return func(*args, **kwargs)
700+
else:
701+
return func(obj, *args, **kwargs)
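
The two dispatch paths of the helper above can be exercised without pandas. The sketch below is a standalone copy of the same logic applied to a plain list; the names ``pipe`` and ``scale`` are hypothetical and exist only for this illustration.

```python
def pipe(obj, func, *args, **kwargs):
    """Standalone copy of the dispatch logic, for illustration only."""
    if isinstance(func, tuple):
        # Tuple form: (callable, name of the keyword that receives obj).
        func, target = func
        if target in kwargs:
            raise ValueError(
                "%s is both the pipe target and a keyword argument" % target
            )
        kwargs[target] = obj
        return func(*args, **kwargs)
    # Plain callable: obj becomes the first positional argument.
    return func(obj, *args, **kwargs)


def scale(factor, data):
    # Hypothetical function that expects the data as its second parameter.
    return [factor * x for x in data]


# Plain callable: the list is passed positionally to sum().
total = pipe([1, 2, 3], sum)

# Tuple form: the list is passed under the keyword "data".
scaled = pipe([1, 2, 3], (scale, "data"), 10)

# Passing the target keyword explicitly as well is rejected.
try:
    pipe([1, 2, 3], (scale, "data"), 10, data=[4, 5])
    clash_message = None
except ValueError as err:
    clash_message = str(err)
```

The tuple form matters when the piped object is not the first parameter of the target function, since otherwise a wrapping lambda would be needed.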

pandas/core/generic.py

+6-12
@@ -3497,8 +3497,10 @@ def sample(self, n=None, frac=None, replace=False, weights=None,
34973497
Alternatively a ``(callable, data_keyword)`` tuple where
34983498
``data_keyword`` is a string indicating the keyword of
34993499
``callable`` that expects the %(klass)s.
3500-
args : positional arguments passed into ``func``.
3501-
kwargs : a dictionary of keyword arguments passed into ``func``.
3500+
args : iterable, optional
3501+
positional arguments passed into ``func``.
3502+
kwargs : mapping, optional
3503+
a dictionary of keyword arguments passed into ``func``.
35023504
35033505
Returns
35043506
-------
@@ -3508,7 +3510,7 @@ def sample(self, n=None, frac=None, replace=False, weights=None,
35083510
-----
35093511
35103512
Use ``.pipe`` when chaining together functions that expect
3511-
on Series or DataFrames. Instead of writing
3513+
Series, DataFrames or GroupBys. Instead of writing
35123514
35133515
>>> f(g(h(df), arg1=a), arg2=b, arg3=c)
35143516
@@ -3537,15 +3539,7 @@ def sample(self, n=None, frac=None, replace=False, weights=None,
35373539

35383540
@Appender(_shared_docs['pipe'] % _shared_doc_kwargs)
35393541
def pipe(self, func, *args, **kwargs):
3540-
if isinstance(func, tuple):
3541-
func, target = func
3542-
if target in kwargs:
3543-
raise ValueError('%s is both the pipe target and a keyword '
3544-
'argument' % target)
3545-
kwargs[target] = self
3546-
return func(*args, **kwargs)
3547-
else:
3548-
return func(self, *args, **kwargs)
3542+
return com._pipe(self, func, *args, **kwargs)
35493543

35503544
_shared_docs['aggregate'] = ("""
35513545
Aggregate using callable, string, dict, or list of string/callables

pandas/core/groupby.py

+48-1
@@ -40,7 +40,7 @@
4040

4141
from pandas.core.common import (_values_from_object, AbstractMethodError,
4242
_default_index, _not_none, _get_callable_name,
43-
_asarray_tuplesafe)
43+
_asarray_tuplesafe, _pipe)
4444

4545
from pandas.core.base import (PandasObject, SelectionMixin, GroupByError,
4646
DataError, SpecificationError)
@@ -1691,6 +1691,53 @@ def tail(self, n=5):
16911691
mask = self._cumcount_array(ascending=False) < n
16921692
return self._selected_obj[mask]
16931693

1694+
def pipe(self, func, *args, **kwargs):
1695+
""" Apply a function with arguments to this GroupBy object
1696+
1697+
.. versionadded:: 0.21.0
1698+
1699+
Parameters
1700+
----------
1701+
func : callable or tuple of (callable, string)
1702+
Function to apply to this GroupBy or, alternatively, a
1703+
``(callable, data_keyword)`` tuple where ``data_keyword`` is a
1704+
string indicating the keyword of ``callable`` that expects the
1705+
GroupBy object.
1706+
args : iterable, optional
1707+
positional arguments passed into ``func``.
1708+
kwargs : dict, optional
1709+
a dictionary of keyword arguments passed into ``func``.
1710+
1711+
Returns
1712+
-------
1713+
object : the return type of ``func``.
1714+
1715+
Notes
1716+
-----
1717+
Use ``.pipe`` when chaining together functions that expect
1718+
Series, DataFrames or GroupBys. Instead of writing
1719+
1720+
>>> f(g(h(df.groupby('group')), arg1=a), arg2=b, arg3=c)
1721+
1722+
You can write
1723+
1724+
>>> (df
1725+
... .groupby('group')
1726+
... .pipe(h)
1727+
... .pipe(g, arg1=a)
1728+
... .pipe(f, arg2=b, arg3=c))
1729+
1730+
See more `here
1731+
<http://pandas.pydata.org/pandas-docs/stable/groupby.html#pipe>`_
1732+
1733+
See Also
1734+
--------
1735+
pandas.Series.pipe
1736+
pandas.DataFrame.pipe
1737+
pandas.GroupBy.apply
1738+
"""
1739+
return _pipe(self, func, *args, **kwargs)
1740+
16941741

16951742
GroupBy._add_numeric_operations()
16961743
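
The equivalence between nested calls and a ``pipe`` chain, which the Notes section of the docstring describes, can be checked with a short sketch. The functions ``h``, ``g`` and ``f`` and the sample frame are hypothetical; the point is only that the innermost call of the nested form comes first in the chain.

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"], "x": [1.0, 2.0, 3.0]})


def h(grouped):
    # GroupBy -> Series
    return grouped["x"].sum()


def g(series, arg1):
    # Series -> Series
    return series * arg1


def f(series, arg2, arg3):
    # Series -> Series
    return series + arg2 - arg3


# Nested form: h runs first, then g, then f.
nested = f(g(h(df.groupby("group")), arg1=2), arg2=10, arg3=1)

# Piped form: same order of execution, read top to bottom.
# The first pipe is GroupBy.pipe; the later ones are Series.pipe.
piped = (df
         .groupby("group")
         .pipe(h)
         .pipe(g, arg1=2)
         .pipe(f, arg2=10, arg3=1))
```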

pandas/tests/groupby/test_groupby.py

+61
@@ -3762,6 +3762,67 @@ def test_gb_key_len_equal_axis_len(self):
37623762
assert df.loc[('foo', 'bar', 'B')] == 2
37633763
assert df.loc[('foo', 'baz', 'C')] == 1
37643764

3765+
def test_pipe(self):
3766+
# Test the pipe method of DataFrameGroupBy.
3767+
# Issue #17871
3768+
3769+
random_state = np.random.RandomState(1234567890)
3770+
3771+
df = DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
3772+
'foo', 'bar', 'foo', 'foo'],
3773+
'B': random_state.randn(8),
3774+
'C': random_state.randn(8)})
3775+
3776+
def f(dfgb):
3777+
return dfgb.B.max() - dfgb.C.min().min()
3778+
3779+
def square(srs):
3780+
return srs ** 2
3781+
3782+
# Note that the transformations are
3783+
# GroupBy -> Series
3784+
# Series -> Series
3785+
# This then chains the GroupBy.pipe and the
3786+
# NDFrame.pipe methods
3787+
result = df.groupby('A').pipe(f).pipe(square)
3788+
3789+
index = Index([u'bar', u'foo'], dtype='object', name=u'A')
3790+
expected = pd.Series([8.99110003361, 8.17516964785], name='B',
3791+
index=index)
3792+
3793+
assert_series_equal(expected, result)
3794+
3795+
def test_pipe_args(self):
3796+
# Test passing args to the pipe method of DataFrameGroupBy.
3797+
# Issue #17871
3798+
3799+
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C'],
3800+
'x': [1.0, 2.0, 3.0, 2.0, 5.0],
3801+
'y': [10.0, 100.0, 1000.0, -100.0, -1000.0]})
3802+
3803+
def f(dfgb, arg1):
3804+
return (dfgb.filter(lambda grp: grp.y.mean() > arg1, dropna=False)
3805+
.groupby(dfgb.grouper))
3806+
3807+
def g(dfgb, arg2):
3808+
return dfgb.sum() / dfgb.sum().sum() + arg2
3809+
3810+
def h(df, arg3):
3811+
return df.x + df.y - arg3
3812+
3813+
result = (df
3814+
.groupby('group')
3815+
.pipe(f, 0)
3816+
.pipe(g, 10)
3817+
.pipe(h, 100))
3818+
3819+
# Assert the results here
3820+
index = pd.Index(['A', 'B', 'C'], name='group')
3821+
expected = pd.Series([-79.5160891089, -78.4839108911, None],
3822+
index=index)
3823+
3824+
assert_series_equal(expected, result)
3825+
37653826

37663827
def _check_groupby(df, result, keys, field, f=lambda x: x.sum()):
37673828
tups = lmap(tuple, df[keys].values)

pandas/tests/groupby/test_whitelist.py

+11-11
@@ -239,17 +239,17 @@ def test_groupby_blacklist(df_letters):
239239
def test_tab_completion(mframe):
240240
grp = mframe.groupby(level='second')
241241
results = set([v for v in dir(grp) if not v.startswith('_')])
242-
expected = set(
243-
['A', 'B', 'C', 'agg', 'aggregate', 'apply', 'boxplot', 'filter',
244-
'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max',
245-
'mean', 'median', 'min', 'ngroups', 'nth', 'ohlc', 'plot',
246-
'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count',
247-
'nunique', 'head', 'describe', 'cummax', 'quantile',
248-
'rank', 'cumprod', 'tail', 'resample', 'cummin', 'fillna',
249-
'cumsum', 'cumcount', 'ngroup', 'all', 'shift', 'skew',
250-
'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith',
251-
'cov', 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin',
252-
'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding'])
242+
expected = {
243+
'A', 'B', 'C', 'agg', 'aggregate', 'apply', 'boxplot', 'filter',
244+
'first', 'get_group', 'groups', 'hist', 'indices', 'last', 'max',
245+
'mean', 'median', 'min', 'ngroups', 'nth', 'ohlc', 'plot',
246+
'prod', 'size', 'std', 'sum', 'transform', 'var', 'sem', 'count',
247+
'nunique', 'head', 'describe', 'cummax', 'quantile',
248+
'rank', 'cumprod', 'tail', 'resample', 'cummin', 'fillna',
249+
'cumsum', 'cumcount', 'ngroup', 'all', 'shift', 'skew',
250+
'take', 'tshift', 'pct_change', 'any', 'mad', 'corr', 'corrwith',
251+
'cov', 'dtypes', 'ndim', 'diff', 'idxmax', 'idxmin',
252+
'ffill', 'bfill', 'pad', 'backfill', 'rolling', 'expanding', 'pipe'}
253253
assert results == expected
254254

255255
