Commit b9507a2

ENH: Add documentation for fast_apply on first row
1 parent 90d5c5c commit b9507a2

2 files changed: +92 -8 lines changed

doc/source/user_guide/groupby.rst

+28 -8
@@ -948,18 +948,38 @@ that is itself a series, and possibly upcast the result to a DataFrame:
 
 .. warning::
 
-    In the current implementation apply calls func twice on the
-    first group to decide whether it can take a fast or slow code
-    path. This can lead to unexpected behavior if func has
-    side-effects, as they will take effect twice for the first
-    group.
+    The current implementation uses a cythonized code path which requires
+    that the input data is not modified inplace. The heuristic assumes that
+    this might be happening if ``func(group) is group`` in which case we fall
+    back to a slow code path which evaluates func on the first group a second
+    time.
+    This can lead to unexpected behavior if func has side-effects,
+    as they will take effect twice for the first group.
+    This behavior is
 
     .. ipython:: python
 
         d = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
-        def identity(df):
-            print(df)
-            return df
+
+        def func_fast_apply(group):
+            """
+            This func doesn't modify inplace and returns
+            a scalar which is safe to fast apply
+            """
+            print(group.name)
+            return len(group)
+
+        d.groupby("a").apply(func_fast_apply)
+
+        def identity(group):
+            """
+            This triggers the slow path because ``identity(group) is group``
+            If there is no inplace modification happening
+            this may be avoided by returning a shallow copy
+            i.e. return group.copy()
+            """
+            print(group.name)
+            return group
 
         d.groupby("a").apply(identity)
 
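
The fallback that the new warning text describes can be modelled without pandas. The following is a toy sketch only, not the pandas implementation: `toy_apply`, the dict-of-lists groups, and the three helper functions are hypothetical stand-ins chosen to mirror the wording of the warning (try the first group once, and start over on a slow path if `func(group) is group`).

```python
# Toy model of the heuristic described in the warning text.
# NOT pandas internals: toy_apply and its fast/slow split are hypothetical.

def toy_apply(groups, func):
    """Apply func to each (name, rows) pair, mimicking the described fallback."""
    names = list(groups)
    if not names:
        return {}
    first = names[0]
    # Fast-path attempt: evaluate func once on the first group.
    first_result = func(first, groups[first])
    if first_result is groups[first]:
        # func returned the very same object, so it may have modified its
        # input in place. Fall back to the slow path, which starts over and
        # therefore evaluates func on the first group a second time.
        return {name: func(name, groups[name]) for name in names}
    # Fast path: keep the first result and evaluate the other groups once.
    results = {first: first_result}
    for name in names[1:]:
        results[name] = func(name, groups[name])
    return results


calls = []  # records the side effect of every evaluation


def count_rows(name, rows):
    calls.append(name)
    return len(rows)  # a scalar, never the input object


def identity(name, rows):
    calls.append(name)
    return rows  # same object as the input, triggers the fallback


def shallow_copy(name, rows):
    calls.append(name)
    return list(rows)  # equal content, but a new object


groups = {"x": [1], "y": [2]}

toy_apply(groups, count_rows)
fast_calls = list(calls)

calls.clear()
toy_apply(groups, identity)
slow_calls = list(calls)

calls.clear()
toy_apply(groups, shallow_copy)
copy_calls = list(calls)
```

In this model only `identity` records the first group twice. Returning a copy keeps the result equal to the input while failing the `is` check, which is the workaround the `identity` docstring in the diff suggests.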

doc/source/whatsnew/v0.25.0.rst

+64
@@ -26,6 +26,70 @@ Other Enhancements
 Backwards incompatible API changes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+Fast GroupBy.apply on ``DataFrame`` evaluates first group only once
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+(:issue:`2936`, :issue:`2656`, :issue:`7739`, :issue:`10519`, :issue:`12155`,
+:issue:`20084`, :issue:`21417`)
+
+The implementation of ``DataFrame.groupby.apply`` previously evaluated func
+consistently twice on the first group to infer if it is safe to use a fast
+code path. Particularly for functions with side effects, this was an undesired
+behavior and may have led to surprises.
+
+The new behavior is that the first group is no longer evaluated twice if the
+fast path can be used.
+
+Previous behavior:
+
+.. code-block:: ipython
+
+    In [2]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
+
+    In [3]: side_effects = []
+
+    In [4]: def func_fast_apply(group):
+       ...:     side_effects.append(group.name)
+       ...:     return len(group)
+       ...:
+
+    In [5]: df.groupby("a").apply(func_fast_apply)
+
+    In [6]: assert side_effects == ["x", "x", "y"]
+
+New behavior:
+
+.. ipython:: python
+
+    df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
+
+    side_effects = []
+    def func_fast_apply(group):
+        """
+        This func doesn't modify inplace and returns
+        a scalar which is safe to fast apply
+        """
+        side_effects.append(group.name)
+        return len(group)
+
+    df.groupby("a").apply(func_fast_apply)
+    side_effects
+
+    side_effects.clear()
+    def identity(group):
+        """
+        This triggers the slow path because ``identity(group) is group``
+        If there is no inplace modification happening
+        this may be avoided by returning a shallow copy
+        i.e. return group.copy()
+        """
+        side_effects.append(group.name)
+        return group
+
+    df.groupby("a").apply(identity)
+    side_effects
+
+
 .. _whatsnew_0250.api.other:
 
 Other API Changes
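
The whatsnew entry detects repeated evaluation by appending `group.name` to a list. A reusable variant of that idea is sketched below. Everything in it is an assumption made for illustration: `count_group_calls` is a hypothetical helper, and the `SimpleNamespace` objects merely stand in for groups so that the snippet runs without pandas. The only property it relies on is that each group exposes a `name` attribute, as the examples in the diff do.

```python
import functools
from collections import Counter
from types import SimpleNamespace


def count_group_calls(func):
    """Wrap func so that every call is tallied under the group's ``name``."""
    counts = Counter()

    @functools.wraps(func)
    def wrapper(group, *args, **kwargs):
        counts[group.name] += 1
        return func(group, *args, **kwargs)

    wrapper.counts = counts  # expose the tally for later inspection
    return wrapper


@count_group_calls
def func_fast_apply(group):
    # Stand-in body; with a real DataFrame group this would be len(group).
    return len(group.rows)


# Hypothetical stand-in groups, used here instead of a pandas groupby.
fake_groups = [
    SimpleNamespace(name="x", rows=[1]),
    SimpleNamespace(name="y", rows=[2]),
]

results = [func_fast_apply(group) for group in fake_groups]

# Any group name evaluated more than once would show up here.
repeated = {name: n for name, n in func_fast_apply.counts.items() if n > 1}
```

Passing such a wrapped function to `df.groupby("a").apply(...)` and then inspecting `.counts` would be one way to check how often a given pandas version evaluates each group, instead of relying on the list comparison shown in the "Previous behavior" block.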
