
Commit 9086bde

Backport PR #51626 on branch 2.0.x (DOC: Improvements to groupby.rst) (#51642)
Backport PR #51626: DOC: Improvements to groupby.rst

Co-authored-by: Richard Shadrach <[email protected]>
1 parent 608c2d8 commit 9086bde

File tree

1 file changed: +43 -37 lines changed


doc/source/user_guide/groupby.rst

@@ -36,9 +36,22 @@ following:
 * Discard data that belongs to groups with only a few members.
 * Filter out data based on the group sum or mean.
 
-* Some combination of the above: GroupBy will examine the results of the apply
-  step and try to return a sensibly combined result if it doesn't fit into
-  either of the above two categories.
+Many of these operations are defined on GroupBy objects. These operations are similar
+to the :ref:`aggregating API <basics.aggregate>`, :ref:`window API <window.overview>`,
+and :ref:`resample API <timeseries.aggregate>`.
+
+It is possible that a given operation does not fall into one of these categories or
+is some combination of them. In such a case, it may be possible to compute the
+operation using GroupBy's ``apply`` method. This method will examine the results of the
+apply step and try to return a sensibly combined result if it doesn't fit into either
+of the above two categories.
+
+.. note::
+
+    An operation that is split into multiple steps using built-in GroupBy operations
+    will be more efficient than using the ``apply`` method with a user-defined Python
+    function.
+
 
 Since the set of object instance methods on pandas data structures are generally
 rich and expressive, we often simply want to invoke, say, a DataFrame function
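The note introduced by this hunk recommends combining built-in GroupBy operations over ``apply`` with a user-defined Python function. A minimal sketch of the two equivalent spellings (the frame and names below are illustrative, not from the patch):

```python
import pandas as pd

# Illustrative frame, not from the patch.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# Built-in GroupBy reduction: runs in pandas' optimized code paths.
builtin = df.groupby("key")["val"].sum()

# Same answer via ``apply`` with a user-defined function: each group's
# values pass through a Python-level call, which is slower.
via_apply = df.groupby("key")["val"].apply(lambda s: s.sum())

print(builtin.to_dict())  # {'a': 3, 'b': 3}
```

Both spellings produce the same Series; the note is purely about performance.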
@@ -68,7 +81,7 @@ object (more on what the GroupBy object is later), you may do the following:
 
 .. ipython:: python
 
-    df = pd.DataFrame(
+    speeds = pd.DataFrame(
         [
             ("bird", "Falconiformes", 389.0),
             ("bird", "Psittaciformes", 24.0),
@@ -79,12 +92,12 @@ object (more on what the GroupBy object is later), you may do the following:
         index=["falcon", "parrot", "lion", "monkey", "leopard"],
         columns=("class", "order", "max_speed"),
     )
-    df
+    speeds
 
     # default is axis=0
-    grouped = df.groupby("class")
-    grouped = df.groupby("order", axis="columns")
-    grouped = df.groupby(["class", "order"])
+    grouped = speeds.groupby("class")
+    grouped = speeds.groupby("order", axis="columns")
+    grouped = speeds.groupby(["class", "order"])
 
 The mapping can be specified many different ways:
 
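The two hunks above rename ``df`` to ``speeds``. A runnable reconstruction using only the rows visible in the diff (the remaining rows are elided between the hunks, so the index is shortened to match):

```python
import pandas as pd

# Only the two rows shown in the hunks; the rest are elided in the diff,
# so the index here is shortened accordingly.
speeds = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
    ],
    index=["falcon", "parrot"],
    columns=("class", "order", "max_speed"),
)

# Grouping by a column, as in the renamed example.
grouped = speeds.groupby("class")
print(grouped["max_speed"].max().to_dict())  # {'bird': 389.0}
```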
@@ -1052,18 +1065,21 @@ The ``nlargest`` and ``nsmallest`` methods work on ``Series`` style groupbys:
 Flexible ``apply``
 ------------------
 
-Some operations on the grouped data might not fit into either the aggregate or
-transform categories. Or, you may simply want GroupBy to infer how to combine
-the results. For these, use the ``apply`` function, which can be substituted
-for both ``aggregate`` and ``transform`` in many standard use cases. However,
-``apply`` can handle some exceptional use cases.
+Some operations on the grouped data might not fit into the aggregation,
+transformation, or filtration categories. For these, you can use the ``apply``
+function.
+
+.. warning::
+
+    ``apply`` has to try to infer from the result whether it should act as a reducer,
+    transformer, *or* filter, depending on exactly what is passed to it. Thus the
+    grouped column(s) may be included in the output or not. While
+    it tries to intelligently guess how to behave, it can sometimes guess wrong.
 
 .. note::
 
-    ``apply`` can act as a reducer, transformer, *or* filter function, depending
-    on exactly what is passed to it. It can depend on the passed function and
-    exactly what you are grouping. Thus the grouped column(s) may be included in
-    the output as well as set the indices.
+    All of the examples in this section can be more reliably, and more efficiently,
+    computed using other pandas functionality.
 
 .. ipython:: python
 
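The warning added in this hunk says ``apply`` infers reducer, transformer, or filter behavior from what the passed function returns. A small sketch of that inference (the frame is illustrative, not from the patch):

```python
import pandas as pd

# Illustrative frame, not from the patch.
df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1.0, 2.0, 4.0]})

# Function returning a scalar per group: apply behaves like a reducer
# (one row per group, group labels in the index).
reduced = df.groupby("A")["B"].apply(lambda s: s.max() - s.min())
print(reduced.to_dict())  # {'x': 1.0, 'y': 0.0}

# Function returning a like-indexed Series: apply behaves like a
# transformer (one row per input row).
transformed = df.groupby("A")["B"].apply(lambda s: s - s.mean())
print(len(transformed))  # 3
```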
@@ -1098,10 +1114,14 @@ that is itself a series, and possibly upcast the result to a DataFrame:
     s
     s.apply(f)
 
+Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
+apply function. If the results from different groups have different dtypes, then
+a common dtype will be determined in the same way as ``DataFrame`` construction.
+
 Control grouped column(s) placement with ``group_keys``
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-.. note::
+.. versionchanged:: 1.5.0
 
     If ``group_keys=True`` is specified when calling :meth:`~DataFrame.groupby`,
     functions passed to ``apply`` that return like-indexed outputs will have the
@@ -1111,8 +1131,6 @@ Control grouped column(s) placement with ``group_keys``
     not be added for like-indexed outputs. In the future this behavior
     will change to always respect ``group_keys``, which defaults to ``True``.
 
-.. versionchanged:: 1.5.0
-
 To control whether the grouped column(s) are included in the indices, you can use
 the argument ``group_keys``. Compare
 
@@ -1126,11 +1144,6 @@ with
 
     df.groupby("A", group_keys=False).apply(lambda x: x)
 
-Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
-apply function. If the results from different groups have different dtypes, then
-a common dtype will be determined in the same way as ``DataFrame`` construction.
-
-
 Numba Accelerated Routines
 --------------------------
 
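The hunks above contrast ``group_keys=True`` with ``group_keys=False`` when calling ``apply``. A minimal sketch of the difference under pandas >= 1.5 semantics (the frame is illustrative, not from the patch):

```python
import pandas as pd

# Illustrative frame, not from the patch.
df = pd.DataFrame({"A": ["a", "a", "b"], "B": [1, 2, 3]})

# group_keys=True prepends the group labels as an extra index level...
with_keys = df.groupby("A", group_keys=True).apply(lambda x: x)

# ...while group_keys=False leaves the original index untouched.
without_keys = df.groupby("A", group_keys=False).apply(lambda x: x)

print(with_keys.index.nlevels, without_keys.index.nlevels)  # 2 1
```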
@@ -1153,8 +1166,8 @@ will be passed into ``values``, and the group index will be passed into ``index``
 Other useful features
 ---------------------
 
-Automatic exclusion of "nuisance" columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Exclusion of "nuisance" columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Again consider the example DataFrame we've been looking at:
 
@@ -1164,8 +1177,8 @@ Again consider the example DataFrame we've been looking at:
 
 Suppose we wish to compute the standard deviation grouped by the ``A``
 column. There is a slight problem, namely that we don't care about the data in
-column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance
-columns by specifying ``numeric_only=True``:
+column ``B`` because it is not numeric. We refer to these non-numeric columns as
+"nuisance" columns. You can avoid nuisance columns by specifying ``numeric_only=True``:
 
 .. ipython:: python
 
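The rewording above ties "nuisance" columns to being non-numeric. A sketch of excluding them with ``numeric_only=True`` (the frame is illustrative, not from the patch):

```python
import pandas as pd

# Illustrative frame with a non-numeric "nuisance" column B.
df = pd.DataFrame(
    {
        "A": ["g1", "g1", "g2"],
        "B": ["one", "two", "three"],  # non-numeric, dropped below
        "C": [1.0, 3.0, 5.0],
    }
)

# numeric_only=True drops B before computing the grouped standard deviation.
result = df.groupby("A").std(numeric_only=True)
print(list(result.columns))  # ['C']
```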
@@ -1178,20 +1191,13 @@ is only interesting over one column (here ``colname``), it may be filtered
 
 .. note::
     Any object column, also if it contains numerical values such as ``Decimal``
-    objects, is considered as a "nuisance" columns. They are excluded from
+    objects, is considered as a "nuisance" column. They are excluded from
     aggregate functions automatically in groupby.
 
     If you do wish to include decimal or object columns in an aggregation with
     other non-nuisance data types, you must do so explicitly.
 
-.. warning::
-    The automatic dropping of nuisance columns has been deprecated and will be removed
-    in a future version of pandas. If columns are included that cannot be operated
-    on, pandas will instead raise an error. In order to avoid this, either select
-    the columns you wish to operate on or specify ``numeric_only=True``.
-
 .. ipython:: python
-   :okwarning:
 
     from decimal import Decimal