@@ -36,9 +36,22 @@ following:
* Discard data that belongs to groups with only a few members.
* Filter out data based on the group sum or mean.

-* Some combination of the above: GroupBy will examine the results of the apply
-  step and try to return a sensibly combined result if it doesn't fit into
-  either of the above two categories.
+Many of these operations are defined on GroupBy objects. These operations are similar
+to the :ref:`aggregating API <basics.aggregate>`, :ref:`window API <window.overview>`,
+and :ref:`resample API <timeseries.aggregate>`.
+
+It is possible that a given operation does not fall into one of these categories or
+is some combination of them. In such a case, it may be possible to compute the
+operation using GroupBy's ``apply`` method. This method will examine the results of the
+apply step and try to return a sensibly combined result if it doesn't fit into either
+of the above two categories.
+
+.. note::
+
+   An operation that is split into multiple steps using built-in GroupBy operations
+   will be more efficient than using the ``apply`` method with a user-defined Python
+   function.
+

Since the set of object instance methods on pandas data structures are generally
rich and expressive, we often simply want to invoke, say, a DataFrame function
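To make the efficiency note above concrete, here is a small sketch (the frame and
column names are made up for illustration, not taken from this page): the same
group-wise sum computed once through the built-in ``sum`` reduction, which runs on
optimized code paths, and once through ``apply`` with a user-defined Python function,
which is called once per group.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "value": [1, 2, 3]})

# Built-in GroupBy reduction: fast, runs in optimized pandas internals.
built_in = df.groupby("key")["value"].sum()

# Same result via ``apply`` with a Python lambda: more flexible, but the
# function is invoked group by group, so it is slower on large data.
via_apply = df.groupby("key")["value"].apply(lambda s: s.sum())
```

Both produce a Series indexed by the group labels; prefer the built-in form when one exists.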
@@ -68,7 +81,7 @@ object (more on what the GroupBy object is later), you may do the following:
.. ipython:: python

-   df = pd.DataFrame(
+   speeds = pd.DataFrame(
        [
            ("bird", "Falconiformes", 389.0),
            ("bird", "Psittaciformes", 24.0),
@@ -79,12 +92,12 @@ object (more on what the GroupBy object is later), you may do the following:
        index=["falcon", "parrot", "lion", "monkey", "leopard"],
        columns=("class", "order", "max_speed"),
    )
-   df
+   speeds

    # default is axis=0
-   grouped = df.groupby("class")
-   grouped = df.groupby("order", axis="columns")
-   grouped = df.groupby(["class", "order"])
+   grouped = speeds.groupby("class")
+   grouped = speeds.groupby("order", axis="columns")
+   grouped = speeds.groupby(["class", "order"])

The mapping can be specified many different ways:
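The renamed ``speeds`` calls can be run as a standalone snippet. The hunks above omit
the middle rows of the frame, so the mammal rows below are an assumption modelled on
the animal-speed example in the pandas documentation; only the two bird rows, the
index, and the column names appear in this diff.

```python
import pandas as pd

# Sketch of the ``speeds`` frame; the three mammal rows are assumed values,
# not part of this diff's context.
speeds = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", None),
        ("mammal", "Carnivora", 58.0),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)

# Group by a single column, or by several at once.
by_class = speeds.groupby("class")
by_class_order = speeds.groupby(["class", "order"])
```

Each ``groupby`` call returns a lazy GroupBy object; no splitting happens until an operation is applied.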
@@ -1052,18 +1065,21 @@ The ``nlargest`` and ``nsmallest`` methods work on ``Series`` style groupbys:
Flexible ``apply``
------------------

-Some operations on the grouped data might not fit into either the aggregate or
-transform categories. Or, you may simply want GroupBy to infer how to combine
-the results. For these, use the ``apply`` function, which can be substituted
-for both ``aggregate`` and ``transform`` in many standard use cases. However,
-``apply`` can handle some exceptional use cases.
+Some operations on the grouped data might not fit into the aggregation,
+transformation, or filtration categories. For these, you can use the ``apply``
+function.
+
+.. warning::
+
+   ``apply`` has to try to infer from the result whether it should act as a reducer,
+   transformer, *or* filter, depending on exactly what is passed to it. Thus the
+   grouped column(s) may be included in the output or not. While
+   it tries to intelligently guess how to behave, it can sometimes guess wrong.

.. note::

-   ``apply`` can act as a reducer, transformer, *or* filter function, depending
-   on exactly what is passed to it. It can depend on the passed function and
-   exactly what you are grouping. Thus the grouped column(s) may be included in
-   the output as well as set the indices.
+   All of the examples in this section can be more reliably, and more efficiently,
+   computed using other pandas functionality.

.. ipython:: python
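The inference the warning describes can be sketched with a toy frame (names here are
hypothetical): the same ``apply`` call acts as a reducer when the function returns a
scalar, and as a transformer when it returns a like-indexed Series.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})
g = df.groupby("A")["B"]

# Scalar result per group -> apply behaves like a reducer (one row per group).
reduced = g.apply(lambda s: s.max() - s.min())

# Like-indexed Series per group -> apply behaves like a transformer
# (one row per original row).
transformed = g.apply(lambda s: s - s.mean())
```

The shape of the output, and whether the group labels land in the index, depends entirely on what the function returns, which is why the built-in ``agg``/``transform``/``filter`` methods are the more predictable choice.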
@@ -1098,10 +1114,14 @@ that is itself a series, and possibly upcast the result to a DataFrame:
    s
    s.apply(f)

+Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
+apply function. If the results from different groups have different dtypes, then
+a common dtype will be determined in the same way as ``DataFrame`` construction.
+

Control grouped column(s) placement with ``group_keys``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-.. note::
+.. versionchanged:: 1.5.0

    If ``group_keys=True`` is specified when calling :meth:`~DataFrame.groupby`,
    functions passed to ``apply`` that return like-indexed outputs will have the
@@ -1111,8 +1131,6 @@ Control grouped column(s) placement with ``group_keys``
    not be added for like-indexed outputs. In the future this behavior
    will change to always respect ``group_keys``, which defaults to ``True``.

-.. versionchanged:: 1.5.0
-

To control whether the grouped column(s) are included in the indices, you can use
the argument ``group_keys``. Compare
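The effect of ``group_keys`` on an identity ``apply`` can be sketched as follows
(a minimal example with a hypothetical frame): with ``group_keys=True`` the group
labels are prepended as an extra index level, while ``group_keys=False`` leaves the
original index untouched.

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 2, 3]})

# group_keys=True: result index gains a level holding the group labels.
with_keys = df.groupby("A", group_keys=True).apply(lambda grp: grp)

# group_keys=False: result keeps the original, single-level index.
without_keys = df.groupby("A", group_keys=False).apply(lambda grp: grp)
```

Compare ``with_keys.index`` (a MultiIndex of group label and original label) against ``without_keys.index`` (the original RangeIndex).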
@@ -1126,11 +1144,6 @@ with
    df.groupby("A", group_keys=False).apply(lambda x: x)

-Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
-apply function. If the results from different groups have different dtypes, then
-a common dtype will be determined in the same way as ``DataFrame`` construction.
-
-
Numba Accelerated Routines
--------------------------
@@ -1153,8 +1166,8 @@ will be passed into ``values``, and the group index will be passed into ``index`
Other useful features
---------------------

-Automatic exclusion of "nuisance" columns
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Exclusion of "nuisance" columns
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Again consider the example DataFrame we've been looking at:
@@ -1164,8 +1177,8 @@ Again consider the example DataFrame we've been looking at:
Suppose we wish to compute the standard deviation grouped by the ``A``
column. There is a slight problem, namely that we don't care about the data in
-column ``B``. We refer to this as a "nuisance" column. You can avoid nuisance
-columns by specifying ``numeric_only=True``:
+column ``B`` because it is not numeric. We refer to these non-numeric columns as
+"nuisance" columns. You can avoid nuisance columns by specifying ``numeric_only=True``:

.. ipython:: python
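A minimal sketch of the ``numeric_only`` behavior (the frame here is made up, not
the document's example): the string column is dropped from the reduction rather than
causing an error.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["x", "x", "y"],
        "B": ["one", "two", "three"],  # non-numeric "nuisance" column
        "C": [1.0, 2.0, 3.0],
    }
)

# numeric_only=True restricts the reduction to numeric columns, so the
# string column "B" is excluded instead of raising a TypeError.
result = df.groupby("A").std(numeric_only=True)
```

The result keeps only column ``C``; selecting the columns of interest before grouping (``df.groupby("A")[["C"]]``) is the equivalent explicit alternative.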
@@ -1178,20 +1191,13 @@ is only interesting over one column (here ``colname``), it may be filtered
.. note::
    Any object column, also if it contains numerical values such as ``Decimal``
-   objects, is considered as a "nuisance" columns. They are excluded from
+   objects, is considered as a "nuisance" column. They are excluded from
    aggregate functions automatically in groupby.

    If you do wish to include decimal or object columns in an aggregation with
    other non-nuisance data types, you must do so explicitly.

-.. warning::
-    The automatic dropping of nuisance columns has been deprecated and will be removed
-    in a future version of pandas. If columns are included that cannot be operated
-    on, pandas will instead raise an error. In order to avoid this, either select
-    the columns you wish to operate on or specify ``numeric_only=True``.
-
.. ipython:: python
-   :okwarning:

    from decimal import Decimal