Commit c73c1c8

DOC: Improve Group by split-apply-combine guide (#51916)
1 parent d8d1a47 commit c73c1c8

1 file changed

+45
-45
lines changed

doc/source/user_guide/groupby.rst

+45-45
Original file line number | Diff line number | Diff line change
@@ -31,20 +31,20 @@ following:
3131
* Filling NAs within groups with a value derived from each group.
3232

3333
* **Filtration**: discard some groups, according to a group-wise computation
34-
that evaluates True or False. Some examples:
34+
that evaluates to True or False. Some examples:
3535

36-
* Discard data that belongs to groups with only a few members.
36+
* Discard data that belong to groups with only a few members.
3737
* Filter out data based on the group sum or mean.
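
As a rough illustration of the two filtration examples above, the sketch below uses ``GroupBy.filter`` on a small made-up DataFrame. The column names ``g`` and ``v`` and both thresholds are assumptions chosen only for this example.

```python
import pandas as pd

# Hypothetical data: group "a" has two members, group "b" has one.
df = pd.DataFrame({"g": ["a", "a", "b"], "v": [1, 1, 5]})

# Discard data that belong to groups with fewer than two members.
big_groups = df.groupby("g").filter(lambda grp: len(grp) >= 2)

# Filter out data based on the group sum: keep groups whose sum of ``v`` exceeds 2.
high_sum = df.groupby("g").filter(lambda grp: grp["v"].sum() > 2)

print(big_groups)
print(high_sum)
```

In both calls the function is evaluated once per group and must return True or False; rows of groups for which it returns False are dropped from the result.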
3838

3939
Many of these operations are defined on GroupBy objects. These operations are similar
40-
to the :ref:`aggregating API <basics.aggregate>`, :ref:`window API <window.overview>`,
41-
and :ref:`resample API <timeseries.aggregate>`.
40+
to those of the :ref:`aggregating API <basics.aggregate>`,
41+
:ref:`window API <window.overview>`, and :ref:`resample API <timeseries.aggregate>`.
4242

4343
It is possible that a given operation does not fall into one of these categories or
4444
is some combination of them. In such a case, it may be possible to compute the
4545
operation using GroupBy's ``apply`` method. This method will examine the results of the
46-
apply step and try to return a sensibly combined result if it doesn't fit into either
47-
of the above two categories.
46+
apply step and try to sensibly combine them into a single result if it doesn't fit into any
47+
of the above three categories.
4848

4949
.. note::
5050

@@ -53,7 +53,7 @@ of the above two categories.
5353
function.
5454

5555

56-
Since the set of object instance methods on pandas data structures are generally
56+
Since the set of object instance methods on pandas data structures is generally
5757
rich and expressive, we often simply want to invoke, say, a DataFrame function
5858
on each group. The name GroupBy should be quite familiar to those who have used
5959
a SQL-based tool (or ``itertools``), in which you can write code like:
@@ -75,9 +75,9 @@ See the :ref:`cookbook<cookbook.grouping>` for some advanced strategies.
7575
Splitting an object into groups
7676
-------------------------------
7777

78-
pandas objects can be split on any of their axes. The abstract definition of
79-
grouping is to provide a mapping of labels to group names. To create a GroupBy
80-
object (more on what the GroupBy object is later), you may do the following:
78+
The abstract definition of grouping is to provide a mapping of labels to
79+
group names. To create a GroupBy object (more on what the GroupBy object is
80+
later), you may do the following:
8181

8282
.. ipython:: python
8383
@@ -99,12 +99,11 @@ object (more on what the GroupBy object is later), you may do the following:
9999
100100
The mapping can be specified many different ways:
101101

102-
* A Python function, to be called on each of the axis labels.
102+
* A Python function, to be called on each of the index labels.
103103
* A list or NumPy array of the same length as the index.
104104
* A dict or ``Series``, providing a ``label -> group name`` mapping.
105105
* For ``DataFrame`` objects, a string indicating either a column name or
106106
an index level name to be used to group.
107-
* ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
108107
* A list of any of the above things.
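
The ways of specifying the mapping listed above can be sketched side by side. The DataFrame, its index labels, and the group names below are made up for illustration and do not come from the guide.

```python
import numpy as np
import pandas as pd

# Hypothetical data with string index labels.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]},
    index=["a1", "a2", "b1", "b2"],
)

# A Python function, called on each of the index labels.
by_function = df.groupby(lambda label: label[0])["C"].sum()

# A NumPy array of the same length as the index.
by_array = df.groupby(np.array(["x", "y", "x", "y"]))["C"].sum()

# A dict providing a label -> group name mapping.
mapping = {"a1": "first", "a2": "first", "b1": "second", "b2": "second"}
by_dict = df.groupby(mapping)["C"].sum()

# A string indicating a column name.
by_column = df.groupby("A")["C"].sum()

for result in (by_function, by_array, by_dict, by_column):
    print(result, end="\n\n")
```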
109108

110109
Collectively we refer to the grouping objects as the **keys**. For example,
@@ -136,16 +135,20 @@ We could naturally group by either the ``A`` or ``B`` columns, or both:
136135
grouped = df.groupby("A")
137136
grouped = df.groupby(["A", "B"])
138137
138+
.. note::
139+
140+
``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
141+
139142
If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all
140-
but the specified columns
143+
the columns except the one we specify:
141144

142145
.. ipython:: python
143146
144147
df2 = df.set_index(["A", "B"])
145148
grouped = df2.groupby(level=df2.index.names.difference(["B"]))
146149
grouped.sum()
147150
148-
These will split the DataFrame on its index (rows). To split by columns, first do
151+
The above GroupBy will split the DataFrame on its index (rows). To split by columns, first do
149152
a transpose:
150153

151154
.. ipython::
@@ -184,8 +187,8 @@ only verifies that you've passed a valid mapping.
184187
.. note::
185188

186189
Many kinds of complicated data manipulations can be expressed in terms of
187-
GroupBy operations (though can't be guaranteed to be the most
188-
efficient). You can get quite creative with the label mapping functions.
190+
GroupBy operations (though it can't be guaranteed to be the most efficient implementation).
191+
You can get quite creative with the label mapping functions.
189192

190193
.. _groupby.sorting:
191194

@@ -245,8 +248,8 @@ The default setting of ``dropna`` argument is ``True`` which means ``NA`` are no
245248
GroupBy object attributes
246249
~~~~~~~~~~~~~~~~~~~~~~~~~
247250

248-
The ``groups`` attribute is a dict whose keys are the computed unique groups
249-
and corresponding values being the axis labels belonging to each group. In the
251+
The ``groups`` attribute is a dictionary whose keys are the computed unique groups
252+
and corresponding values are the axis labels belonging to each group. In the
250253
above example we have:
251254

252255
.. ipython:: python
@@ -358,9 +361,10 @@ More on the ``sum`` function and aggregation later.
358361

359362
Grouping DataFrame with Index levels and columns
360363
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
361-
A DataFrame may be grouped by a combination of columns and index levels by
362-
specifying the column names as strings and the index levels as ``pd.Grouper``
363-
objects.
364+
A DataFrame may be grouped by a combination of columns and index levels. You
365+
can specify both column and index names, or use a :class:`Grouper`.
366+
367+
Let's first create a DataFrame with a MultiIndex:
364368

365369
.. ipython:: python
366370
@@ -375,8 +379,7 @@ objects.
375379
376380
df
377381
378-
The following example groups ``df`` by the ``second`` index level and
379-
the ``A`` column.
382+
Then we group ``df`` by the ``second`` index level and the ``A`` column.
380383

381384
.. ipython:: python
382385
@@ -398,8 +401,8 @@ DataFrame column selection in GroupBy
398401
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
399402

400403
Once you have created the GroupBy object from a DataFrame, you might want to do
401-
something different for each of the columns. Thus, using ``[]`` similar to
402-
getting a column from a DataFrame, you can do:
404+
something different for each of the columns. Thus, by using ``[]`` on the GroupBy
405+
object in a similar way as the one used to get a column from a DataFrame, you can do:
403406

404407
.. ipython:: python
405408
@@ -418,13 +421,13 @@ getting a column from a DataFrame, you can do:
418421
grouped_C = grouped["C"]
419422
grouped_D = grouped["D"]
420423
421-
This is mainly syntactic sugar for the alternative and much more verbose:
424+
This is mainly syntactic sugar for the alternative, which is much more verbose:
422425

423426
.. ipython:: python
424427
425428
df["C"].groupby(df["A"])
426429
427-
Additionally this method avoids recomputing the internal grouping information
430+
Additionally, this method avoids recomputing the internal grouping information
428431
derived from the passed key.
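
A minimal sketch of this equivalence, assuming a small made-up DataFrame whose column names mirror the ones used in this section:

```python
import pandas as pd

# Hypothetical data; only the column names follow the guide.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "C": [1.0, 2.0, 3.0, 4.0],
        "D": [10.0, 20.0, 30.0, 40.0],
    }
)

grouped = df.groupby("A")

# Column selection on the GroupBy object reuses the grouping built above.
short_form = grouped["C"].sum()

# The more verbose alternative builds the grouping again from ``df["A"]``.
verbose_form = df["C"].groupby(df["A"]).sum()

same_result = short_form.equals(verbose_form)
print(same_result)
```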
429432

430433
.. _groupby.iterating-label:
@@ -1218,7 +1221,7 @@ The dimension of the returned result can also change:
12181221
12191222
grouped.apply(f)
12201223
1221-
``apply`` on a Series can operate on a returned value from the applied function,
1224+
``apply`` on a Series can operate on a returned value from the applied function
12221225
that is itself a series, and possibly upcast the result to a DataFrame:
12231226

12241227
.. ipython:: python
@@ -1303,18 +1306,10 @@ column ``B`` because it is not numeric. We refer to these non-numeric columns as
13031306
df.groupby("A").std(numeric_only=True)
13041307
13051308
Note that ``df.groupby('A').colname.std()`` is more efficient than
1306-
``df.groupby('A').std().colname``, so if the result of an aggregation function
1307-
is only interesting over one column (here ``colname``), it may be filtered
1309+
``df.groupby('A').std().colname``. So if the result of an aggregation function
1310+
is only needed over one column (here ``colname``), it may be filtered
13081311
*before* applying the aggregation function.
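
The two orderings can be compared on a small example. The data below are made up, and ``colname`` and ``other`` are placeholder column names; no timing is shown, only that both spellings give the same values.

```python
import pandas as pd

# Hypothetical data with one column of interest and one extra column.
df = pd.DataFrame(
    {
        "A": ["x", "x", "y", "y"],
        "colname": [1.0, 3.0, 2.0, 6.0],
        "other": [5.0, 7.0, 9.0, 13.0],
    }
)

# Select first, then aggregate: only ``colname`` is reduced.
select_then_agg = df.groupby("A")["colname"].std()

# Aggregate every column, then select: ``other`` is reduced and then discarded.
agg_then_select = df.groupby("A").std()["colname"]

print(select_then_agg)
print(agg_then_select)
```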
13091312

1310-
.. note::
1311-
Any object column, also if it contains numerical values such as ``Decimal``
1312-
objects, is considered as a "nuisance" column. They are excluded from
1313-
aggregate functions automatically in groupby.
1314-
1315-
If you do wish to include decimal or object columns in an aggregation with
1316-
other non-nuisance data types, you must do so explicitly.
1317-
13181313
.. ipython:: python
13191314
13201315
from decimal import Decimal
@@ -1573,9 +1568,9 @@ order they are first observed.
15731568
Plotting
15741569
~~~~~~~~
15751570

1576-
Groupby also works with some plotting methods. For example, suppose we
1577-
suspect that some features in a DataFrame may differ by group, in this case,
1578-
the values in column 1 where the group is "B" are 3 higher on average.
1571+
Groupby also works with some plotting methods. In this case, suppose we
1572+
suspect that the values in column 1 are 3 higher on average in group "B".
1573+
15791574

15801575
.. ipython:: python
15811576
@@ -1657,7 +1652,7 @@ arbitrary function, for example:
16571652
16581653
df.groupby(["Store", "Product"]).pipe(mean)
16591654
1660-
where ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
1655+
Here ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
16611656
columns respectively for each Store-Product combination. The ``mean`` function can
16621657
be any function that takes in a GroupBy object; the ``.pipe`` will pass the GroupBy
16631658
object as a parameter into the function you specify.
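
A self-contained sketch of this pattern follows. The data are made up, and ``mean`` here is a hypothetical helper written for the example, not a pandas function.

```python
import pandas as pd

# Hypothetical sales data using the column names from the text.
df = pd.DataFrame(
    {
        "Store": ["S1", "S1", "S2", "S2"],
        "Product": ["P1", "P1", "P1", "P2"],
        "Revenue": [10.0, 30.0, 20.0, 40.0],
        "Quantity": [1, 3, 2, 4],
    }
)


def mean(groupby):
    """Hypothetical helper: take a GroupBy object and return the group means."""
    return groupby.mean()


# ``.pipe`` passes the GroupBy object as the first argument to ``mean``.
piped = df.groupby(["Store", "Product"]).pipe(mean)

# Calling the helper directly gives the same result.
direct = mean(df.groupby(["Store", "Product"]))

print(piped)
```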
@@ -1709,11 +1704,16 @@ Groupby by indexer to 'resample' data
17091704

17101705
Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
17111706

1712-
In order to resample to work on indices that are non-datetimelike, the following procedure can be utilized.
1707+
In order for resample to work on indices that are non-datetimelike, the following procedure can be utilized.
17131708

17141709
In the following examples, **df.index // 5** returns an array of integer bin labels (one per row), which is used to determine which rows are grouped together for the groupby operation.
17151710

1716-
.. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples.
1711+
.. note::
1712+
1713+
The example below shows how we can downsample by consolidation of samples into fewer ones.
1714+
Here, by using **df.index // 5**, we are aggregating the samples in bins. By applying the **std()**
1715+
function, we aggregate the information contained in many samples into a small subset of values,
1716+
namely their standard deviations, thereby reducing the number of samples.
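
To make the binning step concrete, here is a minimal sketch on a made-up ten-row DataFrame with a default integer index. The column name ``value`` is an assumption for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten rows with the default index 0..9.
df = pd.DataFrame({"value": np.arange(10, dtype=float)})

# Integer division assigns rows 0-4 to bin 0 and rows 5-9 to bin 1.
bins = df.index // 5

# One standard deviation per bin: ten samples are reduced to two values.
downsampled = df.groupby(bins).std()

print(list(bins))
print(downsampled)
```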
17171717

17181718
.. ipython:: python
17191719
@@ -1727,7 +1727,7 @@ Returning a Series to propagate names
17271727

17281728
Group DataFrame columns, compute a set of metrics and return a named Series.
17291729
The Series name is used as the name for the column index. This is especially
1730-
useful in conjunction with reshaping operations such as stacking in which the
1730+
useful in conjunction with reshaping operations such as stacking, in which the
17311731
column index name will be used as the name of the inserted column:
17321732

17331733
.. ipython:: python
