DOC Trying to improve Group by split-apply-combine guide #51916

Merged: 4 commits, Mar 18, 2023
90 changes: 45 additions & 45 deletions doc/source/user_guide/groupby.rst
@@ -31,20 +31,20 @@ following:
* Filling NAs within groups with a value derived from each group.

* **Filtration**: discard some groups, according to a group-wise computation
that evaluates True or False. Some examples:
that evaluates to True or False. Some examples:

* Discard data that belongs to groups with only a few members.
* Discard data that belong to groups with only a few members.
* Filter out data based on the group sum or mean.
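As a hedged illustration of the filtration category (the small frame ``df_f`` and its columns are invented for this sketch, not taken from the guide), dropping groups with fewer than three members could look like:

.. ipython:: python
import pandas as pd
df_f = pd.DataFrame({"A": ["x", "x", "x", "y"], "B": [1, 2, 3, 4]})
# keep only the rows whose group has at least three members
df_f.groupby("A").filter(lambda g: len(g) >= 3)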

Many of these operations are defined on GroupBy objects. These operations are similar
to the :ref:`aggregating API <basics.aggregate>`, :ref:`window API <window.overview>`,
and :ref:`resample API <timeseries.aggregate>`.
to those of the :ref:`aggregating API <basics.aggregate>`,
:ref:`window API <window.overview>`, and :ref:`resample API <timeseries.aggregate>`.

It is possible that a given operation does not fall into one of these categories or
is some combination of them. In such a case, it may be possible to compute the
operation using GroupBy's ``apply`` method. This method will examine the results of the
apply step and try to return a sensibly combined result if it doesn't fit into either
of the above two categories.
apply step and try to sensibly combine them into a single result if it doesn't fit into any
of the above three categories.
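Where such a mixed operation is needed, a minimal hedged sketch of ``apply`` combining per-group results (``df_demo`` is an invented toy frame) might be:

.. ipython:: python
import pandas as pd
df_demo = pd.DataFrame({"A": ["a", "a", "a", "b"], "B": [1, 2, 3, 30]})
# the applied function returns the two largest values per group;
# apply combines the per-group results into a single Series
df_demo.groupby("A")["B"].apply(lambda s: s.nlargest(2))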

.. note::

@@ -53,7 +53,7 @@ of the above two categories.
function.


Since the set of object instance methods on pandas data structures are generally
Since the set of object instance methods on pandas data structures is generally
rich and expressive, we often simply want to invoke, say, a DataFrame function
on each group. The name GroupBy should be quite familiar to those who have used
a SQL-based tool (or ``itertools``), in which you can write code like:
@@ -75,9 +75,9 @@ See the :ref:`cookbook<cookbook.grouping>` for some advanced strategies.
Splitting an object into groups
-------------------------------

pandas objects can be split on any of their axes. The abstract definition of
grouping is to provide a mapping of labels to group names. To create a GroupBy
object (more on what the GroupBy object is later), you may do the following:
The abstract definition of grouping is to provide a mapping of labels to
group names. To create a GroupBy object (more on what the GroupBy object is
later), you may do the following:

.. ipython:: python
@@ -99,12 +99,11 @@ object (more on what the GroupBy object is later), you may do the following:
The mapping can be specified many different ways:

* A Python function, to be called on each of the axis labels.
* A Python function, to be called on each of the index labels.
* A list or NumPy array of the same length as the index.
* A dict or ``Series``, providing a ``label -> group name`` mapping.
* For ``DataFrame`` objects, a string indicating either a column name or
an index level name to be used to group.
* ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
* A list of any of the above things.
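As a hedged illustration of several of these key types (``df_keys`` below is invented for this sketch and is not the guide's ``df``):

.. ipython:: python
import pandas as pd
df_keys = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "B": [1, 2, 3]}, index=["ant", "bee", "cat"]
)
# a column name
df_keys.groupby("A")["B"].sum()
# a function called on each index label
df_keys.groupby(lambda label: label[0])["B"].sum()
# a dict mapping index labels to group names
df_keys.groupby({"ant": "small", "bee": "small", "cat": "big"})["B"].sum()
# a list combining a column name and an array of the same length
df_keys.groupby(["A", [0, 1, 0]])["B"].sum()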

Collectively we refer to the grouping objects as the **keys**. For example,
@@ -136,16 +135,20 @@ We could naturally group by either the ``A`` or ``B`` columns, or both:
grouped = df.groupby("A")
grouped = df.groupby(["A", "B"])
.. note::

``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
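As a quick hedged check of that equivalence (assuming ``df`` is the guide's frame with an ``A`` key column and a numeric ``C`` column), both spellings produce the same grouping:

.. ipython:: python
df.groupby("A")["C"].sum()
df.groupby(df["A"])["C"].sum()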

If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all
but the specified columns
the columns except the one we specify:

.. ipython:: python
df2 = df.set_index(["A", "B"])
grouped = df2.groupby(level=df2.index.names.difference(["B"]))
grouped.sum()
These will split the DataFrame on its index (rows). To split by columns, first do
The above GroupBy will split the DataFrame on its index (rows). To split by columns, first do
a transpose:

.. ipython::
@@ -184,8 +187,8 @@ only verifies that you've passed a valid mapping.
.. note::

Many kinds of complicated data manipulations can be expressed in terms of
GroupBy operations (though can't be guaranteed to be the most
efficient). You can get quite creative with the label mapping functions.
GroupBy operations (though they are not guaranteed to be the most efficient implementation).
You can get quite creative with the label mapping functions.
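As a hedged sketch of such a label mapping function (the frame ``df_lab`` and the weekday/weekend rule are invented for illustration):

.. ipython:: python
import pandas as pd
df_lab = pd.DataFrame(
    {"value": range(6)}, index=pd.date_range("2023-01-01", periods=6)
)
# group by a function of the index labels: weekday versus weekend
df_lab.groupby(lambda ts: "weekend" if ts.weekday() >= 5 else "weekday").sum()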

.. _groupby.sorting:

@@ -245,8 +248,8 @@ The default setting of ``dropna`` argument is ``True`` which means ``NA`` are no
GroupBy object attributes
~~~~~~~~~~~~~~~~~~~~~~~~~

The ``groups`` attribute is a dict whose keys are the computed unique groups
and corresponding values being the axis labels belonging to each group. In the
The ``groups`` attribute is a dictionary whose keys are the computed unique groups
and whose values are the axis labels belonging to each group. In the
above example we have:

.. ipython:: python
@@ -358,9 +361,10 @@ More on the ``sum`` function and aggregation later.

Grouping DataFrame with Index levels and columns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A DataFrame may be grouped by a combination of columns and index levels by
specifying the column names as strings and the index levels as ``pd.Grouper``
objects.
A DataFrame may be grouped by a combination of columns and index levels. You
can specify both column and index names, or use a :class:`Grouper`.

Let's first create a DataFrame with a MultiIndex:

.. ipython:: python
@@ -375,8 +379,7 @@ objects.
df
The following example groups ``df`` by the ``second`` index level and
the ``A`` column.
Then we group ``df`` by the ``second`` index level and the ``A`` column.

.. ipython:: python
@@ -398,8 +401,8 @@ DataFrame column selection in GroupBy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have created the GroupBy object from a DataFrame, you might want to do
something different for each of the columns. Thus, using ``[]`` similar to
getting a column from a DataFrame, you can do:
something different for each of the columns. Thus, by using ``[]`` on the GroupBy
object, similar to how you would get a column from a DataFrame, you can do:

.. ipython:: python
@@ -418,13 +421,13 @@ getting a column from a DataFrame, you can do:
grouped_C = grouped["C"]
grouped_D = grouped["D"]
This is mainly syntactic sugar for the alternative and much more verbose:
This is mainly syntactic sugar for the alternative, which is much more verbose:

.. ipython:: python
df["C"].groupby(df["A"])
Additionally this method avoids recomputing the internal grouping information
Additionally, this method avoids recomputing the internal grouping information
derived from the passed key.
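For instance, assuming ``df`` is the guide's frame with key column ``A`` and numeric columns ``C`` and ``D``, a hedged sketch of reusing one GroupBy for several column selections:

.. ipython:: python
grouped = df.groupby("A")
# both selections reuse the grouping information computed once for ``grouped``
grouped["C"].sum()
grouped["D"].mean()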

.. _groupby.iterating-label:
@@ -1218,7 +1221,7 @@ The dimension of the returned result can also change:
grouped.apply(f)
``apply`` on a Series can operate on a returned value from the applied function,
``apply`` on a Series can operate on a returned value from the applied function
that is itself a series, and possibly upcast the result to a DataFrame:

.. ipython:: python
@@ -1303,18 +1306,10 @@ column ``B`` because it is not numeric. We refer to these non-numeric columns as
df.groupby("A").std(numeric_only=True)
Note that ``df.groupby('A').colname.std().`` is more efficient than
``df.groupby('A').std().colname``, so if the result of an aggregation function
is only interesting over one column (here ``colname``), it may be filtered
``df.groupby('A').std().colname``. So if the result of an aggregation function
is only needed over one column (here ``colname``), it may be filtered
*before* applying the aggregation function.
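A hedged sketch of the two spellings, with the guide's numeric column ``C`` standing in for ``colname`` (assuming ``df`` is the frame used above):

.. ipython:: python
# aggregate only the column of interest ...
df.groupby("A")["C"].std()
# ... rather than aggregating every column and then selecting
df.groupby("A").std(numeric_only=True)["C"]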

.. note::
Any object column, also if it contains numerical values such as ``Decimal``
objects, is considered as a "nuisance" column. They are excluded from
aggregate functions automatically in groupby.

If you do wish to include decimal or object columns in an aggregation with
other non-nuisance data types, you must do so explicitly.

.. ipython:: python
from decimal import Decimal
@@ -1573,9 +1568,9 @@ order they are first observed.
Plotting
~~~~~~~~

Groupby also works with some plotting methods. For example, suppose we
suspect that some features in a DataFrame may differ by group, in this case,
the values in column 1 where the group is "B" are 3 higher on average.
Groupby also works with some plotting methods. In this case, suppose we
suspect that the values in column 1 are, on average, 3 higher in group "B".


.. ipython:: python
@@ -1657,7 +1652,7 @@ arbitrary function, for example:
df.groupby(["Store", "Product"]).pipe(mean)
where ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
Here ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
columns respectively for each Store-Product combination. The ``mean`` function can
be any function that takes in a GroupBy object; the ``.pipe`` will pass the GroupBy
object as a parameter into the function you specify.
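One hedged way to write such a helper (the name ``mean`` mirrors the snippet above, but this body is an assumption rather than necessarily the guide's exact definition):

.. ipython:: python
def mean(groupby):
    # any callable that accepts a GroupBy object works with ``.pipe``
    return groupby.mean()

df.groupby(["Store", "Product"]).pipe(mean)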
@@ -1709,11 +1704,16 @@ Groupby by indexer to 'resample' data

Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.

In order to resample to work on indices that are non-datetimelike, the following procedure can be utilized.
In order for resample to work on indices that are non-datetimelike, the following procedure can be utilized.

In the following examples, **df.index // 5** returns an integer array which is used to determine which rows belong to which group in the groupby operation.

.. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples.
.. note::

The example below shows how we can downsample by consolidating samples into fewer ones.
Here, by using **df.index // 5**, we are aggregating the samples in bins. By applying the **std()**
function, we aggregate the information contained in many samples into a small subset of values,
namely their standard deviation, thereby reducing the number of samples.
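A hedged, self-contained sketch of this downsampling pattern (the 10-row frame ``df_rs`` is invented; the guide's own example may differ):

.. ipython:: python
import numpy as np
import pandas as pd
df_rs = pd.DataFrame(np.random.randn(10, 2), columns=["a", "b"])
# df_rs.index // 5 maps rows 0-4 to bin 0 and rows 5-9 to bin 1
df_rs.groupby(df_rs.index // 5).std()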

.. ipython:: python
Expand All @@ -1727,7 +1727,7 @@ Returning a Series to propagate names

Group DataFrame columns, compute a set of metrics and return a named Series.
The Series name is used as the name for the column index. This is especially
useful in conjunction with reshaping operations such as stacking in which the
useful in conjunction with reshaping operations such as stacking, in which the
column index name will be used as the name of the inserted column:

.. ipython:: python