DOC: Improve Group by split-apply-combine guide (#51916)

DeaMariaLeon · web-flow · commit c73c1c8cc34f · 2023-03-18T10:36:45.000-04:00
diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst
@@ -31,20 +31,20 @@ following:
     * Filling NAs within groups with a value derived from each group.
 
 * **Filtration**: discard some groups, according to a group-wise computation
-  that evaluates True or False. Some examples:
+  that evaluates to True or False. Some examples:
 
-    * Discard data that belongs to groups with only a few members.
+    * Discard data that belong to groups with only a few members.
     * Filter out data based on the group sum or mean.
 
 Many of these operations are defined on GroupBy objects. These operations are similar
-to the :ref:`aggregating API <basics.aggregate>`, :ref:`window API <window.overview>`,
-and :ref:`resample API <timeseries.aggregate>`.
+to those of the :ref:`aggregating API <basics.aggregate>`,
+:ref:`window API <window.overview>`, and :ref:`resample API <timeseries.aggregate>`.
 
 It is possible that a given operation does not fall into one of these categories or
 is some combination of them. In such a case, it may be possible to compute the
 operation using GroupBy's ``apply`` method. This method will examine the results of the
-apply step and try to return a sensibly combined result if it doesn't fit into either
-of the above two categories.
+apply step and try to sensibly combine them into a single result if it doesn't fit into either
+of the above three categories.
 
 .. note::
 
@@ -53,7 +53,7 @@ of the above two categories.
    function.
 
 
-Since the set of object instance methods on pandas data structures are generally
+Since the set of object instance methods on pandas data structures is generally
 rich and expressive, we often simply want to invoke, say, a DataFrame function
 on each group. The name GroupBy should be quite familiar to those who have used
 a SQL-based tool (or ``itertools``), in which you can write code like:
@@ -75,9 +75,9 @@ See the :ref:`cookbook<cookbook.grouping>` for some advanced strategies.
 Splitting an object into groups
 -------------------------------
 
-pandas objects can be split on any of their axes. The abstract definition of
-grouping is to provide a mapping of labels to group names. To create a GroupBy
-object (more on what the GroupBy object is later), you may do the following:
+The abstract definition of grouping is to provide a mapping of labels to
+group names. To create a GroupBy object (more on what the GroupBy object is
+later), you may do the following:
 
 .. ipython:: python
 
@@ -99,12 +99,11 @@ object (more on what the GroupBy object is later), you may do the following:
 
 The mapping can be specified many different ways:
 
-* A Python function, to be called on each of the axis labels.
+* A Python function, to be called on each of the index labels.
 * A list or NumPy array of the same length as the index.
 * A dict or ``Series``, providing a ``label -> group name`` mapping.
 * For ``DataFrame`` objects, a string indicating either a column name or
   an index level name to be used to group.
-* ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
 * A list of any of the above things.
 
 Collectively we refer to the grouping objects as the **keys**. For example,
@@ -136,16 +135,20 @@ We could naturally group by either the ``A`` or ``B`` columns, or both:
    grouped = df.groupby("A")
    grouped = df.groupby(["A", "B"])
 
+.. note::
+
+   ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``.
+
 If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all
-but the specified columns
+the columns except the one we specify:
 
 .. ipython:: python
 
    df2 = df.set_index(["A", "B"])
    grouped = df2.groupby(level=df2.index.names.difference(["B"]))
    grouped.sum()
 
-These will split the DataFrame on its index (rows). To split by columns, first do
+The above GroupBy will split the DataFrame on its index (rows). To split by columns, first do
 a tranpose:
 
 .. ipython::
@@ -184,8 +187,8 @@ only verifies that you've passed a valid mapping.
 .. note::
 
    Many kinds of complicated data manipulations can be expressed in terms of
-   GroupBy operations (though can't be guaranteed to be the most
-   efficient). You can get quite creative with the label mapping functions.
+   GroupBy operations (though it can't be guaranteed to be the most efficient implementation).
+   You can get quite creative with the label mapping functions.
 
 .. _groupby.sorting:
 
@@ -245,8 +248,8 @@ The default setting of ``dropna`` argument is ``True`` which means ``NA`` are no
 GroupBy object attributes
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The ``groups`` attribute is a dict whose keys are the computed unique groups
-and corresponding values being the axis labels belonging to each group. In the
+The ``groups`` attribute is a dictionary whose keys are the computed unique groups
+and corresponding values are the axis labels belonging to each group. In the
 above example we have:
 
 .. ipython:: python
@@ -358,9 +361,10 @@ More on the ``sum`` function and aggregation later.
 
 Grouping DataFrame with Index levels and columns
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-A DataFrame may be grouped by a combination of columns and index levels by
-specifying the column names as strings and the index levels as ``pd.Grouper``
-objects.
+A DataFrame may be grouped by a combination of columns and index levels. You
+can specify both column and index names, or use a :class:`Grouper`.
+
+Let's first create a DataFrame with a MultiIndex:
 
 .. ipython:: python
 
@@ -375,8 +379,7 @@ objects.
 
    df
 
-The following example groups ``df`` by the ``second`` index level and
-the ``A`` column.
+Then we group ``df`` by the ``second`` index level and the ``A`` column.
 
 .. ipython:: python
 
@@ -398,8 +401,8 @@ DataFrame column selection in GroupBy
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 Once you have created the GroupBy object from a DataFrame, you might want to do
-something different for each of the columns. Thus, using ``[]`` similar to
-getting a column from a DataFrame, you can do:
+something different for each of the columns. Thus, by using ``[]`` on the GroupBy
+object in a similar way as the one used to get a column from a DataFrame, you can do:
 
 .. ipython:: python
 
@@ -418,13 +421,13 @@ getting a column from a DataFrame, you can do:
    grouped_C = grouped["C"]
    grouped_D = grouped["D"]
 
-This is mainly syntactic sugar for the alternative and much more verbose:
+This is mainly syntactic sugar for the alternative, which is much more verbose:
 
 .. ipython:: python
 
    df["C"].groupby(df["A"])
 
-Additionally this method avoids recomputing the internal grouping information
+Additionally, this method avoids recomputing the internal grouping information
 derived from the passed key.
 
 .. _groupby.iterating-label:
@@ -1218,7 +1221,7 @@ The dimension of the returned result can also change:
 
     grouped.apply(f)
 
-``apply`` on a Series can operate on a returned value from the applied function,
+``apply`` on a Series can operate on a returned value from the applied function
 that is itself a series, and possibly upcast the result to a DataFrame:
 
 .. ipython:: python
@@ -1303,18 +1306,10 @@ column ``B`` because it is not numeric. We refer to these non-numeric columns as
    df.groupby("A").std(numeric_only=True)
 
 Note that ``df.groupby('A').colname.std().`` is more efficient than
-``df.groupby('A').std().colname``, so if the result of an aggregation function
-is only interesting over one column (here ``colname``), it may be filtered
+``df.groupby('A').std().colname``. So if the result of an aggregation function
+is only needed over one column (here ``colname``), it may be filtered
 *before* applying the aggregation function.
 
-.. note::
-   Any object column, also if it contains numerical values such as ``Decimal``
-   objects, is considered as a "nuisance" column. They are excluded from
-   aggregate functions automatically in groupby.
-
-   If you do wish to include decimal or object columns in an aggregation with
-   other non-nuisance data types, you must do so explicitly.
-
 .. ipython:: python
 
     from decimal import Decimal
@@ -1573,9 +1568,9 @@ order they are first observed.
 Plotting
 ~~~~~~~~
 
-Groupby also works with some plotting methods.  For example, suppose we
-suspect that some features in a DataFrame may differ by group, in this case,
-the values in column 1 where the group is "B" are 3 higher on average.
+Groupby also works with some plotting methods.  In this case, suppose we
+suspect that the values in column 1 are 3 times higher on average in group "B".
+
 
 .. ipython:: python
 
@@ -1657,7 +1652,7 @@ arbitrary function, for example:
 
    df.groupby(["Store", "Product"]).pipe(mean)
 
-where ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
+Here ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity
 columns respectively for each Store-Product combination. The ``mean`` function can
 be any function that takes in a GroupBy object; the ``.pipe`` will pass the GroupBy
 object as a parameter into the function you specify.
@@ -1709,11 +1704,16 @@ Groupby by indexer to 'resample' data
 
 Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
 
-In order to resample to work on indices that are non-datetimelike, the following procedure can be utilized.
+In order for resample to work on indices that are non-datetimelike, the following procedure can be utilized.
 
 In the following examples, **df.index // 5** returns a binary array which is used to determine what gets selected for the groupby operation.
 
-.. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples.
+.. note::
+
+   The example below shows how we can downsample by consolidation of samples into fewer ones.
+   Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()**
+   function, we aggregate the information contained in many samples into a small subset of values
+   which is their standard deviation thereby reducing the number of samples.
 
 .. ipython:: python
 
@@ -1727,7 +1727,7 @@ Returning a Series to propagate names
 
 Group DataFrame columns, compute a set of metrics and return a named Series.
 The Series name is used as the name for the column index. This is especially
-useful in conjunction with reshaping operations such as stacking in which the
+useful in conjunction with reshaping operations such as stacking, in which the
 column index name will be used as the name of the inserted column:
 
 .. ipython:: python