diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index 56e62ba20e030..4ec7bfeed93c1 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -31,20 +31,20 @@ following: * Filling NAs within groups with a value derived from each group. * **Filtration**: discard some groups, according to a group-wise computation - that evaluates True or False. Some examples: + that evaluates to True or False. Some examples: - * Discard data that belongs to groups with only a few members. + * Discard data that belong to groups with only a few members. * Filter out data based on the group sum or mean. Many of these operations are defined on GroupBy objects. These operations are similar -to the :ref:`aggregating API `, :ref:`window API `, -and :ref:`resample API `. +to those of the :ref:`aggregating API `, +:ref:`window API `, and :ref:`resample API `. It is possible that a given operation does not fall into one of these categories or is some combination of them. In such a case, it may be possible to compute the operation using GroupBy's ``apply`` method. This method will examine the results of the -apply step and try to return a sensibly combined result if it doesn't fit into either -of the above two categories. +apply step and try to sensibly combine them into a single result if it doesn't fit into any +of the above three categories. .. note:: @@ -53,7 +53,7 @@ of the above two categories. function. -Since the set of object instance methods on pandas data structures are generally +Since the set of object instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who have used a SQL-based tool (or ``itertools``), in which you can write code like: @@ -75,9 +75,9 @@ See the :ref:`cookbook` for some advanced strategies. 
Splitting an object into groups ------------------------------- -pandas objects can be split on any of their axes. The abstract definition of -grouping is to provide a mapping of labels to group names. To create a GroupBy -object (more on what the GroupBy object is later), you may do the following: +The abstract definition of grouping is to provide a mapping of labels to +group names. To create a GroupBy object (more on what the GroupBy object is +later), you may do the following: .. ipython:: python @@ -99,12 +99,11 @@ object (more on what the GroupBy object is later), you may do the following: The mapping can be specified many different ways: -* A Python function, to be called on each of the axis labels. +* A Python function, to be called on each of the index labels. * A list or NumPy array of the same length as the index. * A dict or ``Series``, providing a ``label -> group name`` mapping. * For ``DataFrame`` objects, a string indicating either a column name or an index level name to be used to group. -* ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``. * A list of any of the above things. Collectively we refer to the grouping objects as the **keys**. For example, @@ -136,8 +135,12 @@ We could naturally group by either the ``A`` or ``B`` columns, or both: grouped = df.groupby("A") grouped = df.groupby(["A", "B"]) +.. note:: + + ``df.groupby('A')`` is just syntactic sugar for ``df.groupby(df['A'])``. + If we also have a MultiIndex on columns ``A`` and ``B``, we can group by all -but the specified columns +the columns except the one we specify: .. ipython:: python @@ -145,7 +148,7 @@ but the specified columns grouped = df2.groupby(level=df2.index.names.difference(["B"])) grouped.sum() -These will split the DataFrame on its index (rows). To split by columns, first do +The above GroupBy will split the DataFrame on its index (rows). To split by columns, first do a transpose: .. 
ipython:: @@ -184,8 +187,8 @@ only verifies that you've passed a valid mapping. .. note:: Many kinds of complicated data manipulations can be expressed in terms of - GroupBy operations (though can't be guaranteed to be the most - efficient). You can get quite creative with the label mapping functions. + GroupBy operations (though it can't be guaranteed to be the most efficient implementation). + You can get quite creative with the label mapping functions. .. _groupby.sorting: @@ -245,8 +248,8 @@ The default setting of ``dropna`` argument is ``True`` which means ``NA`` are no GroupBy object attributes ~~~~~~~~~~~~~~~~~~~~~~~~~ -The ``groups`` attribute is a dict whose keys are the computed unique groups -and corresponding values being the axis labels belonging to each group. In the +The ``groups`` attribute is a dictionary whose keys are the computed unique groups +and corresponding values are the axis labels belonging to each group. In the above example we have: .. ipython:: python @@ -358,9 +361,10 @@ More on the ``sum`` function and aggregation later. Grouping DataFrame with Index levels and columns ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A DataFrame may be grouped by a combination of columns and index levels by -specifying the column names as strings and the index levels as ``pd.Grouper`` -objects. +A DataFrame may be grouped by a combination of columns and index levels. You +can specify both column and index names, or use a :class:`Grouper`. + +Let's first create a DataFrame with a MultiIndex: .. ipython:: python @@ -375,8 +379,7 @@ objects. df -The following example groups ``df`` by the ``second`` index level and -the ``A`` column. +Then we group ``df`` by the ``second`` index level and the ``A`` column. .. ipython:: python @@ -398,8 +401,8 @@ DataFrame column selection in GroupBy ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Once you have created the GroupBy object from a DataFrame, you might want to do -something different for each of the columns. 
Thus, using ``[]`` similar to -getting a column from a DataFrame, you can do: +something different for each of the columns. Thus, by using ``[]`` on the GroupBy +object in the same way you would to get a column from a DataFrame, you can do: .. ipython:: python @@ -418,13 +421,13 @@ getting a column from a DataFrame, you can do: grouped_C = grouped["C"] grouped_D = grouped["D"] -This is mainly syntactic sugar for the alternative and much more verbose: +This is mainly syntactic sugar for the alternative, which is much more verbose: .. ipython:: python df["C"].groupby(df["A"]) -Additionally this method avoids recomputing the internal grouping information +Additionally, this method avoids recomputing the internal grouping information derived from the passed key. .. _groupby.iterating-label: @@ -1218,7 +1221,7 @@ The dimension of the returned result can also change: grouped.apply(f) -``apply`` on a Series can operate on a returned value from the applied function, +``apply`` on a Series can operate on a returned value from the applied function that is itself a Series, and possibly upcast the result to a DataFrame: .. ipython:: python @@ -1303,18 +1306,10 @@ column ``B`` because it is not numeric. We refer to these non-numeric columns as df.groupby("A").std(numeric_only=True) Note that ``df.groupby('A').colname.std()`` is more efficient than -``df.groupby('A').std().colname``, so if the result of an aggregation function -is only interesting over one column (here ``colname``), it may be filtered +``df.groupby('A').std().colname``. So if the result of an aggregation function +is only needed over one column (here ``colname``), it may be filtered *before* applying the aggregation function. -.. note:: - Any object column, also if it contains numerical values such as ``Decimal`` - objects, is considered as a "nuisance" column. They are excluded from - aggregate functions automatically in groupby. 
- - If you do wish to include decimal or object columns in an aggregation with - other non-nuisance data types, you must do so explicitly. - .. ipython:: python from decimal import Decimal @@ -1573,9 +1568,9 @@ order they are first observed. Plotting ~~~~~~~~ -Groupby also works with some plotting methods. For example, suppose we -suspect that some features in a DataFrame may differ by group, in this case, -the values in column 1 where the group is "B" are 3 higher on average. +Groupby also works with some plotting methods. In this case, suppose we +suspect that the values in column 1 are 3 higher on average in group "B". + .. ipython:: python @@ -1657,7 +1652,7 @@ arbitrary function, for example: df.groupby(["Store", "Product"]).pipe(mean) -where ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity +Here ``mean`` takes a GroupBy object and finds the mean of the Revenue and Quantity columns respectively for each Store-Product combination. The ``mean`` function can be any function that takes in a GroupBy object; the ``.pipe`` will pass the GroupBy object as a parameter into the function you specify. @@ -1709,11 +1704,16 @@ Groupby by indexer to 'resample' data Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples. -In order to resample to work on indices that are non-datetimelike, the following procedure can be utilized. +In order for resample to work on indices that are non-datetimelike, the following procedure can be utilized. In the following examples, **df.index // 5** returns an integer array which is used to determine what gets selected for the groupby operation. -.. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. 
By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples. +.. note:: + + The example below shows how we can downsample by consolidating samples into fewer ones. + Here, by using **df.index // 5**, we are aggregating the samples in bins. By applying the **std()** + function, we aggregate the information contained in many samples into a small subset of values, + which is their standard deviation, thereby reducing the number of samples. .. ipython:: python @@ -1727,7 +1727,7 @@ Returning a Series to propagate names Group DataFrame columns, compute a set of metrics and return a named Series. The Series name is used as the name for the column index. This is especially -useful in conjunction with reshaping operations such as stacking in which the +useful in conjunction with reshaping operations such as stacking, in which the column index name will be used as the name of the inserted column: .. ipython:: python