From b7ee04919a555af5c89ef4803f9e4dd2f112ccb8 Mon Sep 17 00:00:00 2001 From: Richard Shadrach <45562402+rhshadrach@users.noreply.github.com> Date: Wed, 8 Mar 2023 16:35:56 -0500 Subject: [PATCH] Backport PR #51704: DOC: Improve groupby in the User Guide --- doc/source/user_guide/groupby.rst | 536 ++++++++++++++++----------- doc/source/user_guide/timeseries.rst | 2 +- doc/source/whatsnew/v0.7.0.rst | 2 +- 3 files changed, 330 insertions(+), 210 deletions(-) diff --git a/doc/source/user_guide/groupby.rst b/doc/source/user_guide/groupby.rst index 15baedbac31ba..b0aafbc22562e 100644 --- a/doc/source/user_guide/groupby.rst +++ b/doc/source/user_guide/groupby.rst @@ -478,41 +478,71 @@ Or for an object grouped on multiple columns: Aggregation ----------- -Once the GroupBy object has been created, several methods are available to -perform a computation on the grouped data. These operations are similar to the -:ref:`aggregating API `, :ref:`window API `, -and :ref:`resample API `. - -An obvious one is aggregation via the -:meth:`~pandas.core.groupby.DataFrameGroupBy.aggregate` or equivalently -:meth:`~pandas.core.groupby.DataFrameGroupBy.agg` method: +An aggregation is a GroupBy operation that reduces the dimension of the grouping +object. The result of an aggregation is, or at least is treated as, +a scalar value for each column in a group. For example, producing the sum of each +column in a group of values. .. ipython:: python - grouped = df.groupby("A") - grouped[["C", "D"]].aggregate(np.sum) - - grouped = df.groupby(["A", "B"]) - grouped.aggregate(np.sum) + animals = pd.DataFrame( + { + "kind": ["cat", "dog", "cat", "dog"], + "height": [9.1, 6.0, 9.5, 34.0], + "weight": [7.9, 7.5, 9.9, 198.0], + } + ) + animals + animals.groupby("kind").sum() -As you can see, the result of the aggregation will have the group names as the -new index along the grouped axis. In the case of multiple keys, the result is a -:ref:`MultiIndex ` by default, though this can be -changed by using the ``as_index`` option: +In the result, the keys of the groups appear in the index by default. They can be +instead included in the columns by passing ``as_index=False``. .. ipython:: python - grouped = df.groupby(["A", "B"], as_index=False) - grouped.aggregate(np.sum) + animals.groupby("kind", as_index=False).sum() - df.groupby("A", as_index=False)[["C", "D"]].sum() +.. _groupby.aggregate.builtin: -Note that you could use the ``reset_index`` DataFrame function to achieve the -same result as the column names are stored in the resulting ``MultiIndex``: +Built-in aggregation methods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. ipython:: python +Many common aggregations are built-in to GroupBy objects as methods. Of the methods +listed below, those with a ``*`` do *not* have a Cython-optimized implementation. - df.groupby(["A", "B"]).sum().reset_index() +.. 
csv-table:: + :header: "Method", "Description" + :widths: 20, 80 + :delim: ; + + :meth:`~.DataFrameGroupBy.any`;Compute whether any of the values in the groups are truthy + :meth:`~.DataFrameGroupBy.all`;Compute whether all of the values in the groups are truthy + :meth:`~.DataFrameGroupBy.count`;Compute the number of non-NA values in the groups + :meth:`~.DataFrameGroupBy.cov` * ;Compute the covariance of the groups + :meth:`~.DataFrameGroupBy.first`;Compute the first occurring value in each group + :meth:`~.DataFrameGroupBy.idxmax` *;Compute the index of the maximum value in each group + :meth:`~.DataFrameGroupBy.idxmin` *;Compute the index of the minimum value in each group + :meth:`~.DataFrameGroupBy.last`;Compute the last occurring value in each group + :meth:`~.DataFrameGroupBy.max`;Compute the maximum value in each group + :meth:`~.DataFrameGroupBy.mean`;Compute the mean of each group + :meth:`~.DataFrameGroupBy.median`;Compute the median of each group + :meth:`~.DataFrameGroupBy.min`;Compute the minimum value in each group + :meth:`~.DataFrameGroupBy.nunique`;Compute the number of unique values in each group + :meth:`~.DataFrameGroupBy.prod`;Compute the product of the values in each group + :meth:`~.DataFrameGroupBy.quantile`;Compute a given quantile of the values in each group + :meth:`~.DataFrameGroupBy.sem`;Compute the standard error of the mean of the values in each group + :meth:`~.DataFrameGroupBy.size`;Compute the number of values in each group + :meth:`~.DataFrameGroupBy.skew` *;Compute the skew of the values in each group + :meth:`~.DataFrameGroupBy.std`;Compute the standard deviation of the values in each group + :meth:`~.DataFrameGroupBy.sum`;Compute the sum of the values in each group + :meth:`~.DataFrameGroupBy.var`;Compute the variance of the values in each group + +Some examples: + +.. ipython:: python + + df.groupby("A")[["C", "D"]].max() + df.groupby(["A", "B"]).mean() Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the ``size`` method. It returns a Series whose @@ -520,13 +550,20 @@ index are the group names and whose values are the sizes of each group. .. ipython:: python + grouped = df.groupby(["A", "B"]) grouped.size() +While the :meth:`~.DataFrameGroupBy.describe` method is not itself a reducer, it +can be used to conveniently produce a collection of summary statistics about each of +the groups. + .. ipython:: python grouped.describe() -Another aggregation example is to compute the number of unique values of each group. This is similar to the ``value_counts`` function, except that it only counts unique values. +Another aggregation example is to compute the number of unique values of each group. +This is similar to the ``value_counts`` function, except that it only counts the +number of unique values. .. ipython:: python @@ -538,40 +575,84 @@ Another aggregation example is to compute the number of unique values of each gr .. note:: Aggregation functions **will not** return the groups that you are aggregating over - if they are named *columns*, when ``as_index=True``, the default. The grouped columns will + as named *columns*, when ``as_index=True``, the default. The grouped columns will be the **indices** of the returned object. Passing ``as_index=False`` **will** return the groups that you are aggregating over, if they are - named *columns*. + named **indices** or *columns*. -Aggregating functions are the ones that reduce the dimension of the returned objects. 
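To make the preceding note concrete, the following is a minimal sketch of where the group keys end up under each setting, assuming the ``animals`` DataFrame defined at the start of this section:

.. code-block:: python

    # With the default as_index=True, the group keys become the index
    animals.groupby("kind").sum().index
    # Index(['cat', 'dog'], dtype='object', name='kind')

    # With as_index=False, the keys come back as an ordinary column instead
    animals.groupby("kind", as_index=False).sum().columns
    # Index(['kind', 'height', 'weight'], dtype='object')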
-Some common aggregating functions are tabulated below: -.. csv-table:: - :header: "Function", "Description" - :widths: 20, 80 - :delim: ; +.. _groupby.aggregate.agg: + +The :meth:`~.DataFrameGroupBy.aggregate` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. note:: + The :meth:`~.DataFrameGroupBy.aggregate` method can accept many different types of + inputs. This section details using string aliases for various GroupBy methods; other + inputs are detailed in the sections below. + +Any reduction method that pandas implements can be passed as a string to +:meth:`~.DataFrameGroupBy.aggregate`. Users are encouraged to use the shorthand, +``agg``. It will operate as if the corresponding method was called. + +.. ipython:: python - :meth:`~pd.core.groupby.DataFrameGroupBy.mean`;Compute mean of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.sum`;Compute sum of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.size`;Compute group sizes - :meth:`~pd.core.groupby.DataFrameGroupBy.count`;Compute count of group - :meth:`~pd.core.groupby.DataFrameGroupBy.std`;Standard deviation of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.var`;Compute variance of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.sem`;Standard error of the mean of groups - :meth:`~pd.core.groupby.DataFrameGroupBy.describe`;Generates descriptive statistics - :meth:`~pd.core.groupby.DataFrameGroupBy.first`;Compute first of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.last`;Compute last of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.nth`;Take nth value, or a subset if n is a list - :meth:`~pd.core.groupby.DataFrameGroupBy.min`;Compute min of group values - :meth:`~pd.core.groupby.DataFrameGroupBy.max`;Compute max of group values - - -The aggregating functions above will exclude NA values. Any function which -reduces a :class:`Series` to a scalar value is an aggregation function and will work, -a trivial example is ``df.groupby('A').agg(lambda ser: 1)``. Note that -:meth:`~pd.core.groupby.DataFrameGroupBy.nth` can act as a reducer *or* a -filter, see :ref:`here `. + grouped = df.groupby("A") + grouped[["C", "D"]].aggregate("sum") + + grouped = df.groupby(["A", "B"]) + grouped.agg("sum") + +The result of the aggregation will have the group names as the +new index along the grouped axis. In the case of multiple keys, the result is a +:ref:`MultiIndex ` by default. As mentioned above, this can be +changed by using the ``as_index`` option: + +.. ipython:: python + + grouped = df.groupby(["A", "B"], as_index=False) + grouped.agg("sum") + + df.groupby("A", as_index=False)[["C", "D"]].agg("sum") + +Note that you could use the :meth:`DataFrame.reset_index` DataFrame function to achieve +the same result as the column names are stored in the resulting ``MultiIndex``, although +this will make an extra copy. + +.. ipython:: python + + df.groupby(["A", "B"]).agg("sum").reset_index() + +.. _groupby.aggregate.udf: + +Aggregation with User-Defined Functions +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Users can also provide their own User-Defined Functions (UDFs) for custom aggregations. + +.. warning:: + + When aggregating with a UDF, the UDF should not mutate the + provided ``Series``. See :ref:`gotchas.udf-mutation` for more information. + +.. note:: + + Aggregating with a UDF is often less performant than using + the pandas built-in methods on GroupBy. Consider breaking up a complex operation + into a chain of operations that utilize the built-in methods. + +.. 
ipython:: python
+
+    animals
+    animals.groupby("kind")[["height"]].agg(lambda x: set(x))
+
+The resulting dtype will reflect that of the aggregating function. If the results from different groups have
+different dtypes, then a common dtype will be determined in the same way as ``DataFrame`` construction.
+
+.. ipython:: python
+
+    animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum())

 .. _groupby.aggregate.multifunc:

@@ -584,24 +665,24 @@ aggregation with, outputting a DataFrame:

 .. ipython:: python

     grouped = df.groupby("A")
-    grouped["C"].agg([np.sum, np.mean, np.std])
+    grouped["C"].agg(["sum", "mean", "std"])

 On a grouped ``DataFrame``, you can pass a list of functions to apply to each
 column, which produces an aggregated result with a hierarchical index:

 .. ipython:: python

-    grouped[["C", "D"]].agg([np.sum, np.mean, np.std])
+    grouped[["C", "D"]].agg(["sum", "mean", "std"])

-The resulting aggregations are named for the functions themselves. If you
+The resulting aggregations are named after the functions themselves. If you
 need to rename, then you can add in a chained operation for a ``Series`` like
 this:

 .. ipython:: python

     (
         grouped["C"]
-        .agg([np.sum, np.mean, np.std])
+        .agg(["sum", "mean", "std"])
         .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
     )

@@ -610,24 +691,23 @@ For a grouped ``DataFrame``, you can rename in a similar manner:

 .. ipython:: python

     (
-        grouped[["C", "D"]].agg([np.sum, np.mean, np.std]).rename(
+        grouped[["C", "D"]].agg(["sum", "mean", "std"]).rename(
             columns={"sum": "foo", "mean": "bar", "std": "baz"}
         )
     )

 .. note::

-    In general, the output column names should be unique. You can't apply
-    the same function (or two functions with the same name) to the same
+    In general, the output column names should be unique, but pandas will allow
+    you to apply the same function (or two functions with the same name) to the same
     column.

     .. ipython:: python
-        :okexcept:

         grouped["C"].agg(["sum", "sum"])

-    pandas *does* allow you to provide multiple lambdas. In this case, pandas
+    pandas also allows you to provide multiple lambdas. In this case, pandas
     will mangle the name of the (nameless) lambda functions, appending ``_``
     to each subsequent lambda.

@@ -636,72 +716,58 @@ For a grouped ``DataFrame``, you can rename in a similar manner:

     grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])

-
 .. _groupby.aggregate.named:

 Named aggregation
 ~~~~~~~~~~~~~~~~~

 To support column-specific aggregation *with control over the output column names*, pandas
-accepts the special syntax in :meth:`DataFrameGroupBy.agg` and :meth:`SeriesGroupBy.agg`, known as "named aggregation", where
+accepts the special syntax in :meth:`.DataFrameGroupBy.agg` and :meth:`.SeriesGroupBy.agg`, known as "named aggregation", where

 - The keywords are the *output* column names
 - The values are tuples whose first element is the column to select
   and the second element is the aggregation to apply to that column. pandas
-  provides the ``pandas.NamedAgg`` namedtuple with the fields ``['column', 'aggfunc']``
+  provides the :class:`NamedAgg` namedtuple with the fields ``['column', 'aggfunc']``
   to make it clearer what the arguments are. As usual, the aggregation can
   be a callable or a string alias.

.. 
ipython:: python - animals = pd.DataFrame( - { - "kind": ["cat", "dog", "cat", "dog"], - "height": [9.1, 6.0, 9.5, 34.0], - "weight": [7.9, 7.5, 9.9, 198.0], - } - ) animals animals.groupby("kind").agg( min_height=pd.NamedAgg(column="height", aggfunc="min"), max_height=pd.NamedAgg(column="height", aggfunc="max"), - average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean), + average_weight=pd.NamedAgg(column="weight", aggfunc="mean"), ) -``pandas.NamedAgg`` is just a ``namedtuple``. Plain tuples are allowed as well. +:class:`NamedAgg` is just a ``namedtuple``. Plain tuples are allowed as well. .. ipython:: python animals.groupby("kind").agg( min_height=("height", "min"), max_height=("height", "max"), - average_weight=("weight", np.mean), + average_weight=("weight", "mean"), ) -If your desired output column names are not valid Python keywords, construct a dictionary +If the column names you want are not valid Python keywords, construct a dictionary and unpack the keyword arguments .. ipython:: python animals.groupby("kind").agg( **{ - "total weight": pd.NamedAgg(column="weight", aggfunc=sum) + "total weight": pd.NamedAgg(column="weight", aggfunc="sum") } ) -Additional keyword arguments are not passed through to the aggregation functions. Only pairs +When using named aggregation, additional keyword arguments are not passed through +to the aggregation functions; only pairs of ``(column, aggfunc)`` should be passed as ``**kwargs``. If your aggregation functions -requires additional arguments, partially apply them with :meth:`functools.partial`. - -.. note:: - - For Python 3.5 and earlier, the order of ``**kwargs`` in a functions was not - preserved. This means that the output column ordering would not be - consistent. To ensure consistent ordering, the keys (and so output columns) - will always be sorted for Python 3.5. +require additional arguments, apply them partially with :meth:`functools.partial`. Named aggregation is also valid for Series groupby aggregations. In this case there's no column selection, so the values are just the functions. @@ -721,59 +787,98 @@ columns of a DataFrame: .. ipython:: python - grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)}) + grouped.agg({"C": "sum", "D": lambda x: np.std(x, ddof=1)}) The function names can also be strings. In order for a string to be valid it -must be either implemented on GroupBy or available via :ref:`dispatching -`: +must be implemented on GroupBy: .. ipython:: python grouped.agg({"C": "sum", "D": "std"}) -.. _groupby.aggregate.cython: +.. _groupby.transform: -Cython-optimized aggregation functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Transformation +-------------- -Some common aggregations, currently only ``sum``, ``mean``, ``std``, and ``sem``, have -optimized Cython implementations: +A transformation is a GroupBy operation whose result is indexed the same +as the one being grouped. Common examples include :meth:`~.DataFrameGroupBy.cumsum` and +:meth:`~.DataFrameGroupBy.diff`. .. ipython:: python - df.groupby("A")[["C", "D"]].sum() - df.groupby(["A", "B"]).mean() + speeds + grouped = speeds.groupby("class")["max_speed"] + grouped.cumsum() + grouped.diff() -Of course ``sum`` and ``mean`` are implemented on pandas objects, so the above -code would work even without the special versions via dispatching (see below). +Unlike aggregations, the groupings that are used to split +the original object are not included in the result. -.. _groupby.aggregate.udfs: +.. 
note:: -Aggregations with User-Defined Functions -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + Since transformations do not include the groupings that are used to split the result, + the arguments ``as_index`` and ``sort`` in :meth:`DataFrame.groupby` and + :meth:`Series.groupby` have no effect. -Users can also provide their own functions for custom aggregations. When aggregating -with a User-Defined Function (UDF), the UDF should not mutate the provided ``Series``, see -:ref:`gotchas.udf-mutation` for more information. +A common use of a transformation is to add the result back into the original DataFrame. .. ipython:: python - animals.groupby("kind")[["height"]].agg(lambda x: set(x)) + result = speeds.copy() + result["cumsum"] = grouped.cumsum() + result["diff"] = grouped.diff() + result -The resulting dtype will reflect that of the aggregating function. If the results from different groups have -different dtypes, then a common dtype will be determined in the same way as ``DataFrame`` construction. +Built-in transformation methods +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. ipython:: python +The following methods on GroupBy act as transformations. Of these methods, only +``fillna`` does not have a Cython-optimized implementation. - animals.groupby("kind")[["height"]].agg(lambda x: x.astype(int).sum()) +.. csv-table:: + :header: "Method", "Description" + :widths: 20, 80 + :delim: ; -.. _groupby.transform: + :meth:`~.DataFrameGroupBy.bfill`;Back fill NA values within each group + :meth:`~.DataFrameGroupBy.cumcount`;Compute the cumulative count within each group + :meth:`~.DataFrameGroupBy.cummax`;Compute the cumulative max within each group + :meth:`~.DataFrameGroupBy.cummin`;Compute the cumulative min within each group + :meth:`~.DataFrameGroupBy.cumprod`;Compute the cumulative product within each group + :meth:`~.DataFrameGroupBy.cumsum`;Compute the cumulative sum within each group + :meth:`~.DataFrameGroupBy.diff`;Compute the difference between adjacent values within each group + :meth:`~.DataFrameGroupBy.ffill`;Forward fill NA values within each group + :meth:`~.DataFrameGroupBy.fillna`;Fill NA values within each group + :meth:`~.DataFrameGroupBy.pct_change`;Compute the percent change between adjacent values within each group + :meth:`~.DataFrameGroupBy.rank`;Compute the rank of each value within each group + :meth:`~.DataFrameGroupBy.shift`;Shift values up or down within each group -Transformation --------------- +In addition, passing any built-in aggregation method as a string to +:meth:`~.DataFrameGroupBy.transform` (see the next section) will broadcast the result +across the group, producing a transformed result. If the aggregation method is +Cython-optimized, this will be performant as well. + +.. _groupby.transformation.transform: -The ``transform`` method returns an object that is indexed the same -as the one being grouped. The transform function must: +The :meth:`~.DataFrameGroupBy.transform` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Similar to the :ref:`aggregation method `, the +:meth:`~.DataFrameGroupBy.transform` method can accept string aliases to the built-in +transformation methods in the previous section. It can *also* accept string aliases to +the built-in aggregation methods. When an aggregation method is provided, the result +will be broadcast across the group. + +.. 
ipython:: python
+
+    speeds
+    grouped = speeds.groupby("class")[["max_speed"]]
+    grouped.transform("cumsum")
+    grouped.transform("sum")
+
+In addition to string aliases, the :meth:`~.DataFrameGroupBy.transform` method can
+also accept User-Defined Functions (UDFs). The UDF must:

 * Return a result that is either the same size as the group chunk or
   broadcastable to the size of the group chunk (e.g., a scalar,
   grouped.transform(lambda x: x.iloc[-1])).
 * Operate column-by-column on the group chunk. The transform is applied to
   the first group chunk using chunk.apply.
 * Not perform in-place operations on the group chunk. Group chunks should
   be treated as immutable, and changes to a group chunk may produce unexpected
-  results.
-* (Optionally) operates on the entire group chunk. If this is supported, a
-  fast path is used starting from the *second* chunk.
+  results. See :ref:`gotchas.udf-mutation` for more information.
+* (Optionally) operates on all columns of the entire group chunk at once. If this is
+  supported, a fast path is used starting from the *second* chunk.
+
+.. note::
+
+    Transforming by supplying ``transform`` with a UDF is
+    often less performant than using the built-in methods on GroupBy.
+    Consider breaking up a complex operation into a chain of operations that utilize
+    the built-in methods.
+
+    All of the examples in this section can be made more performant by calling
+    built-in methods instead of using ``transform``.
+    See :ref:`below for examples <groupby_efficient_transforms>`.

 .. versionchanged:: 2.0.0

     When using ``.transform`` on a grouped DataFrame and the transformation function
     returns a DataFrame, pandas now aligns the result's index
-    with the input's index. You can call ``.to_numpy()`` on the
-    result of the transformation function to avoid alignment.
+    with the input's index. You can call ``.to_numpy()`` within the transformation
+    function to avoid alignment.

-Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the
+Similar to :ref:`groupby.aggregate.agg`, the resulting dtype will reflect that of the
 transformation function. If the results from different groups have different dtypes, then
 a common dtype will be determined in the same way as ``DataFrame`` construction.

-Suppose we wished to standardize the data within each group:
+Suppose we wish to standardize the data within each group:

 .. ipython:: python

@@ -844,15 +960,6 @@ match the shape of the input array.

     ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

-Alternatively, the built-in methods could be used to produce the same outputs.
-
-.. ipython:: python
-
-    max_ts = ts.groupby(lambda x: x.year).transform("max")
-    min_ts = ts.groupby(lambda x: x.year).transform("min")
-
-    max_ts - min_ts
-
 Another common data transform is to replace missing data with the group mean.

 .. ipython:: python

@@ -879,7 +986,7 @@ Another common data transform is to replace missing data with the group mean.

     transformed = grouped.transform(lambda x: x.fillna(x.mean()))

-We can verify that the group means have not changed in the transformed data
+We can verify that the group means have not changed in the transformed data,
 and that the transformed data contains no NAs.

 .. ipython:: python

@@ -893,18 +1000,28 @@ and that the transformed data contains no NAs.

     grouped_trans.count()  # counts after transformation
     grouped_trans.size()  # Verify non-NA count equals group size

-.. note::
+.. _groupby_efficient_transforms:

-    Some functions will automatically transform the input when applied to a
-    GroupBy object, but returning an object of the same shape as the original.
-    Passing ``as_index=False`` will not affect these transformation methods.
+As mentioned in the note above, each of the examples in this section can be computed
+more efficiently using built-in methods. In the code below, the inefficient way
+using a UDF is commented out and the faster alternative appears below.

-    For example: ``fillna, ffill, bfill, shift.``.
+.. ipython:: python

-    .. ipython:: python
+    # ts.groupby(lambda x: x.year).transform(
+    #     lambda x: (x - x.mean()) / x.std()
+    # )
+    grouped = ts.groupby(lambda x: x.year)
+    result = (ts - grouped.transform("mean")) / grouped.transform("std")

-        grouped.ffill()
+    # ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())
+    grouped = ts.groupby(lambda x: x.year)
+    result = grouped.transform("max") - grouped.transform("min")

+    # grouped = data_df.groupby(key)
+    # grouped.transform(lambda x: x.fillna(x.mean()))
+    grouped = data_df.groupby(key)
+    result = data_df.fillna(grouped.transform("mean"))

 .. _groupby.transform.window_resample:

@@ -915,7 +1032,7 @@ It is possible to use ``resample()``, ``expanding()`` and ``rolling()`` as
 methods on groupbys.

 The example below will apply the ``rolling()`` method on the samples of
-the column B based on the groups of column A.
+the column B, based on the groups of column A.

 .. ipython:: python

@@ -935,7 +1052,7 @@ group.

 Suppose you want to use the ``resample()`` method to get a daily
-frequency in each group of your dataframe and wish to complete the
+frequency in each group of your dataframe, and wish to complete the
 missing values with the ``ffill()`` method.

 .. ipython:: python

@@ -956,109 +1073,111 @@ missing values with the ``ffill()`` method.

 Filtration
 ----------

-The ``filter`` method returns a subset of the original object. Suppose we
-want to take only elements that belong to groups with a group sum greater
-than 2.
+A filtration is a GroupBy operation that subsets the original grouping object. It
+may either filter out entire groups, part of groups, or both. Filtrations return
+a filtered version of the calling object, including the grouping columns when provided.
+In the following example, ``class`` is included in the result.

 .. ipython:: python

-    sf = pd.Series([1, 1, 2, 3, 3, 3])
-    sf.groupby(sf).filter(lambda x: x.sum() > 2)
-
-The argument of ``filter`` must be a function that, applied to the group as a
-whole, returns ``True`` or ``False``.
+    speeds
+    speeds.groupby("class").nth(1)

-Another useful operation is filtering out elements that belong to groups
-with only a couple members.
+.. note::

-.. ipython:: python
+    Unlike aggregations, filtrations do not add the group keys to the index of the
+    result. Because of this, passing ``as_index=False`` or ``sort=True`` will not
+    affect these methods.

-    dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
-    dff.groupby("B").filter(lambda x: len(x) > 2)
-
-Alternatively, instead of dropping the offending groups, we can return a
-like-indexed objects where the groups that do not pass the filter are filled
-with NaNs.
+Filtrations will respect subsetting the columns of the GroupBy object.

 .. ipython:: python

-    dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False)
+    speeds.groupby("class")[["order", "max_speed"]].nth(1)

-For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion.
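In addition to selecting a single position, ``nth`` accepts a list of positions, which gives a compact way to keep, for example, the first and last row of every group. A minimal sketch, assuming the ``speeds`` DataFrame used in the examples above:

.. code-block:: python

    # Keep the first and last row of each class; as with other filtrations,
    # the rows keep their original index in the result.
    speeds.groupby("class").nth([0, -1])

    # The same idea combined with column subsetting
    speeds.groupby("class")[["order", "max_speed"]].nth([0, -1])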
+Built-in filtrations +~~~~~~~~~~~~~~~~~~~~ -.. ipython:: python +The following methods on GroupBy act as filtrations. All these methods have a +Cython-optimized implementation. - dff["C"] = np.arange(8) - dff.groupby("B").filter(lambda x: len(x["C"]) > 2) +.. csv-table:: + :header: "Method", "Description" + :widths: 20, 80 + :delim: ; -.. note:: + :meth:`~.DataFrameGroupBy.head`;Select the top row(s) of each group + :meth:`~.DataFrameGroupBy.nth`;Select the nth row(s) of each group + :meth:`~.DataFrameGroupBy.tail`;Select the bottom row(s) of each group - Some functions when applied to a groupby object will act as a **filter** on the input, returning - a reduced shape of the original (and potentially eliminating groups), but with the index unchanged. - Passing ``as_index=False`` will not affect these transformation methods. +Users can also use transformations along with Boolean indexing to construct complex +filtrations within groups. For example, suppose we are given groups of products and +their volumes, and we wish to subset the data to only the largest products capturing no +more than 90% of the total volume within each group. - For example: ``head, tail``. +.. ipython:: python - .. ipython:: python + product_volumes = pd.DataFrame( + { + "group": list("xxxxyyy"), + "product": list("abcdefg"), + "volume": [10, 30, 20, 15, 40, 10, 20], + } + ) + product_volumes - dff.groupby("B").head(2) + # Sort by volume to select the largest products first + product_volumes = product_volumes.sort_values("volume", ascending=False) + grouped = product_volumes.groupby("group")["volume"] + cumpct = grouped.cumsum() / grouped.transform("sum") + cumpct + significant_products = product_volumes[cumpct <= 0.9] + significant_products.sort_values(["group", "product"]) +The :class:`~DataFrameGroupBy.filter` method +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -.. _groupby.dispatch: +.. note:: -Dispatching to instance methods -------------------------------- + Filtering by supplying ``filter`` with a User-Defined Function (UDF) is + often less performant than using the built-in methods on GroupBy. + Consider breaking up a complex operation into a chain of operations that utilize + the built-in methods. -When doing an aggregation or transformation, you might just want to call an -instance method on each data group. This is pretty easy to do by passing lambda -functions: +The ``filter`` method takes a User-Defined Function (UDF) that, when applied to +an entire group, returns either ``True`` or ``False``. The result of the ``filter`` +method is then the subset of groups for which the UDF returned ``True``. + +Suppose we want to take only elements that belong to groups with a group sum greater +than 2. .. ipython:: python - :okwarning: - grouped = df.groupby("A")[["C", "D"]] - grouped.agg(lambda x: x.std()) + sf = pd.Series([1, 1, 2, 3, 3, 3]) + sf.groupby(sf).filter(lambda x: x.sum() > 2) -But, it's rather verbose and can be untidy if you need to pass additional -arguments. Using a bit of metaprogramming cleverness, GroupBy now has the -ability to "dispatch" method calls to the groups: +Another useful operation is filtering out elements that belong to groups +with only a couple members. .. ipython:: python - :okwarning: - grouped.std() + dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")}) + dff.groupby("B").filter(lambda x: len(x) > 2) -What is actually happening here is that a function wrapper is being -generated. 
When invoked, it takes any passed arguments and invokes the function -with any arguments on each group (in the above example, the ``std`` -function). The results are then combined together much in the style of ``agg`` -and ``transform`` (it actually uses ``apply`` to infer the gluing, documented -next). This enables some operations to be carried out rather succinctly: +Alternatively, instead of dropping the offending groups, we can return a +like-indexed objects where the groups that do not pass the filter are filled +with NaNs. .. ipython:: python - tsdf = pd.DataFrame( - np.random.randn(1000, 3), - index=pd.date_range("1/1/2000", periods=1000), - columns=["A", "B", "C"], - ) - tsdf.iloc[::2] = np.nan - grouped = tsdf.groupby(lambda x: x.year) - grouped.fillna(method="pad") - -In this example, we chopped the collection of time series into yearly chunks -then independently called :ref:`fillna ` on the -groups. + dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False) -The ``nlargest`` and ``nsmallest`` methods work on ``Series`` style groupbys: +For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion. .. ipython:: python - s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3]) - g = pd.Series(list("abababab")) - gb = s.groupby(g) - gb.nlargest(3) - gb.nsmallest(3) + dff["C"] = np.arange(8) + dff.groupby("B").filter(lambda x: len(x["C"]) > 2) .. _groupby.apply: @@ -1114,7 +1233,7 @@ that is itself a series, and possibly upcast the result to a DataFrame: s s.apply(f) -Similar to :ref:`groupby.aggregate.udfs`, the resulting dtype will reflect that of the +Similar to :ref:`groupby.aggregate.agg`, the resulting dtype will reflect that of the apply function. If the results from different groups have different dtypes, then a common dtype will be determined in the same way as ``DataFrame`` construction. @@ -1144,6 +1263,7 @@ with df.groupby("A", group_keys=False).apply(lambda x: x) + Numba Accelerated Routines -------------------------- diff --git a/doc/source/user_guide/timeseries.rst b/doc/source/user_guide/timeseries.rst index a675e30823c89..4cd98c89e7180 100644 --- a/doc/source/user_guide/timeseries.rst +++ b/doc/source/user_guide/timeseries.rst @@ -1618,7 +1618,7 @@ The ``resample`` function is very flexible and allows you to specify many different parameters to control the frequency conversion and resampling operation. -Any function available via :ref:`dispatching ` is available as +Any built-in method available via :ref:`GroupBy ` is available as a method of the returned object, including ``sum``, ``mean``, ``std``, ``sem``, ``max``, ``min``, ``median``, ``first``, ``last``, ``ohlc``: diff --git a/doc/source/whatsnew/v0.7.0.rst b/doc/source/whatsnew/v0.7.0.rst index 1ee6a9899a655..2336ccaeac820 100644 --- a/doc/source/whatsnew/v0.7.0.rst +++ b/doc/source/whatsnew/v0.7.0.rst @@ -346,7 +346,7 @@ Other API changes Performance improvements ~~~~~~~~~~~~~~~~~~~~~~~~ -- :ref:`Cythonized GroupBy aggregations ` no longer +- :ref:`Cythonized GroupBy aggregations ` no longer presort the data, thus achieving a significant speedup (:issue:`93`). GroupBy aggregations with Python functions significantly sped up by clever manipulation of the ndarray data type in Cython (:issue:`496`).
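To illustrate the gap between Cython-optimized aggregations and plain Python functions mentioned here (and throughout the user guide above), the following is a small self-contained sketch; the ``df`` used below is hypothetical rather than one of the documentation's examples:

.. code-block:: python

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {
            "key": np.random.choice(list("abc"), 1000),
            "value": np.random.randn(1000),
        }
    )

    # Cython-optimized built-in aggregation
    fast = df.groupby("key")["value"].sum()

    # The same reduction written as a Python UDF; the result matches
    # (up to floating-point rounding) but is typically much slower.
    slow = df.groupby("key")["value"].agg(lambda x: x.sum())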