Commit 8c7ddbe

dsm054 authored and TomAugspurger committed
ENH: add .ngroup() method to groupby objects (#14026) (#14026)
(cherry picked from commit 72e0d1f)
1 parent 627d2c6 commit 8c7ddbe

8 files changed: +338 -63 lines changed

doc/source/api.rst

+1
@@ -1707,6 +1707,7 @@ Computations / Descriptive Stats
    GroupBy.mean
    GroupBy.median
    GroupBy.min
+   GroupBy.ngroup
    GroupBy.nth
    GroupBy.ohlc
    GroupBy.prod

doc/source/groupby.rst

+57-6
@@ -1122,12 +1122,36 @@ To see the order in which each row appears within its group, use the
 
 .. ipython:: python
 
-   df = pd.DataFrame(list('aaabba'), columns=['A'])
-   df
+   dfg = pd.DataFrame(list('aaabba'), columns=['A'])
+   dfg
+
+   dfg.groupby('A').cumcount()
+
+   dfg.groupby('A').cumcount(ascending=False)
+
+.. _groupby.ngroup:
+
+Enumerate groups
+~~~~~~~~~~~~~~~~
+
+.. versionadded:: 0.20.2
+
+To see the ordering of the groups (as opposed to the order of rows
+within a group given by ``cumcount``) you can use the ``ngroup``
+method.
+
+Note that the numbers given to the groups match the order in which the
+groups would be seen when iterating over the groupby object, not the
+order they are first observed.
+
+.. ipython:: python
 
-   df.groupby('A').cumcount()
+   dfg = pd.DataFrame(list('aaabba'), columns=['A'])
+   dfg
 
-   df.groupby('A').cumcount(ascending=False) # kwarg only
+   dfg.groupby('A').ngroup()
+
+   dfg.groupby('A').ngroup(ascending=False)
 
 Plotting
 ~~~~~~~~
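
Note on the "Enumerate groups" text added in the hunk above: the difference between ``cumcount`` (row position inside a group) and ``ngroup`` (number of the group itself) is easier to see when the first key observed is not the first key in sorted order. The sketch below uses a hypothetical frame that is not part of the commit, and it relies on the existing ``sort=False`` argument of ``groupby`` to show first-observed numbering. Treat it as an illustration, not as documentation text from the commit.

```python
import pandas as pd

# Hypothetical data: 'b' is observed first, but sorts after 'a'.
dfg = pd.DataFrame({"A": list("bbaab")})

# Position of each row within its own group.
within_group = dfg.groupby("A").cumcount()

# Number of the group each row belongs to, following iteration
# order of the groupby object (sorted keys by default).
group_number = dfg.groupby("A").ngroup()

# With sort=False the groups are iterated, and therefore numbered,
# in order of first appearance instead.
first_seen_number = dfg.groupby("A", sort=False).ngroup()

print(within_group.tolist())       # [0, 1, 0, 1, 2]
print(group_number.tolist())       # [1, 1, 0, 0, 1]
print(first_seen_number.tolist())  # [0, 0, 1, 1, 0]
```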
@@ -1176,14 +1200,41 @@ Regroup columns of a DataFrame according to their sum, and sum the aggregated on
    df
    df.groupby(df.sum(), axis=1).sum()
 
+.. _groupby.multicolumn_factorization:
+
+Multi-column factorization
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By using ``.ngroup()``, we can extract information about the groups in
+a way similar to :func:`factorize` (as described further in the
+:ref:`reshaping API <reshaping.factorize>`) but which applies
+naturally to multiple columns of mixed type and different
+sources. This can be useful as an intermediate categorical-like step
+in processing, when the relationships between the group rows are more
+important than their content, or as input to an algorithm which only
+accepts the integer encoding. (For more information about support in
+pandas for full categorical data, see the :ref:`Categorical
+introduction <categorical>` and the
+:ref:`API documentation <api.categorical>`.)
+
+.. ipython:: python
+
+   dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
+
+   dfg
+
+   dfg.groupby(["A", "B"]).ngroup()
+
+   dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()
+
 Groupby by Indexer to 'resample' data
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Resampling produces new hypothetical samples(resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
+Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that generates data. These new samples are similar to the pre-existing samples.
 
 In order to resample to work on indices that are non-datetimelike , the following procedure can be utilized.
 
-In the following examples, **df.index // 5** returns a binary array which is used to determine what get's selected for the groupby operation.
+In the following examples, **df.index // 5** returns a binary array which is used to determine what gets selected for the groupby operation.
 
 .. note:: The below example shows how we can downsample by consolidation of samples into fewer samples. Here by using **df.index // 5**, we are aggregating the samples in bins. By applying **std()** function, we aggregate the information contained in many samples into a small subset of values which is their standard deviation thereby reducing the number of samples.
 
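
Note on the "Multi-column factorization" section added in this file: the comparison with ``factorize`` can be checked directly. The sketch below assumes a small hypothetical frame (not the one in the commit) and uses ``pd.factorize(..., sort=True)`` so that its codes follow sorted order, which is the default ordering ``ngroup`` inherits from ``groupby``. It is only meant to illustrate the relationship described in the added paragraph.

```python
import pandas as pd

# Hypothetical data, chosen so that the (A, B) pairs are not all
# determined by A alone.
dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("abaab")})

# Single column: sorted factorize codes line up with ngroup().
codes, uniques = pd.factorize(dfg["A"], sort=True)
single = dfg.groupby("A").ngroup()
print(codes.tolist())   # [0, 0, 1, 2, 1]
print(single.tolist())  # [0, 0, 1, 2, 1]

# Two columns: each distinct (A, B) pair gets one integer,
# numbered in sorted order of the pairs.
pair_codes = dfg.groupby(["A", "B"]).ngroup()
print(pair_codes.tolist())  # [0, 1, 2, 4, 3]
```

The two-column call has no direct single-call ``factorize`` equivalent on the raw columns, which appears to be the use case the added section has in mind.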

doc/source/reshaping.rst

+1-1
@@ -636,7 +636,7 @@ When a column contains only one level, it will be omitted in the result.
 
    pd.get_dummies(df, drop_first=True)
 
-
+.. _reshaping.factorize:
 
 Factorizing values
 ------------------

doc/source/whatsnew/v0.20.2.txt

+5
@@ -23,6 +23,11 @@ Enhancements
 - ``Series`` provides a ``to_latex`` method (:issue:`16180`)
 - Added :attr:`Index.is_strictly_monotonic_increasing` and :attr:`Index.is_strictly_monotonic_decreasing` properties (:issue:`16515`)
 
+- A new groupby method :meth:`~pandas.core.groupby.GroupBy.ngroup`,
+  parallel to the existing :meth:`~pandas.core.groupby.GroupBy.cumcount`,
+  has been added to return the group order (:issue:`11642`); see
+  :ref:`here <groupby.ngroup>`.
+
 .. _whatsnew_0202.performance:
 
 Performance Improvements

pandas/core/groupby.py

+74-1
@@ -150,7 +150,7 @@
     'last', 'first',
     'head', 'tail', 'median',
     'mean', 'sum', 'min', 'max',
-    'cumcount',
+    'cumcount', 'ngroup',
     'resample',
     'rank', 'quantile',
     'fillna',
@@ -1437,6 +1437,75 @@ def nth(self, n, dropna=None):
 
         return result
 
+    @Substitution(name='groupby')
+    @Appender(_doc_template)
+    def ngroup(self, ascending=True):
+        """
+        Number each group from 0 to the number of groups - 1.
+
+        This is the enumerative complement of cumcount.  Note that the
+        numbers given to the groups match the order in which the groups
+        would be seen when iterating over the groupby object, not the
+        order they are first observed.
+
+        .. versionadded:: 0.20.2
+
+        Parameters
+        ----------
+        ascending : bool, default True
+            If False, number in reverse, from number of groups - 1 to 0.
+
+        Examples
+        --------
+
+        >>> df = pd.DataFrame({"A": list("aaabba")})
+        >>> df
+           A
+        0  a
+        1  a
+        2  a
+        3  b
+        4  b
+        5  a
+        >>> df.groupby('A').ngroup()
+        0    0
+        1    0
+        2    0
+        3    1
+        4    1
+        5    0
+        dtype: int64
+        >>> df.groupby('A').ngroup(ascending=False)
+        0    1
+        1    1
+        2    1
+        3    0
+        4    0
+        5    1
+        dtype: int64
+        >>> df.groupby(["A", [1,1,2,3,2,1]]).ngroup()
+        0    0
+        1    0
+        2    1
+        3    3
+        4    2
+        5    0
+        dtype: int64
+
+        See also
+        --------
+        .cumcount : Number the rows in each group.
+
+        """
+
+        self._set_group_selection()
+
+        index = self._selected_obj.index
+        result = Series(self.grouper.group_info[0], index)
+        if not ascending:
+            result = self.ngroups - 1 - result
+        return result
+
     @Substitution(name='groupby')
     @Appender(_doc_template)
     def cumcount(self, ascending=True):
@@ -1481,6 +1550,10 @@ def cumcount(self, ascending=True):
         4    0
         5    0
         dtype: int64
+
+        See also
+        --------
+        .ngroup : Number the groups themselves.
         """
 
         self._set_group_selection()
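
Note on the ``ngroup`` body in the groupby.py diff above: the method takes the per-row group codes and, for ``ascending=False``, flips them with ``self.ngroups - 1 - result``. The library-free sketch below mirrors that arithmetic on a plain list. ``ngroup_sketch`` is a hypothetical helper, not pandas code, and it assumes the default behaviour where groups are numbered in sorted key order.

```python
def ngroup_sketch(keys, ascending=True):
    """Library-free illustration; not the pandas implementation."""
    # Map each distinct key to its rank in sorted order (0 .. ngroups - 1).
    rank = {key: i for i, key in enumerate(sorted(set(keys)))}
    ngroups = len(rank)
    codes = [rank[key] for key in keys]
    if not ascending:
        # Same flip as in the diff: ngroups - 1 - result.
        codes = [ngroups - 1 - code for code in codes]
    return codes


keys = list("aaabba")
print(ngroup_sketch(keys))                   # [0, 0, 0, 1, 1, 0]
print(ngroup_sketch(keys, ascending=False))  # [1, 1, 1, 0, 0, 1]
```

Both printed lists match the first two docstring examples in the diff.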
