ENH: enumerate groups #11642

dsm054 · 2015-11-18T22:49:17Z

Sometimes it's handy to have access to a distinct integer for each group. For example, using the (internal) grouper:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab"*3), "c": range(6)})
>>> df["group_id"] = df.groupby(["a","b"]).grouper.group_info[0]
>>> df
   a  b  c  group_id
0  x  a  0         0
1  y  b  1         2
2  y  a  2         1
3  z  b  3         3
4  x  a  4         0
5  y  b  5         2

This can be achieved in a number of ways but none of them are particularly elegant, esp. if we're grouping on multiple keys and/or Series. Accordingly, after a brief discussion on gitter, I propose a new method transform("enumerate") which returns a Series of integers from 0 to ngroups-1 matching the order the groups will be iterated in. In other words, we'll simply be applying the following map:

>>> m = {k: i for i, (k,g) in enumerate(df.groupby(["a","b"]))}
>>> m
{('x', 'a'): 0, ('y', 'b'): 2, ('y', 'a'): 1, ('z', 'b'): 3}

(Note this is only to show the desired behaviour, and wouldn't be how it'd be implemented!)
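
To make the intent concrete, here is a minimal sketch of applying that map row by row to recover the group_id column shown earlier. The names keys, row_keys and group_id are illustrative only, and, as with the map itself, this demonstrates the behaviour rather than a sensible implementation.

```python
import pandas as pd

df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab" * 3), "c": range(6)})
keys = ["a", "b"]

# Map each group key to its position in groupby iteration order (sorted by default).
m = {k: i for i, (k, g) in enumerate(df.groupby(keys))}

# Build the key tuple for every row and look it up in the map.
row_keys = zip(df["a"], df["b"])
group_id = pd.Series([m[k] for k in row_keys], index=df.index)

print(group_id.tolist())  # [0, 2, 1, 3, 0, 2]
```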


jreback · 2015-11-18T22:53:39Z

Can you show an example of its utility?

Also note that this is really only useful as a .transform method (a reduction is kind of silly, as it's just range(len(df.groupby(...)))).

shoyer · 2015-11-19T00:20:07Z

Note that this is essentially exactly the same information provided by pandas.factorize:

In [1]: import pandas as pd

In [2]: pd.factorize(['a', 'a', 'b', 'c'])
Out[2]: (array([0, 0, 1, 2]), array(['a', 'b', 'c'], dtype=object))

dsm054 · 2015-11-19T00:32:54Z

I couldn't think of a clean way to get factorize to handle the same inputs as groupby, though (both the multiple-series case and the mixed column-name/list input.) Might have missed something obvious, of course, as is my wont. But if I needed to write a few lines to get it to work, then those lines would more naturally fit as a groupby method, or so it seemed to me.
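
For illustration, one possible workaround with factorize on multiple columns might look like the sketch below. Collapsing the key columns into tuples is an assumption made for this example (key_tuples is a made-up name), not something proposed in the thread, and it only covers the column-name case, not arbitrary Series keys.

```python
import pandas as pd

df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab" * 3), "c": range(6)})

# Collapse the two key columns into a single column of tuples
# (an illustrative workaround, not an established idiom).
key_tuples = df[["a", "b"]].apply(tuple, axis=1)

# By default factorize numbers the keys in order of first appearance.
codes, uniques = pd.factorize(key_tuples)

print(codes.tolist())  # [0, 1, 2, 3, 0, 1]
```

Note that these codes follow first appearance, so they differ from the sorted groupby order ([0, 2, 1, 3, 0, 2]) in the original example; factorize also has a sort parameter that should bring the two closer in line.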

dsm054 · 2015-11-21T17:47:17Z

As I went to implement this, I started to wonder if it doesn't make more sense to use df.groupby("a").enumerate() instead of df.groupby("a").transform("enumerate"), to be parallel with df.groupby("a").cumcount(), instead of df.groupby("a").transform("cumcount") (which doesn't work.) This would give us something like

>>> df = pd.DataFrame({"A": [1,2,2,2,1]})
>>> df["group_id"] = df.groupby("A").enumerate()
>>> df["group_index"] = df.groupby("A").cumcount()
>>> df
   A  group_id  group_index
0  1         0            0
1  2         1            0
2  2         1            1
3  2         1            2
4  1         0            1

jreback · 2015-11-21T18:09:07Z

that looks reasonable

Closes pandas-dev#11642

jreback added Enhancement Groupby API Design Difficulty Intermediate labels Nov 18, 2015

jreback added this to the Next Major Release milestone Nov 18, 2015

dsm054 added a commit to dsm054/pandas that referenced this issue Aug 18, 2016

ENH: Add groupby().enumerate method to count groups (pandas-dev#11642)

a6e60a7

dsm054 mentioned this issue Aug 18, 2016

ENH: Add groupby().ngroup() method to count groups (#11642) #14026

Merged

4 tasks

dsm054 added a commit to dsm054/pandas that referenced this issue Mar 22, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

7aee071

Closes pandas-dev#11642

dsm054 added a commit to dsm054/pandas that referenced this issue Mar 22, 2017

ENH: add .ngroup() method to groupby objects (pandas-dev#14026)

966f9be

Closes pandas-dev#11642

jreback modified the milestones: 0.20.2, Next Major Release May 31, 2017

jreback closed this as completed in #14026 Jun 1, 2017
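
As the commit titles above indicate, the method was ultimately added under the name ngroup() rather than enumerate(). A minimal sketch of the earlier example rewritten with that name, assuming a pandas version that includes the merged change:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 2, 1]})

# One integer per group, numbered in groupby iteration order.
df["group_id"] = df.groupby("A").ngroup()

# Position of each row within its own group.
df["group_index"] = df.groupby("A").cumcount()

print(df["group_id"].tolist())     # [0, 1, 1, 1, 0]
print(df["group_index"].tolist())  # [0, 0, 1, 2, 1]
```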
