ENH: enumerate groups #11642

Closed
dsm054 opened this issue Nov 18, 2015 · 5 comments · Fixed by #14026

@dsm054
Contributor

dsm054 commented Nov 18, 2015

Sometimes it's handy to have access to a distinct integer for each group. For example, using the (internal) grouper:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab"*3), "c": range(6)})
>>> df["group_id"] = df.groupby(["a","b"]).grouper.group_info[0]
>>> df
   a  b  c  group_id
0  x  a  0         0
1  y  b  1         2
2  y  a  2         1
3  z  b  3         3
4  x  a  4         0
5  y  b  5         2

This can be achieved in a number of ways, but none of them is particularly elegant, especially if we're grouping on multiple keys and/or Series. Accordingly, after a brief discussion on gitter, I propose a new method transform("enumerate") which returns a Series of integers from 0 to ngroups-1, matching the order in which the groups will be iterated. In other words, we'll simply be applying the following map:

>>> m = {k: i for i, (k,g) in enumerate(df.groupby(["a","b"]))}
>>> m
{('x', 'a'): 0, ('y', 'b'): 2, ('y', 'a'): 1, ('z', 'b'): 3}

(Note that this only shows the desired behaviour; it isn't how it would be implemented!)
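To make the proposed mapping concrete, here is a minimal sketch that applies it back to the rows to get one integer per row. The names `keys` and `group_id` are illustrative choices, and, like the map above, this only demonstrates the desired output rather than a suggested implementation:

```python
import pandas as pd

df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab" * 3), "c": range(6)})
keys = ["a", "b"]  # illustrative: the grouping columns

# Number the groups in the order groupby iterates over them
# (sorted by key under the default sort=True).
m = {k: i for i, (k, _) in enumerate(df.groupby(keys))}

# Look up each row's key tuple in the map to get one integer per row.
row_keys = zip(*(df[col] for col in keys))
group_id = pd.Series([m[k] for k in row_keys], index=df.index)

print(group_id.tolist())  # [0, 2, 1, 3, 0, 2]
```

The result matches the group_id column produced via the internal grouper in the first example.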

@jreback
Contributor

jreback commented Nov 18, 2015

Can you show an example of its utility?

Also note that this is really only useful as a .transform method (a reduction is kind of silly, as it's just range(len(df.groupby(...)))).

@shoyer
Member

shoyer commented Nov 19, 2015

Note that this is essentially the same information provided by pandas.factorize:

In [1]: import pandas as pd

In [2]: pd.factorize(['a', 'a', 'b', 'c'])
Out[2]: (array([0, 0, 1, 2]), array(['a', 'b', 'c'], dtype=object))

@dsm054
Contributor Author

dsm054 commented Nov 19, 2015

I couldn't think of a clean way to get factorize to handle the same inputs as groupby, though (both the multiple-series case and the mixed column-name/list input.) Might have missed something obvious, of course, as is my wont. But if I needed to write a few lines to get it to work, then those lines would more naturally fit as a groupby method, or so it seemed to me.
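To illustrate the "few lines" in question, here is one possible sketch of a two-key workaround built on pandas.factorize. It is only an assumption about how such a workaround might look: it factorizes each column separately, combines the codes into a single integer per row, and factorizes again. It also assumes there are no missing values in the key columns, since factorize codes those as -1:

```python
import pandas as pd

df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab" * 3), "c": range(6)})

# Factorize each key column on its own; sort=True makes the codes follow sorted key order.
codes_a, uniques_a = pd.factorize(df["a"], sort=True)
codes_b, uniques_b = pd.factorize(df["b"], sort=True)

# Combine the per-column codes into one integer per (a, b) pair.
# Assumes no missing keys, which factorize would code as -1.
combined = codes_a * len(uniques_b) + codes_b

# Factorize again to compress the combined codes into 0..ngroups-1.
group_id, _ = pd.factorize(combined, sort=True)

print(group_id.tolist())  # [0, 2, 1, 3, 0, 2]
```

This reproduces the ids from the first example, but it has to be adapted by hand for every combination of keys, which is the inconvenience described above.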

@dsm054
Contributor Author

dsm054 commented Nov 21, 2015

As I went to implement this, I started to wonder whether it would make more sense to use df.groupby("a").enumerate() instead of df.groupby("a").transform("enumerate"). That would be parallel with df.groupby("a").cumcount(), which is a method rather than df.groupby("a").transform("cumcount") (which doesn't work). This would give us something like

>>> df = pd.DataFrame({"A": [1,2,2,2,1]})
>>> df["group_id"] = df.groupby("A").enumerate()
>>> df["group_index"] = df.groupby("A").cumcount()
>>> df
   A  group_id  group_index
0  1         0            0
1  2         1            0
2  2         1            1
3  2         1            2
4  1         0            1

@jreback
Contributor

jreback commented Nov 21, 2015

that looks reasonable

dsm054 added a commit to dsm054/pandas that referenced this issue Mar 22, 2017
@jreback jreback modified the milestones: 0.20.2, Next Major Release May 31, 2017
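
For readers on a recent pandas release: the thread itself does not name the final method, but the behaviour requested here is available as GroupBy.ngroup(), the counterpart to cumcount(). A minimal sketch of the last example from the discussion, assuming a pandas version that includes ngroup():

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 2, 1]})

# One integer per group, numbered in the order the groups are iterated.
df["group_id"] = df.groupby("A").ngroup()

# Position of each row within its own group.
df["group_index"] = df.groupby("A").cumcount()

print(df)
```

The two columns come out the same as in the enumerate()/cumcount() table proposed above.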