
nunique performance for groupby with large number of groups #10820


Closed
aldanor opened this issue Aug 14, 2015 · 8 comments · Fixed by #10894
Labels: Groupby, Performance

aldanor (Contributor) commented Aug 14, 2015

It looks like len(set) beats both len(np.unique) and pd.Series.nunique when applied naively -- here's an example with a large number of groups, where we compute the unique counts of one column while grouping by another:

>>> df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                       'b': np.random.randint(10, size=100000)})
>>> g = df.groupby('a')

>>> %timeit g.b.nunique()
1 loops, best of 3: 1 s per loop

>>> %timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop

>>> %timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop

>>> %timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop

The fastest way I know to accomplish the same thing is this:

>>> g = df.groupby(['a', 'b'])

>>> %timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop

... which is a LOT faster. It works because grouping by both columns collapses duplicate (a, b) pairs, so the size of each level-0 group is exactly the number of distinct b values for that a.
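(An equivalent spelling of the same trick, for reference -- drop duplicate (a, b) pairs first, then size the groups:)

>>> df.drop_duplicates(['a', 'b']).groupby('a').size()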

Wonder if something similar could be done in GroupBy.nunique since it's quite a common use case?

aldanor changed the title from "nunique for groupby with large number of groups" to "nunique performance for groupby with large number of groups" Aug 14, 2015
jreback (Contributor) commented Aug 14, 2015

sure that's a clever way of doing this. nunique is just a dispatched method back to the object -- essentially a loop over each of the sub-frames. So you could simply implement this as a method on NDFrameGroupBy, or maybe just on SeriesGroupBy. Alternatively this is an easy cython method.

Note that there is a dropna=True keyword which should be implemented as well (as it exists on .nunique now), but that's easy.

can you submit a pull-request?
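A minimal sketch of the vectorized idea in pure numpy rather than cython, assuming factorized group labels (nunique_per_group is a made-up name, and dropna handling is omitted):

import numpy as np
import pandas as pd

def nunique_per_group(labels, values):
    # labels: integer group codes from pd.factorize (0..ngroups-1)
    # values: same-length array whose distinct count we want per group
    # (assumes no NaNs in values; dropna support would mask them out first)
    if len(labels) == 0:
        return np.array([], dtype=np.int64)
    order = np.lexsort((values, labels))        # sort by group, then by value
    lab, val = labels[order], values[order]
    is_first = np.empty(len(lab), dtype=bool)   # True where a new (group, value) run starts
    is_first[0] = True
    is_first[1:] = (lab[1:] != lab[:-1]) | (val[1:] != val[:-1])
    return np.bincount(lab[is_first])           # number of distinct values per group code

>>> labels, index = pd.factorize(df['a'], sort=True)
>>> pd.Series(nunique_per_group(labels, df['b'].values), index=index)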

jreback added the Groupby and Performance labels Aug 14, 2015
jreback added this to the Next Major Release milestone Aug 14, 2015
aldanor (Contributor, Author) commented Aug 17, 2015

@jreback Sure, could give it a try. Would appreciate it if you could give some pointers as to where this sort of thing should go -- I'm only vaguely familiar with the pandas codebase.

As in, there's NDFrameGroupBy, DataFrameGroupBy, PanelGroupBy, SeriesGroupBy -- should it become an instance method on one of these? The only special methods I see on NDFrameGroupBy are transform, aggregate and filter, which seem pretty generic.

Also, if this is to be supported in NDFrameGroupBy, there would have to be an axis keyword argument, I think?

jreback (Contributor) commented Aug 17, 2015

This just becomes a method on SeriesGroupBy (which is what you get when you do df.groupby('a').b, for example). Method calls on it are then dispatched to the actual generated Series, unless there is a method defined for it (e.g. things like .min/.max/.mean -- IOW, the cython methods).

Currently this is defined in the whitelist (just search for nunique), so you would take it out of there and define it as a function.
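In other words, a whitelisted name with no dedicated groupby implementation behaves roughly like this (a sketch of the dispatch described above, not the actual internals):

>>> # what g.b.nunique() effectively does today, via dispatch:
>>> df.groupby('a')['b'].apply(lambda s: s.nunique())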

aldanor (Contributor, Author) commented Aug 18, 2015

Ok, thanks! By the way, would it make sense for df.groupby('a').nunique() to return a DataFrame with unique counts per column? (This won't work for value_counts due to mismatching return-value shapes, but here it could.)

Another somewhat related note -- the whole whitelist thing in GroupBy seems extremely hacky; why not use metaclasses to generate the methods rather than exec'ing strings?
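For what it's worth, a generic sketch of the metaclass idea (illustrative only, not pandas code -- WhitelistMeta, _whitelist and _make_dispatcher are all made-up names here):

def _make_dispatcher(name):
    # build a method that applies the like-named Series method group-wise
    def method(self, *args, **kwargs):
        return self.apply(lambda obj: getattr(obj, name)(*args, **kwargs))
    method.__name__ = name
    return method

class WhitelistMeta(type):
    # generate a real dispatch method for every whitelisted name,
    # instead of exec'ing generated source strings at import time
    def __new__(mcls, clsname, bases, namespace):
        for name in namespace.get('_whitelist', ()):
            namespace.setdefault(name, _make_dispatcher(name))
        return super().__new__(mcls, clsname, bases, namespace)

A groupby class declaring _whitelist = ('nunique', ...) would then grow ordinary, introspectable methods.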

jreback (Contributor) commented Aug 18, 2015

.nunique is very similar in concept to the following. In fact you could simply write some cython code that emulates count (which is just the number of non-nulls per group).

In [4]: df = DataFrame({'A' : ['foo','bar','bar'], 'B' : [1,2,3], 'C' : [1,1,1]})

In [5]: df
Out[5]: 
     A  B  C
0  foo  1  1
1  bar  2  1
2  bar  3  1

In [6]: df.groupby('A').count()
Out[6]: 
     B  C
A        
bar  2  2
foo  1  1

If you wanted to try changing how the whitelist is used, that would be ok. You could use a metaclass; not sure if that would be much less magical, but maybe.

aldanor (Contributor, Author) commented Aug 18, 2015

Hmm, not really though, since in your example the counts for column C are [2, 1] whereas the unique counts are [1, 1]. It's more like you have to groupby/reindex by ['A', 'B'] and get counts of the second index level for each value of the first level, and then do the same for ['A', 'C'].
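For concreteness, the drop_duplicates spelling of the trick from above, applied to this df, reproduces exactly those numbers:

>>> df.drop_duplicates(['A', 'B']).groupby('A').size()
A
bar    2
foo    1
dtype: int64

>>> df.drop_duplicates(['A', 'C']).groupby('A').size()
A
bar    1
foo    1
dtype: int64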

jreback (Contributor) commented Aug 18, 2015

I didn't say the result was the same, just the idea.

aldanor (Contributor, Author) commented Aug 18, 2015

Looks like in the most naive implementation this would require constructing a groupby object from another, already instantiated groupby object, but grouped by one additional key -- wonder if there's any built-in functionality for that?
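One workaround sketch, assuming no derived-groupby API exists (g.obj is the groupby's underlying object): re-group the original frame with the extra key rather than deriving anything from the existing groupby.

>>> g = df.groupby('a')
>>> g2 = g.obj.groupby(['a', 'b'])  # rebuild the grouping with one extra key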

jreback modified the milestones: 0.17.0, Next Major Release Aug 24, 2015