
nunique performance for groupby with large number of groups #10820


Closed
aldanor opened this issue Aug 14, 2015 · 8 comments · Fixed by #10894
Labels: Groupby, Performance

aldanor (Contributor) commented Aug 14, 2015

It looks like len(set) beats both len(np.unique) and pd.Series.nunique when applied naively -- here's an example with a large number of groups, where we compute the unique counts of one column while grouping by another:

>>> df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
                       'b': np.random.randint(10, size=100000)})
>>> g = df.groupby('a')

>>> %timeit g.b.nunique()
1 loops, best of 3: 1 s per loop

>>> %timeit g.b.apply(pd.Series.nunique)
1 loops, best of 3: 992 ms per loop

>>> %timeit g.b.apply(lambda x: np.unique(x.values).size)
1 loops, best of 3: 652 ms per loop

>>> %timeit g.b.apply(lambda x: len(set(x.values)))
1 loops, best of 3: 469 ms per loop

The fastest way I know to accomplish the same thing is this:

>>> g = df.groupby(['a', 'b'])

>>> %timeit g.b.first().groupby(level=0).size()
100 loops, best of 3: 6.2 ms per loop

... which is a LOT faster. It works because grouping by both columns collapses duplicate (a, b) pairs, so the size of each level-0 group is exactly the number of distinct b values for that a.
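(An equivalent spelling of the same trick, for reference -- drop duplicate (a, b) pairs first, then size the groups:)

>>> df.drop_duplicates(['a', 'b']).groupby('a').size()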

Wonder if something similar could be done in GroupBy.nunique since it's quite a common use case?

aldanor changed the title from "nunique for groupby with large number of groups" to "nunique performance for groupby with large number of groups" Aug 14, 2015
jreback (Contributor) commented Aug 14, 2015

sure that's a clever way of doing this. nunique is just a dispatched method back to the object -- essentially a loop over each of the sub-frames. So you could simply implement this as a method on NDFrameGroupBy, or maybe just on SeriesGroupBy. Alternatively this is an easy cython method.

Note that there is a dropna=True keyword which should be implemented as well (as it exists on .nunique now), but that's easy.

can you submit a pull-request?
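A minimal sketch of the vectorized idea in pure numpy rather than cython, assuming factorized group labels (nunique_per_group is a made-up name, and dropna handling is omitted):

import numpy as np
import pandas as pd

def nunique_per_group(labels, values):
    # labels: integer group codes from pd.factorize (0..ngroups-1)
    # values: same-length array whose distinct count we want per group
    # (assumes no NaNs in values; dropna support would mask them out first)
    if len(labels) == 0:
        return np.array([], dtype=np.int64)
    order = np.lexsort((values, labels))        # sort by group, then by value
    lab, val = labels[order], values[order]
    is_first = np.empty(len(lab), dtype=bool)   # True where a new (group, value) run starts
    is_first[0] = True
    is_first[1:] = (lab[1:] != lab[:-1]) | (val[1:] != val[:-1])
    return np.bincount(lab[is_first])           # number of distinct values per group code

>>> labels, index = pd.factorize(df['a'], sort=True)
>>> pd.Series(nunique_per_group(labels, df['b'].values), index=index)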

jreback added the Groupby and Performance labels Aug 14, 2015
jreback added this to the Next Major Release milestone Aug 14, 2015
aldanor (Contributor, Author) commented Aug 17, 2015

@jreback Sure, could give it a try. Would appreciate it if you could give some pointers as to where this sort of thing should go -- I'm only vaguely familiar with the pandas codebase.

As in, there's NDFrameGroupBy, DataFrameGroupBy, PanelGroupBy, SeriesGroupBy -- should it become an instance method on one of these? The only special methods I see on NDFrameGroupBy are transform, aggregate and filter, which seem pretty generic.

Also, if this is to be supported in NDFrameGroupBy, there would have to be an axis keyword argument, I think?

jreback (Contributor) commented Aug 17, 2015

This just becomes a method on SeriesGroupBy (which is what you get when you do df.groupby('a').b, for example). Method calls on it are then dispatched to the actual generated Series, unless there is a method defined for it (e.g. things like .min/.max/.mean -- IOW, the cython methods).

Currently this is defined in the whitelist (just search for nunique), so you would take it out of there and define it as a function.
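In other words, a whitelisted name with no dedicated groupby implementation behaves roughly like this (a sketch of the dispatch described above, not the actual internals):

>>> # what g.b.nunique() effectively does today, via dispatch:
>>> df.groupby('a')['b'].apply(lambda s: s.nunique())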

aldanor (Contributor, Author) commented Aug 18, 2015

Ok, thanks! By the way, would it make sense for df.groupby('a').nunique() to return a DataFrame with unique counts per column? (This won't work for value_counts due to mismatching return-value shapes, but here it could.)

Another somewhat related note -- the whole whitelist thing in GroupBy seems extremely hacky; why not use metaclasses to generate the methods rather than exec'ing strings?
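For what it's worth, a generic sketch of the metaclass idea (illustrative only, not pandas code -- WhitelistMeta, _whitelist and _make_dispatcher are all made-up names here):

def _make_dispatcher(name):
    # build a method that applies the like-named Series method group-wise
    def method(self, *args, **kwargs):
        return self.apply(lambda obj: getattr(obj, name)(*args, **kwargs))
    method.__name__ = name
    return method

class WhitelistMeta(type):
    # generate a real dispatch method for every whitelisted name,
    # instead of exec'ing generated source strings at import time
    def __new__(mcls, clsname, bases, namespace):
        for name in namespace.get('_whitelist', ()):
            namespace.setdefault(name, _make_dispatcher(name))
        return super().__new__(mcls, clsname, bases, namespace)

A groupby class declaring _whitelist = ('nunique', ...) would then grow ordinary, introspectable methods.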

jreback (Contributor) commented Aug 18, 2015

.nunique is very similar in concept to the following. In fact you could simply write some cython code that emulates count (which is just the number of non-nulls per group).

In [4]: df = DataFrame({'A' : ['foo','bar','bar'], 'B' : [1,2,3], 'C' : [1,1,1]})

In [5]: df
Out[5]: 
     A  B  C
0  foo  1  1
1  bar  2  1
2  bar  3  1

In [6]: df.groupby('A').count()
Out[6]: 
     B  C
A        
bar  2  2
foo  1  1

If you wanted to try changing how the whitelist is used, that would be ok. You could use a metaclass; not sure if that would be much less magical, but maybe.

aldanor (Contributor, Author) commented Aug 18, 2015

Hmm, not really though, since in your example the counts for column C are [2, 1] whereas the unique counts are [1, 1]. It's more like you have to groupby/reindex by ['A', 'B'] and get counts of the second index level for each value of the first level, and then do the same for ['A', 'C'].
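For concreteness, the drop_duplicates spelling of the trick from above, applied to this df, reproduces exactly those numbers:

>>> df.drop_duplicates(['A', 'B']).groupby('A').size()
A
bar    2
foo    1
dtype: int64

>>> df.drop_duplicates(['A', 'C']).groupby('A').size()
A
bar    1
foo    1
dtype: int64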

jreback (Contributor) commented Aug 18, 2015

I didn't say the result was the same, just the idea.

aldanor (Contributor, Author) commented Aug 18, 2015

Looks like in the most naive implementation this would require constructing a groupby object from another, already instantiated groupby object, but grouped by one additional key -- wonder if there's any built-in functionality for that?
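One workaround sketch, assuming no derived-groupby API exists (g.obj is the groupby's underlying object): re-group the original frame with the extra key rather than deriving anything from the existing groupby.

>>> g = df.groupby('a')
>>> g2 = g.obj.groupby(['a', 'b'])  # rebuild the grouping with one extra key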

jreback modified the milestones: 0.17.0, Next Major Release Aug 24, 2015