PERF: improves SeriesGroupBy.nunique performance #10894

behzadnouri · 2015-08-24T00:42:38Z

on master:

In [2]: df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
   ...:                    'b': np.random.randint(10, size=100000)})

In [3]: %timeit df.groupby('a')['b'].nunique()
1 loops, best of 3: 1.66 s per loop

In [4]: %timeit df.groupby(['a', 'b'])['b'].first().groupby(level=0).size()
10 loops, best of 3: 36.3 ms per loop

on branch:

In [2]: %timeit df.groupby('a')['b'].nunique()
10 loops, best of 3: 29.2 ms per loop
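
The two expressions timed on master should give the same per-group counts, which is why the two-step form is a useful baseline for the speedup. The following is a minimal sketch of that check, not code from this PR. It assumes a recent pandas/NumPy install, and the seeded generator `rng` and the names `direct`, `two_step`, and `same` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the timings above; the seed is an assumption
# added here so the check is reproducible.
rng = np.random.default_rng(1234)
df = pd.DataFrame({
    "a": rng.integers(10000, size=100000),
    "b": rng.integers(10, size=100000),
})

# Direct call: number of distinct 'b' values within each 'a' group.
direct = df.groupby("a")["b"].nunique()

# Two-step form from the timing above: keep one row per (a, b) pair,
# then count how many pairs fall under each 'a'.
two_step = df.groupby(["a", "b"])["b"].first().groupby(level=0).size()

# Both results are indexed by 'a'; compare index and values.
same = direct.index.equals(two_step.index) and bool(
    (direct.to_numpy() == two_step.to_numpy()).all()
)
print(same)
```

The data here contains no missing values, so the default `dropna` behaviour of `nunique` does not affect the comparison.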

behzadnouri · 2015-08-24T16:06:41Z

benchmarks:

--------------------------------------------------------------------------------
Test name                                    | head[ms] |  base[ms] |  ratio   |
--------------------------------------------------------------------------------
groupby_ngroups_10000_nunique                |   6.8550 | 1701.2453 |   0.0040 |
groupby_ngroups_100_nunique                  |   0.5283 |   18.2413 |   0.0290 |
--------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [091c73d] : PERF: improves SeriesGroupBy.nunique performance
Base   [54f02df] : COMPAT: value_counts always return int64 dtype, xref #10876
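
For reading the table, the ratio column is consistent with head time divided by base time, so its reciprocal is the speedup. A small sketch of that arithmetic, using only the numbers reported above (the dictionary `timings_ms` and the variable names are illustrative, not part of the benchmark tooling):

```python
# (head[ms], base[ms]) pairs copied from the benchmark table above.
timings_ms = {
    "groupby_ngroups_10000_nunique": (6.8550, 1701.2453),
    "groupby_ngroups_100_nunique": (0.5283, 18.2413),
}

ratios = {}
speedups = {}
for name, (head, base) in timings_ms.items():
    ratios[name] = head / base      # < 1.0 means head (target) is faster
    speedups[name] = base / head    # how many times faster the target is
    print(f"{name}: ratio={ratios[name]:.4f}, speedup={speedups[name]:.1f}x")
```

On these figures the target commit is roughly 248 times faster for 10000 groups and roughly 35 times faster for 100 groups.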

jreback · 2015-08-24T18:21:58Z

@behzadnouri looks great. can you add a benchmark as well. ping when green.

behzadnouri · 2015-08-24T18:27:09Z

@jreback there already are 2 benchmarks

jreback · 2015-08-24T18:33:39Z

oh, right, ok then.

jreback · 2015-08-24T18:34:06Z

@behzadnouri thanks!

aldanor · 2015-08-24T19:44:45Z

That's awesome, sort of what I was thinking about but didn't have time to hack together :/ Thanks @behzadnouri!

behzadnouri force-pushed the grby-nunique branch 2 times, most recently from f6e97ac to 64f445e, August 24, 2015 11:04

PERF: improves SeriesGroupBy.nunique performance

091c73d

behzadnouri force-pushed the grby-nunique branch from 64f445e to 091c73d, August 24, 2015 11:32

jreback added the Groupby and Performance (memory or execution speed) labels Aug 24, 2015

jreback added this to the 0.17.0 milestone Aug 24, 2015

jreback added a commit that referenced this pull request Aug 24, 2015

Merge pull request #10894 from behzadnouri/grby-nunique

07042a9

PERF: improves SeriesGroupBy.nunique performance

jreback merged commit 07042a9 into pandas-dev:master Aug 24, 2015

jreback mentioned this pull request Sep 12, 2015

Broken nunique on Series group by #11077

Closed

behzadnouri deleted the grby-nunique branch November 11, 2015 16:27

jorisvandenbossche mentioned this pull request Nov 18, 2015

BUG: groupby nunique with Categorical and missing categories gives ValueError #11635

Closed

dsaxton mentioned this pull request May 9, 2020

TST: Mark groupby.nunique test as slow #34096

Merged

jorisvandenbossche mentioned this pull request Nov 17, 2023

PERF: nunique is slower than unique.apply(len) on a groupby #55972

Closed
