PERF: improves SeriesGroupBy.nunique performance #10894

Merged
merged 1 commit into from
Aug 24, 2015

Conversation

behzadnouri
Contributor

closes #10820

on master:

In [2]: df = pd.DataFrame({'a': np.random.randint(10000, size=100000),
   ...:                    'b': np.random.randint(10, size=100000)})

In [3]: %timeit df.groupby('a')['b'].nunique()
1 loops, best of 3: 1.66 s per loop

In [4]: %timeit df.groupby(['a', 'b'])['b'].first().groupby(level=0).size()
10 loops, best of 3: 36.3 ms per loop

on branch:

In [2]: %timeit df.groupby('a')['b'].nunique()
10 loops, best of 3: 29.2 ms per loop
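The faster expression timed on master can be checked against `nunique` directly. The snippet below is a minimal sketch of that equivalence check, assuming a smaller frame than the one in the timings and no missing values in `b`; the seed and the variable names `direct` and `workaround` are illustrative and not part of this PR.

```python
import numpy as np
import pandas as pd

# Smaller frame than in the timings above so the check runs quickly.
# The seed is an arbitrary choice for reproducibility.
rng = np.random.RandomState(1234)
df = pd.DataFrame({'a': rng.randint(1000, size=10000),
                   'b': rng.randint(10, size=10000)})

# Direct call: number of distinct 'b' values within each 'a' group.
direct = df.groupby('a')['b'].nunique()

# Workaround from the timings: reduce to one row per (a, b) pair,
# then count how many pairs each 'a' value has.
workaround = df.groupby(['a', 'b'])['b'].first().groupby(level=0).size()

# With no NaNs in 'b', both should give the same counts per group.
same_index = direct.index.equals(workaround.index)
same_values = bool((direct.values == workaround.values).all())
print(same_index, same_values)
```

If `b` contained NaN the two expressions might differ, since `nunique` drops missing values by default and the second `groupby` has its own rules for NaN keys.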

@behzadnouri force-pushed the grby-nunique branch 2 times, most recently from f6e97ac to 64f445e (August 24, 2015 11:04)
@behzadnouri
Contributor Author

benchmarks:

-------------------------------------------------------------------------------
Test name                                    | head[ms] |  base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_ngroups_10000_nunique                |   6.8550 | 1701.2453 |   0.0040 |
groupby_ngroups_100_nunique                  |   0.5283 |   18.2413 |   0.0290 |
-------------------------------------------------------------------------------
Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [091c73d] : PERF: improves SeriesGroupBy.nunique performance
Base   [54f02df] : COMPAT: value_counts always return int64 dtype, xref #10876
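For readers wondering how a per-group unique count can avoid a Python-level loop over groups, here is a minimal NumPy sketch of one sort-based approach. It is an illustration under stated assumptions, not necessarily the code merged in this PR: the function name `nunique_per_group` is hypothetical, the inputs are assumed to be plain integer arrays, and NaN handling is left out.

```python
import numpy as np

def nunique_per_group(ids, values):
    """Count distinct values per group id with one sort (hypothetical helper).

    Assumes `ids` and `values` are 1-D arrays of equal length without NaNs.
    Returns (sorted unique ids, number of distinct values for each id).
    """
    ids = np.asarray(ids)
    values = np.asarray(values)
    if ids.size == 0:
        return ids[:0], np.array([], dtype=np.int64)

    # lexsort uses the LAST key as the primary key: sort by id, then by value.
    order = np.lexsort((values, ids))
    ids, values = ids[order], values[order]

    # True at the first row of every group.
    new_group = np.r_[True, ids[1:] != ids[:-1]]
    # True at the first row of every distinct (id, value) pair.
    new_pair = new_group | np.r_[True, values[1:] != values[:-1]]

    # Sum the "new pair" flags over each group's contiguous segment.
    starts = np.flatnonzero(new_group)
    counts = np.add.reduceat(new_pair.astype(np.int64), starts)
    return ids[starts], counts

group_ids, counts = nunique_per_group([3, 1, 1, 3, 3, 1], [9, 5, 5, 8, 7, 5])
print(group_ids, counts)
```

The cost is dominated by a single sort over all rows, so it does not grow with a per-group overhead the way a loop over 10000 groups would.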

@jreback
Contributor

jreback commented Aug 24, 2015

@behzadnouri looks great. can you add a benchmark as well. ping when green.

@jreback jreback added Groupby Performance Memory or execution speed performance labels Aug 24, 2015
@jreback jreback added this to the 0.17.0 milestone Aug 24, 2015
@behzadnouri
Contributor Author

@jreback there already are 2 benchmarks

@jreback
Contributor

jreback commented Aug 24, 2015

oh, right, ok then.

jreback added a commit that referenced this pull request Aug 24, 2015
PERF: improves SeriesGroupBy.nunique performance
@jreback jreback merged commit 07042a9 into pandas-dev:master Aug 24, 2015
@jreback
Contributor

jreback commented Aug 24, 2015

@behzadnouri thanks!

@aldanor
Contributor

aldanor commented Aug 24, 2015

That's awesome, sort of what I was thinking about but didn't have time to hack together :/ Thanks @behzadnouri!

Successfully merging this pull request may close these issues.

nunique performance for groupby with large number of groups