PERF: DataFrame.groupby.nunique is non-performant #15197

Closed
jreback opened this issue Jan 23, 2017 · 1 comment
Labels
Groupby · Performance (Memory or execution speed) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Milestone

Comments

jreback commented Jan 23, 2017

xref #14376

# from the asv; assumes `from pandas import DataFrame` and
# `from numpy.random import randint`
In [10]: n = 10000
    ...: df = DataFrame({'key1': randint(0, 500, size=n),
    ...:                 'key2': randint(0, 100, size=n),
    ...:                 'ints': randint(0, 1000, size=n),
    ...:                 'ints2': randint(0, 1000, size=n)})
    ...:

In [11]: %timeit df.groupby(['key1', 'key2']).nunique()
1 loop, best of 3: 4.25 s per loop

In [12]: result = df.groupby(['key1', 'key2']).nunique()

In [13]: g = df.groupby(['key1', 'key2'])

In [14]: expected = pd.concat([getattr(g, col).nunique() for col in g._selected_obj.columns], axis=1)

In [15]: result.equals(expected)
Out[15]: True

In [16]: %timeit pd.concat([getattr(g, col).nunique() for col in g._selected_obj.columns], axis=1)
100 loops, best of 3: 6.94 ms per loop

Series.groupby.nunique has a very performant implementation, but DataFrame.groupby.nunique is implemented via .apply, so it ends up in a Python loop over the groups, which nullifies that fast path.

Should be straightforward to fix; need to make sure to test with as_index=True/False.
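The workaround timed above can be reproduced as a standalone script. This is a minimal sketch, not the session verbatim: it assumes a modern NumPy/pandas install, uses NumPy's `default_rng` Generator instead of the bare `randint` from the session, and selects the value columns explicitly rather than going through the private `g._selected_obj` (in current pandas the grouping keys are no longer included in the nunique result anyway):

```python
import numpy as np
import pandas as pd

# Build a frame like the asv benchmark: two grouping keys, two value columns.
n = 10_000
rng = np.random.default_rng(0)
df = pd.DataFrame({'key1': rng.integers(0, 500, size=n),
                   'key2': rng.integers(0, 100, size=n),
                   'ints': rng.integers(0, 1000, size=n),
                   'ints2': rng.integers(0, 1000, size=n)})

g = df.groupby(['key1', 'key2'])

# The (pre-fix) slow path: DataFrame.groupby.nunique.
result = g.nunique()

# The fast path: one Series.groupby.nunique call per value column,
# stitched back together column-wise.
expected = pd.concat([g[col].nunique() for col in ['ints', 'ints2']], axis=1)

print(result[['ints', 'ints2']].equals(expected))  # True
```

The key observation is that each `g[col].nunique()` call hits the cythonized Series path, so the whole thing does a constant number of vectorized passes instead of one Python-level `.apply` call per group.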

@jreback jreback added Difficulty Intermediate Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 23, 2017
@jreback jreback added this to the 0.20.0 milestone Jan 23, 2017

jreback commented Jan 23, 2017

cc @xflr6

@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Jan 23, 2017
jreback added a commit to jreback/pandas that referenced this issue Jan 23, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
closes pandas-dev#15197

Author: Jeff Reback <[email protected]>

Closes pandas-dev#15201 from jreback/nunique and squashes the following commits:

6d02616 [Jeff Reback] PERF: DataFrame.groupby.nunique