PERF: nunique perf improved by using len(unique) rather than value_counts #9364


Closed
gtnx wants to merge 4 commits

@gtnx gtnx commented Jan 27, 2015

Before:

In [1]: s=pandas.Series(range(100)*1000)
In [2]: %timeit s.nunique()
1000 loops, best of 3: 1.43 ms per loop

After:

In [1]: s=pandas.Series(range(100)*1000)
In [4]: %timeit s.nunique()
1000 loops, best of 3: 440 µs per loop

@gtnx gtnx closed this Jan 27, 2015
@gtnx gtnx reopened this Jan 27, 2015
@jorisvandenbossche jorisvandenbossche added the Performance Memory or execution speed performance label Jan 27, 2015
@jorisvandenbossche jorisvandenbossche added this to the 0.16.0 milestone Jan 27, 2015
@jorisvandenbossche
Member

Can you also time this for an example with NaNs?

@@ -440,7 +440,9 @@ def nunique(self, dropna=True):
         -------
         nunique : int
         """
-        return len(self.value_counts(dropna=dropna))
+        if dropna:
+            return len(set(self.unique()) - {None})

This is probably failing because we usually represent NA in pandas as np.nan, not None.
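To illustrate the point, here is a minimal standard-library sketch. The list `uniques` is a hypothetical stand-in for the float array that `unique()` would return for a numeric Series containing a missing value; it is not taken from the PR. Subtracting `{None}` leaves the NaN entry in place, so the count comes out one too high, whereas an explicit NaN check drops it.

```python
import math

# Hypothetical stand-in for the values returned by unique() on a numeric
# Series with a missing value: the missing entry is a float NaN, not None.
nan = float("nan")
uniques = [0.0, 1.0, 2.0, nan]

# Approach from the diff: remove None from the set of unique values.
# NaN is not None, so it survives the set difference and is still counted.
count_minus_none = len(set(uniques) - {None})

# Alternative: drop NaN explicitly before counting.
count_without_nan = sum(
    1 for v in uniques if not (isinstance(v, float) and math.isnan(v))
)

print(count_minus_none, count_without_nan)
```

In pandas itself the equivalent check would normally go through its own missing-value helpers rather than `math.isnan`.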

@shoyer
Member

shoyer commented Jan 28, 2015

It would be nice to also formalize this timing with a new vbench.

@gtnx
Author

gtnx commented Jan 28, 2015

Here are the examples with NaNs:

In [8]: import pandas

In [9]: s=pandas.Series(range(100)*1000)

In [10]: dropna=True

In [11]: %timeit len(s.value_counts(dropna=dropna))
1000 loops, best of 3: 1.6 ms per loop

In [12]: %timeit s.nunique(dropna)
1000 loops, best of 3: 450 µs per loop

In [13]: dropna=False

In [14]: %timeit len(s.value_counts(dropna=dropna))
1000 loops, best of 3: 1.61 ms per loop

In [15]: %timeit s.nunique(dropna)
1000 loops, best of 3: 441 µs per loop

In [16]: s=pandas.Series((range(100)+[None])*1000)

In [17]: dropna=True

In [18]: %timeit len(s.value_counts(dropna=dropna))
100 loops, best of 3: 8.61 ms per loop

In [19]: %timeit s.nunique(dropna)
1000 loops, best of 3: 1.22 ms per loop

In [20]: dropna=False

In [21]: %timeit len(s.value_counts(dropna=dropna))
100 loops, best of 3: 9.24 ms per loop

In [22]: %timeit s.nunique(dropna)
1000 loops, best of 3: 1.2 ms per loop
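For readers who want to repeat this comparison outside IPython, the following is a rough sketch rather than the code from this PR. It ports the setup to Python 3 (where `range(100)*1000` no longer works) and uses the standard `timeit` module. The helper names `count_via_value_counts` and `count_via_unique` are made up for illustration, and the NaN handling via `pd.notna` is an assumption about one reasonable way to honour `dropna`, not necessarily what pandas does internally.

```python
import timeit

import pandas as pd

# Python 3 port of the setup: range() must be turned into a list before
# it can be repeated or concatenated.
s_plain = pd.Series(list(range(100)) * 1000)
s_nan = pd.Series((list(range(100)) + [None]) * 1000)  # None becomes NaN


def count_via_value_counts(s, dropna=True):
    # Original approach: build the full frequency table, then take its length.
    return len(s.value_counts(dropna=dropna))


def count_via_unique(s, dropna=True):
    # Proposed approach: compute only the distinct values.
    uniques = s.unique()
    if dropna:
        # Count the distinct values that are not missing.
        return int(pd.notna(uniques).sum())
    return len(uniques)


if __name__ == "__main__":
    for name, s in [("plain", s_plain), ("with NaN", s_nan)]:
        for dropna in (True, False):
            t_vc = timeit.timeit(lambda: count_via_value_counts(s, dropna), number=10)
            t_un = timeit.timeit(lambda: count_via_unique(s, dropna), number=10)
            print(f"{name}, dropna={dropna}: value_counts {t_vc:.4f}s, unique {t_un:.4f}s")
```

Absolute timings will differ from the 2015 numbers above, since both pandas and the hardware have changed; only the relative gap between the two approaches is of interest.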

@gtnx
Author

gtnx commented Jan 28, 2015

Could you give me an example for the vbench? Should it be in the pandas project or in the vbench project?

@gtnx
Author

gtnx commented Jan 28, 2015

I had not rebased onto the latest master, and this change was already made in ff124f9.
Closing this.

@gtnx gtnx closed this Jan 28, 2015
@jreback
Contributor

jreback commented Jan 28, 2015

@gtnx sorry about that, I forgot this was already fixed (and the issue was orphaned).
