PERF: nunique perf improved by using len(unique) rather than value_counts #9364

gtnx · 2015-01-27T21:47:57Z

Before:

In [1]: s=pandas.Series(range(100)*1000)
In [2]: %timeit s.nunique()
1000 loops, best of 3: 1.43 ms per loop

After:

In [1]: s=pandas.Series(range(100)*1000)
In [4]: %timeit s.nunique()
1000 loops, best of 3: 440 µs per loop

Fixes GH9354

jorisvandenbossche · 2015-01-27T23:16:35Z

Can you also time this for an example with NaNs?

shoyer · 2015-01-28T03:51:02Z

pandas/core/base.py

@@ -440,7 +440,9 @@ def nunique(self, dropna=True):
        -------
        nunique : int
        """
-        return len(self.value_counts(dropna=dropna))
+        if dropna:
+            return len(set(self.unique()) - {None})


This is probably failing because we usually represent NA in pandas as np.nan, not None.
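
To illustrate the point above, here is a minimal sketch (not the code that was merged) showing why subtracting `{None}` leaves a NaN in the result, next to a NaN-aware alternative. It assumes Python 3 and a recent pandas; the variable names `naive` and `nan_aware` are only illustrative.

```python
import numpy as np
import pandas as pd

# A float Series holding one missing value, stored as np.nan (not None).
s = pd.Series([1.0, 2.0, np.nan, 2.0])
uniques = s.unique()  # array containing 1.0, 2.0 and nan

# Approach from the diff: NaN is not None, so the set difference removes nothing.
naive = len(set(uniques) - {None})

# NaN-aware alternative: mask out missing values before counting.
nan_aware = len(uniques[~pd.isnull(uniques)])

print(naive, nan_aware, s.nunique())
```

With this input the set-based count still includes the NaN, while the masked count agrees with `s.nunique()`.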

shoyer · 2015-01-28T03:51:42Z

It would be nice to also formalize this timing with a new vbench.

gtnx · 2015-01-28T06:01:56Z

Here are the examples with NaNs:

In [8]: import pandas

In [9]: s=pandas.Series(range(100)*1000)

In [10]: dropna=True

In [11]: %timeit len(s.value_counts(dropna=dropna))
1000 loops, best of 3: 1.6 ms per loop

In [12]: %timeit s.nunique(dropna)
1000 loops, best of 3: 450 µs per loop

In [13]: dropna=False

In [14]: %timeit len(s.value_counts(dropna=dropna))
1000 loops, best of 3: 1.61 ms per loop

In [15]: %timeit s.nunique(dropna)
1000 loops, best of 3: 441 µs per loop

In [16]: s=pandas.Series((range(100)+[None])*1000)

In [17]: dropna=True

In [18]: %timeit len(s.value_counts(dropna=dropna))
100 loops, best of 3: 8.61 ms per loop

In [19]: %timeit s.nunique(dropna)
1000 loops, best of 3: 1.22 ms per loop

In [20]: dropna=False

In [21]: %timeit len(s.value_counts(dropna=dropna))
100 loops, best of 3: 9.24 ms per loop

In [22]: %timeit s.nunique(dropna)
1000 loops, best of 3: 1.2 ms per loop
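
The session above is Python 2 (`range(100)*1000` does not work on Python 3). A rough way to repeat the comparison on Python 3 is sketched below. The helpers `via_value_counts` and `via_unique` are hypothetical names for the two strategies, not functions from pandas, and the timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Same shape of data as In [16]: 100 distinct ints plus a missing value, repeated.
s = pd.Series((list(range(100)) + [None]) * 1000)


def via_value_counts(series, dropna=True):
    """Count distinct values the way nunique did before this PR."""
    return len(series.value_counts(dropna=dropna))


def via_unique(series, dropna=True):
    """Count distinct values via unique(), masking missing values if asked."""
    uniques = series.unique()
    if dropna:
        uniques = uniques[~pd.isnull(uniques)]
    return len(uniques)


for dropna in (True, False):
    # Both strategies must agree before the timings mean anything.
    assert via_value_counts(s, dropna) == via_unique(s, dropna)
    t_vc = timeit.timeit(lambda: via_value_counts(s, dropna), number=20)
    t_un = timeit.timeit(lambda: via_unique(s, dropna), number=20)
    print("dropna=%s value_counts=%.4fs unique=%.4fs" % (dropna, t_vc, t_un))
```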

gtnx · 2015-01-28T06:11:30Z

Could you give me an example for the vbench? Should it be in the pandas project or in the vbench project?

gtnx · 2015-01-28T06:25:53Z

I had not rebased onto the latest master, and this had already been done in ff124f9.
Closing this.

jreback · 2015-01-28T11:08:35Z

@gtnx sorry about that, I forgot this was already fixed (and the issue was left orphaned).

gtnx added 3 commits January 27, 2015 22:34

Use unique for nunique method

9ded43b

PERF: Use unique for nunique method

f934481

Fixes GH9354

Merge branch '9354' of github.com:gtnx/pandas into 9354

f84e5d4

gtnx closed this Jan 27, 2015

gtnx reopened this Jan 27, 2015

gtnx mentioned this pull request Jan 27, 2015

PERF: nunique perf can be improved by using len(unique) rather than value_counts #9354

Closed

jorisvandenbossche added the Performance (Memory or execution speed performance) label Jan 27, 2015

jorisvandenbossche added this to the 0.16.0 milestone Jan 27, 2015

shoyer reviewed Jan 28, 2015

BUG: Handling neatly Nan in Series.nunique(GH9354)

1c0d703

gtnx closed this Jan 28, 2015
