PERF: use unique and isnull in nunique instead of value_counts. #9134

Merged: 1 commit merged into pandas-dev:master on Dec 23, 2014

Conversation

@unutbu (Contributor) commented Dec 22, 2014

closes #9129

Currently, Series.nunique calls Series.value_counts, which by default sorts the values.

Counting unique values certainly doesn't require sorting, so we could fix this
by passing sort=False to value_counts.

But nunique can also be calculated by calling Series.unique instead of
value_counts, and using com.isnull to handle the dropna parameter.
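
For illustration, here is a minimal sketch of that approach (not the exact code in this PR; the public pd.isnull stands in for com.isnull):

import numpy as np
import pandas as pd

def nunique_sketch(s, dropna=True):
    # Collect the unique values without sorting (Series.unique does not sort).
    uniqs = s.unique()
    if dropna:
        # Mask out NaN/NaT entries before counting.
        uniqs = uniqs[~pd.isnull(uniqs)]
    return len(uniqs)

s = pd.Series([1.0, 2.0, 2.0, np.nan])
print(nunique_sketch(s))                # 2
print(nunique_sketch(s, dropna=False))  # 3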

This PR attempts to implement this. Here is a vbench perf test which seems to show an improvement for tests using nunique.

/usr/bin/time -v ./test_perf.sh -b master -t nunique-unique -r groupby

Here are the best and worst ratios:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_ngroups_10000_nunique                | 900.9844 | 3768.5087 |   0.2391 |
groupby_ngroups_100_nunique                  |   9.6850 |  38.9043 |   0.2489 |
groupby_ngroups_100_sem                      |   0.7834 |   1.4904 |   0.5256 |
groupby_ngroups_100_size                     |   0.4920 |   0.8560 |   0.5748 |
groupby_ngroups_10000_max                    |   2.5337 |   4.1683 |   0.6078 |
groupby_frame_nth_none                       |   2.3570 |   3.2880 |   0.7168 |
groupby_ngroups_10000_var                    |   2.4277 |   3.3390 |   0.7271 |
groupby_ngroups_10000_sem                    |   3.5107 |   4.6523 |   0.7546 |
...
groupby_transform_multi_key3                 | 630.2720 | 599.7047 |   1.0510 |
groupby_nth_datetimes_none                   | 424.5253 | 403.6326 |   1.0518 |
groupby_transform_series2                    | 109.7130 | 104.1580 |   1.0533 |
groupby_transform_multi_key1                 |  58.9686 |  55.7120 |   1.0585 |
groupby_nth_datetimes_any                    | 1206.7020 | 1132.4236 |   1.0656 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

While working on this PR I ran into issue GH 9129.
This PR includes Jeff's suggested fix.

@jreback (Contributor) commented Dec 22, 2014

this will close #7771 ?

jreback added the Performance and Groupby labels Dec 22, 2014
jreback added this to the 0.16.0 milestone Dec 22, 2014
@jreback (Contributor) commented Dec 23, 2014

@unutbu looks good. Pls add a release note for the closing of #9129 (and #7771, I think).

@unutbu (Contributor, Author) commented Dec 23, 2014

@jreback: This PR closes GH 7771 insofar as they both use len(s.unique()), but they differ in the
way they try to implement dropna.

@lexual's benchmark run on this branch is about 4.5x faster than the benchmark
run on master. But this branch is 2x slower than calling len(s.unique())
alone. The difference, of course, is due to the call to isnull.
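
For a rough sense of how such a comparison might be run (the data here is hypothetical; @lexual's actual benchmark input is not reproduced):

import numpy as np
import pandas as pd
from timeit import timeit

# Hypothetical input: one million floats with some NaNs sprinkled in.
s = pd.Series(np.random.randint(0, 1000, size=10**6).astype(float))
s.iloc[::100] = np.nan

print(timeit(lambda: s.nunique(), number=20))      # unique + isnull (this branch)
print(timeit(lambda: len(s.unique()), number=20))  # no NaN handling at all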

There are undoubtedly better ways to handle dropna. Ideally, NumPy would have a
nanunique method for ndarrays which ignores NaNs, and nanops.unique1d would have a dropna parameter.
But I don't really know how to implement that.

@jreback (Contributor) commented Dec 23, 2014

well, numpy is not changing very fast, so I would not hold your breath :)
(and it wouldn't make much difference anyhow; the null routines are all in Cython, so they're pretty fast, and while I am sure straight C would be faster, the many special cases and the dtype handling would impact that just the same as in pandas).

ok will leave the other branch open

@jreback (Contributor) commented Dec 23, 2014

@unutbu pls add a release note for the 2 fixes and good 2 go

@lexual (Contributor) commented Dec 23, 2014

Apologies, I've never had time to follow up on my initial work.

Hopefully I'll get a chance over Xmas to look over @unutbu's changes and compare performance.

Would be great to get a fix in the next release, as I've been monkey patching my code to work around this slowness ;)

pd.Series.nunique = lambda self: len(self.dropna().unique())

@jreback (Contributor) commented Dec 23, 2014

cc @lexual thanks!

PERF: use unique and isnull in nunique instead of value_counts.
@unutbu (Contributor, Author) commented Dec 23, 2014

@jreback: good to go?

jreback added a commit that referenced this pull request Dec 23, 2014
PERF: use unique and isnull in nunique instead of value_counts.
jreback merged commit eb77d1d into pandas-dev:master Dec 23, 2014
@jreback (Contributor) commented Dec 23, 2014

@unutbu thanks!

Labels: Groupby, Performance (Memory or execution speed performance)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
isnull not detecting NaT in PeriodIndex
3 participants