PERF: use unique and isnull in nunique instead of value_counts. #9134

Merged: 1 commit merged into pandas-dev:master on Dec 23, 2014

Conversation

@unutbu (Contributor) commented Dec 22, 2014

closes #9129

Currently, Series.nunique calls Series.value_counts, which by default sorts the values.

Counting unique values certainly doesn't require sorting, so we could fix this
by passing sort=False to value_counts.

But nunique can also be calculated by calling Series.unique instead of
value_counts, and using com.isnull to handle the dropna parameter.
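
For illustration, here is a minimal sketch of that approach (not the exact code in this PR; the public pd.isnull stands in for com.isnull):

import numpy as np
import pandas as pd

def nunique_sketch(s, dropna=True):
    # Collect the unique values without sorting (Series.unique does not sort).
    uniqs = s.unique()
    if dropna:
        # Mask out NaN/NaT entries before counting.
        uniqs = uniqs[~pd.isnull(uniqs)]
    return len(uniqs)

s = pd.Series([1.0, 2.0, 2.0, np.nan])
print(nunique_sketch(s))                # 2
print(nunique_sketch(s, dropna=False))  # 3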

This PR attempts to implement this. Here is a vbench perf test which seems to show an improvement for tests using nunique.

/usr/bin/time -v ./test_perf.sh -b master -t nunique-unique -r groupby

Here are the best and worst ratios:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_ngroups_10000_nunique                | 900.9844 | 3768.5087 |   0.2391 |
groupby_ngroups_100_nunique                  |   9.6850 |  38.9043 |   0.2489 |
groupby_ngroups_100_sem                      |   0.7834 |   1.4904 |   0.5256 |
groupby_ngroups_100_size                     |   0.4920 |   0.8560 |   0.5748 |
groupby_ngroups_10000_max                    |   2.5337 |   4.1683 |   0.6078 |
groupby_frame_nth_none                       |   2.3570 |   3.2880 |   0.7168 |
groupby_ngroups_10000_var                    |   2.4277 |   3.3390 |   0.7271 |
groupby_ngroups_10000_sem                    |   3.5107 |   4.6523 |   0.7546 |
...
groupby_transform_multi_key3                 | 630.2720 | 599.7047 |   1.0510 |
groupby_nth_datetimes_none                   | 424.5253 | 403.6326 |   1.0518 |
groupby_transform_series2                    | 109.7130 | 104.1580 |   1.0533 |
groupby_transform_multi_key1                 |  58.9686 |  55.7120 |   1.0585 |
groupby_nth_datetimes_any                    | 1206.7020 | 1132.4236 |   1.0656 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

While working on this PR I ran into issue GH 9129.
This PR includes Jeff's suggested fix.

@jreback (Contributor) commented Dec 22, 2014

this will close #7771 ?

jreback added the Performance and Groupby labels Dec 22, 2014
jreback added this to the 0.16.0 milestone Dec 22, 2014
@jreback (Contributor) commented Dec 23, 2014

@unutbu looks good. Pls add a release note for the closing of #9129 (and #7771, I think).

@unutbu (Contributor, Author) commented Dec 23, 2014

@jreback: This PR closes GH 7771 insofar as they both use len(s.unique()), but they differ in the
way they try to implement dropna.

@lexual's benchmark run on this branch is about 4.5x faster than the benchmark
run on master. But this branch is 2x slower than calling len(s.unique())
alone. The difference, of course, is due to the call to isnull.
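
For a rough sense of how such a comparison might be run (the data here is hypothetical; @lexual's actual benchmark input is not reproduced):

import numpy as np
import pandas as pd
from timeit import timeit

# Hypothetical input: one million floats with some NaNs sprinkled in.
s = pd.Series(np.random.randint(0, 1000, size=10**6).astype(float))
s.iloc[::100] = np.nan

print(timeit(lambda: s.nunique(), number=20))      # unique + isnull (this branch)
print(timeit(lambda: len(s.unique()), number=20))  # no NaN handling at all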

There are undoubtedly better ways to handle dropna. Ideally, NumPy would have a
nanunique method for ndarrays which ignores NaNs, and nanops.unique1d would have a dropna parameter.
But I don't really know how to implement that.

@jreback (Contributor) commented Dec 23, 2014

well, numpy is not changing very fast, so I would not hold your breath :)
(and it wouldn't make much difference anyhow; the null routines are all in Cython, so they're pretty fast, and while I am sure straight C would be faster, the many special cases and the dtype handling would impact that just the same as in pandas).

ok will leave the other branch open

@jreback (Contributor) commented Dec 23, 2014

@unutbu pls add a release note for the 2 fixes and good 2 go

@lexual (Contributor) commented Dec 23, 2014

Apologies, I've never had time to follow up on my initial work.

Hopefully I'll get a chance over Xmas to look over @unutbu's changes and compare performance.

Would be great to get a fix in the next release, as I've been monkey patching my code to work around this slowness ;)

pd.Series.nunique = lambda self: len(self.dropna().unique())

@jreback (Contributor) commented Dec 23, 2014

cc @lexual thanks!

PERF: use unique and isnull in nunique instead of value_counts.
@unutbu (Contributor, Author) commented Dec 23, 2014

@jreback: good to go?

jreback added a commit that referenced this pull request Dec 23, 2014
PERF: use unique and isnull in nunique instead of value_counts.
jreback merged commit eb77d1d into pandas-dev:master Dec 23, 2014
@jreback (Contributor) commented Dec 23, 2014

@unutbu thanks!

Labels: Groupby, Performance (Memory or execution speed performance)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
isnull not detecting NaT in PeriodIndex
3 participants