-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: use unique and isnull in nunique instead of value_counts. #9134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
this will close #7771 ? |
@jreback: This PR closes GH 7771 insofar as they both use @lexual's benchmark run on this branch is about 4.5x faster than the benchmark There are undoubtedly better ways to handle |
well ok will leave the other branch open |
@unutbu pls add a release note for the 2 fixes and good 2 go |
faf3fec
to
d977501
Compare
Apologies I've never had time to follow up on my initial work. Hopefully I might get a chance over Xmas to look over @unutbu changes & compare performance. Would be great to get a fix in the next release, as I've been monkey patching my code to work around this slowness ;)
|
cc @lexual thanks! |
PERF: use unique and isnull in nunique instead of value_counts.
d977501
to
ff124f9
Compare
@jreback: good to go? |
PERF: use unique and isnull in nunique instead of value_counts.
@unutbu thanks! |
closes #9129
Currently,
Series.nunique
(source) callsSeries.value_counts
(source), which by default, sorts the values.Counting unique values certainly doesn't require sorting, so we could fix this
by passing
sort=False
tovalue_counts
.But
nunique
can also be calculated by callingSeries.unique
instead ofvalue_counts
, and usingcom.isnull
to handle thedropna
parameter.This PR attempts to implement this. Here is a vbench perf test which seems to show an improvement for tests using
nunique
.Here are the best and worst ratios:
While working on this PR I ran into issue GH 9129.
This PR includes Jeff's suggested fix.