
nunique is slower than len(set(x.dropna())) for smaller Series. #7771

Closed
lexual opened this issue Jul 17, 2014 · 17 comments
Labels: Dtype Conversions (unexpected or buggy dtype conversions), Numeric Operations (arithmetic, comparison, and logical operations), Performance (memory or execution speed), Regression (functionality that worked in a prior pandas version)
lexual (Contributor) commented Jul 17, 2014

This is biting me when applying nunique on some groupby operations.

In one benchmark, which I'll shortly share, nunique() is slower until we get to a length of 3000.

lexual (Contributor, Author) commented Jul 17, 2014

See a benchmark here:

https://gist.github.com/lexual/118084593a98aa472289
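The gist itself is external; a minimal sketch of the kind of timing comparison it runs (exact numbers are machine- and version-dependent, so no expected winner is asserted here) could look like this:

```python
import timeit

import numpy as np
import pandas as pd

# A "smaller" Series, in the size range where the gist reports nunique() losing.
s = pd.Series(np.random.randint(0, 50, size=1000))

# Time both approaches over many repetitions.
t_nunique = timeit.timeit(lambda: s.nunique(), number=1000)
t_set = timeit.timeit(lambda: len(set(s.dropna())), number=1000)

print(f"s.nunique():          {t_nunique:.4f}s")
print(f"len(set(s.dropna())): {t_set:.4f}s")
```

Both expressions agree on the count for a Series without missing values; only their speed differs.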

lexual (Contributor, Author) commented Jul 17, 2014

So I'm wondering if we can use len(set(x.dropna())) when the Series length is below a certain threshold.

e.g. pseudocode:

    def nunique(s):
        threshold = 3000
        if len(s) < threshold:
            return len(set(s.dropna()))
        else:
            return current_nunique_implementation(s)

jreback (Contributor) commented Jul 17, 2014

Actually, this looks to have changed recently.

Time this with len(unique()).

It was replaced by len(value_counts()), IIRC because that handles all dtypes properly (this is in core/base.py).
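For context on why the two differ, len(s.unique()) and len(s.value_counts()) disagree when missing values are present: value_counts drops NaN by default, while unique keeps it as a distinct value.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, None])

print(len(s.unique()))        # 3: NaN is kept as a distinct value
print(len(s.value_counts()))  # 2: NaN is dropped by default
```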

lexual (Contributor, Author) commented Jul 17, 2014

Yes, len(s.unique()) looks faster across the board. Is that what nunique() is now using?

jreback (Contributor) commented Jul 17, 2014

I think it was; it was changed somewhat recently.

cc @sinhrks?

lexual (Contributor, Author) commented Jul 17, 2014

jreback (Contributor) commented Jul 17, 2014

Do you want me to do a PR with a vbench and a change to using len(unique())? (We'd also need to handle the dropna argument.) FYI this is used by both Series and Index, and Index doesn't have dropna (that may be why value_counts is used).

jreback (Contributor) commented Jul 17, 2014

yep

lexual (Contributor, Author) commented Jul 17, 2014

Updated my gist benchmark thing:

https://gist.github.com/lexual/118084593a98aa472289

sinhrks (Member) commented Jul 17, 2014

nunique had used value_counts prior to #6734, so I'm not sure of the actual background. One reason may be that unique and value_counts handle NaT differently (fixed in #7424).

I agree len(series.unique()) looks better.

jreback (Contributor) commented Jul 17, 2014

@sinhrks yep, the NaT treatment might be slightly different (so maybe we need to fix that!).

@lexual since you brought this up, want to do a PR to revert back to using unique, and see if anything pops up?

@jreback jreback added this to the 0.15.0 milestone Jul 17, 2014
lexual (Contributor, Author) commented Jul 18, 2014

It's not completely straightforward, due to needing to handle dropna.

I think the best approach is probably to add a dropna parameter to unique(). Then nunique becomes:

    return len(self.unique(dropna=dropna))

Adding dropna to unique isn't completely trivial, and I'm not sure of the best way to approach that.

I did initially try:

    len(self.unique().dropna())

but that errored, I think from a test case that uses an Index instead of a Series. Apparently Index doesn't have a dropna method.
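A standalone sketch of the proposed shape, written as plain functions rather than methods (unique_with_dropna is a hypothetical name for illustration, not a pandas API):

```python
import pandas as pd

def unique_with_dropna(s, dropna=False):
    # Hypothetical helper: unique values of a Series, optionally with
    # missing values (NaN/NaT/None) removed via a boolean mask.
    vals = s.unique()
    if dropna:
        vals = vals[~pd.isna(vals)]
    return vals

def nunique(s, dropna=True):
    # nunique then reduces to a length check on the unique values.
    return len(unique_with_dropna(s, dropna=dropna))
```

Masking the result of unique() avoids needing a dropna method on the container itself, which sidesteps the Index problem above.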

jreback (Contributor) commented Jul 18, 2014

Yes, I would pass dropna to unique, which can then simply drop NA's if indicated and then use unique1d, which handles NA's. Not sure why this was changed to use value_counts, as the unique machinery handles NA's properly.

lexual (Contributor, Author) commented Jul 18, 2014

I'm confused: unique doesn't currently accept a dropna parameter, and neither does unique1d. And I don't know much about what's going on in hashtable.

jreback (Contributor) commented Jul 18, 2014

You can add the parameter to unique (dropna=False); then nunique becomes a pass-through.

unique would then start like:

    def unique(self, dropna=False):
        if dropna:
            return self.value_counts(dropna=dropna).index

        # the original unique

I think that will work. unique can be called on an Index or a Series, and since Index doesn't have dropna you have to use value_counts, which supports it.

I think unique1d is much faster than value_counts, so it's worth using when you don't care about dropna.
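For reference, this is essentially the behaviour that later pandas releases ship: Series.nunique accepts a dropna argument, defaulting to True.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, None])

print(s.nunique())              # 2: NaN excluded by default
print(s.nunique(dropna=False))  # 3: NaN counted as a distinct value
```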

lexual added a commit to lexual/pandas that referenced this issue Aug 3, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
jreback (Contributor) commented Sep 9, 2014

@lexual how's this coming?

jreback (Contributor) commented Dec 23, 2014

Closed by #9134.

@lexual closing, but if you have a further perf improvement, please do a pull request!

@jreback jreback closed this as completed Dec 23, 2014
qwhelan pushed a commit to qwhelan/pandas that referenced this issue Jul 28, 2015