
nunique is slower than len(set(x.dropna())) for smaller Series. #7771

Closed
lexual opened this issue Jul 17, 2014 · 17 comments
Labels: Dtype Conversions (unexpected or buggy dtype conversions), Numeric Operations (arithmetic, comparison, and logical operations), Performance (memory or execution speed), Regression (functionality that worked in a prior pandas version)
lexual (Contributor) commented Jul 17, 2014

This is biting me when applying nunique on some groupby operations.

In one benchmark, which I'll shortly share, nunique() is slower until we get to a length of 3000.

lexual (Contributor, Author) commented Jul 17, 2014

See a benchmark here:

https://gist.github.com/lexual/118084593a98aa472289
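The gist itself is external; a minimal sketch of the kind of timing comparison it runs (exact numbers are machine- and version-dependent, so no expected winner is asserted here) could look like this:

```python
import timeit

import numpy as np
import pandas as pd

# A "smaller" Series, in the size range where the gist reports nunique() losing.
s = pd.Series(np.random.randint(0, 50, size=1000))

# Time both approaches over many repetitions.
t_nunique = timeit.timeit(lambda: s.nunique(), number=1000)
t_set = timeit.timeit(lambda: len(set(s.dropna())), number=1000)

print(f"s.nunique():          {t_nunique:.4f}s")
print(f"len(set(s.dropna())): {t_set:.4f}s")
```

Both expressions agree on the count for a Series without missing values; only their speed differs.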

lexual (Contributor, Author) commented Jul 17, 2014

So I'm wondering if we can use len(set(x.dropna())) when the Series length is below a certain threshold.

e.g. pseudocode:

    def nunique(s):
        threshold = 3000
        if len(s) < threshold:
            return len(set(s.dropna()))
        else:
            return current_nunique_implementation(s)

jreback (Contributor) commented Jul 17, 2014

Actually, this looks to have changed recently.

Time this with len(unique()).

It was replaced by len(value_counts()), IIRC because that handles all dtypes properly (this is in core/base.py).
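For context on why the two differ, len(s.unique()) and len(s.value_counts()) disagree when missing values are present: value_counts drops NaN by default, while unique keeps it as a distinct value.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, None])

print(len(s.unique()))        # 3: NaN is kept as a distinct value
print(len(s.value_counts()))  # 2: NaN is dropped by default
```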

lexual (Contributor, Author) commented Jul 17, 2014

Yes, len(s.unique()) looks faster across the board. Is that what nunique() is now using?

jreback (Contributor) commented Jul 17, 2014

I think it was; it was changed somewhat recently.

cc @sinhrks?

lexual (Contributor, Author) commented Jul 17, 2014

jreback (Contributor) commented Jul 17, 2014

Do you want me to do a PR with a vbench and a change to using len(unique())? (We'd also need to handle the dropna argument.) FYI this is used by both Series and Index, and Index doesn't have dropna (that may be why value_counts is used).

jreback (Contributor) commented Jul 17, 2014

yep

lexual (Contributor, Author) commented Jul 17, 2014

Updated my gist benchmark thing:

https://gist.github.com/lexual/118084593a98aa472289

sinhrks (Member) commented Jul 17, 2014

nunique had used value_counts prior to #6734, so I'm not sure of the actual background. One reason may be that unique and value_counts handle NaT differently (fixed in #7424).

I agree len(series.unique()) looks better.

jreback (Contributor) commented Jul 17, 2014

@sinhrks yep, the NaT treatment might be slightly different (so maybe we need to fix that!).

@lexual since you brought this up, want to do a PR to revert back to using unique, and see if anything pops up?

@jreback jreback added this to the 0.15.0 milestone Jul 17, 2014
lexual (Contributor, Author) commented Jul 18, 2014

It's not completely straightforward, due to needing to handle dropna.

I think the best approach is probably to add a dropna parameter to unique(). Then nunique becomes:

    return len(self.unique(dropna=dropna))

Adding dropna to unique isn't completely trivial, and I'm not sure of the best way to approach that.

I did initially try:

    len(self.unique().dropna())

but that errored, I think from a test case that uses an Index instead of a Series. Apparently Index doesn't have a dropna method.
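A standalone sketch of the proposed shape, written as plain functions rather than methods (unique_with_dropna is a hypothetical name for illustration, not a pandas API):

```python
import pandas as pd

def unique_with_dropna(s, dropna=False):
    # Hypothetical helper: unique values of a Series, optionally with
    # missing values (NaN/NaT/None) removed via a boolean mask.
    vals = s.unique()
    if dropna:
        vals = vals[~pd.isna(vals)]
    return vals

def nunique(s, dropna=True):
    # nunique then reduces to a length check on the unique values.
    return len(unique_with_dropna(s, dropna=dropna))
```

Masking the result of unique() avoids needing a dropna method on the container itself, which sidesteps the Index problem above.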

jreback (Contributor) commented Jul 18, 2014

Yes, I would pass dropna to unique, which can then simply drop NA's if indicated and then use unique1d, which handles NA's. Not sure why this was changed to use value_counts, as the unique machinery handles NA's properly.

lexual (Contributor, Author) commented Jul 18, 2014

I'm confused: unique doesn't currently accept a dropna parameter, and neither does unique1d. And I don't know much about what's going on in hashtable.

jreback (Contributor) commented Jul 18, 2014

You can add the parameter to unique (dropna=False); then nunique becomes a pass-through.

unique would then start like:

    def unique(self, dropna=False):
        if dropna:
            return self.value_counts(dropna=dropna).index

        # the original unique

I think that will work. unique can be called on an Index or a Series, and since Index doesn't have dropna you have to use value_counts, which supports it.

I think unique1d is much faster than value_counts, so it's worth using when you don't care about dropna.
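For reference, this is essentially the behaviour that later pandas releases ship: Series.nunique accepts a dropna argument, defaulting to True.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 2.0, None])

print(s.nunique())              # 2: NaN excluded by default
print(s.nunique(dropna=False))  # 3: NaN counted as a distinct value
```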

lexual added a commit to lexual/pandas that referenced this issue Aug 3, 2014
@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 9, 2014
jreback (Contributor) commented Sep 9, 2014

@lexual how's this coming?

jreback (Contributor) commented Dec 23, 2014

Closed by #9134.

@lexual closing, but if you have a further perf improvement, please do a pull request!

@jreback jreback closed this as completed Dec 23, 2014
qwhelan pushed a commit to qwhelan/pandas that referenced this issue Jul 28, 2015