Performance improvements for nunique method. #7784

lexual · 2014-07-18T13:09:30Z

No description provided.

jreback · 2014-07-18T13:11:38Z

I would add the actual argument dropna to unique

lexual · 2014-07-18T13:14:47Z

@jreback I must be doing something idiotic, but when I add a dropna keyword argument to the unique method in base.py, and a call to unique(dropna=dropna) in nunique, I get an error like the one below, which I don't understand:

TypeError: unique() got an unexpected keyword argument 'dropna'

jreback · 2014-07-18T13:22:35Z

There is probably a test calling core/categorical/unique, which needs this argument too.

lexual · 2014-07-18T13:29:13Z

So a dropna parameter on Categorical's unique method doesn't make sense, as it appears that a Categorical object can't have np.nan as one of its levels.

jreback · 2014-07-18T13:30:18Z

Just put **kwargs.

It just needs to be able to accept the argument; it doesn't have to do anything with it.
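
A minimal sketch of the suggestion above, using hypothetical stand-in classes (`BaseOps`, `CategoricalLike`) rather than the real pandas ones. It assumes the shared `nunique` forwards `dropna` to `unique`, so any override of `unique` has to tolerate that keyword or the call fails with the `TypeError` quoted earlier.

```python
import math


def _is_nan(value):
    """True only for float NaN; other types are never treated as missing here."""
    return isinstance(value, float) and math.isnan(value)


class BaseOps:
    """Hypothetical stand-in for the shared base class (not the pandas one)."""

    def __init__(self, values):
        self.values = list(values)

    def unique(self, dropna=False):
        result = []
        seen_nan = False
        for value in self.values:
            if _is_nan(value):
                # NaN != NaN, so track it with a flag instead of a lookup
                if not dropna and not seen_nan:
                    result.append(value)
                seen_nan = True
            elif value not in result:
                result.append(value)
        return result

    def nunique(self, dropna=True):
        # Every override of unique() must accept dropna for this call to work
        return len(self.unique(dropna=dropna))


class CategoricalLike(BaseOps):
    """Hypothetical subclass whose distinct values are stored up front."""

    def __init__(self, levels):
        super().__init__(levels)
        self.levels = list(levels)

    def unique(self, **kwargs):
        # dropna is accepted but ignored: levels never hold NaN in this sketch
        return list(self.levels)
```

With a bare `def unique(self)` on the subclass, `CategoricalLike([...]).nunique()` would raise the unexpected-keyword error; with `**kwargs` it returns the number of levels.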

lexual · 2014-07-18T13:32:52Z

I think DatetimeIndex needs it too.

jreback · 2014-07-18T13:34:51Z

DatetimeIndex calls the one in base.

lexual · 2014-07-18T13:57:35Z

@jreback Got an implementation of dropna method for you to have a look at ;)

jreback · 2014-07-18T14:04:23Z

pandas/core/base.py

-            return values.unique()
-
-        return unique1d(values)
+        if dropna:


I bet it hits the AttributeError every time. values are NumPy arrays here, so you have to call self.dropna().

lexual · 2014-07-18T14:13:17Z

I can confirm that test_base.py executes all 4 paths (return statements) in that code, if that answers your query?

lexual · 2014-07-18T14:15:16Z

Never mind, I think I understand now ;(

jreback · 2014-07-18T14:15:18Z

It just seems complicated. Can you trace the paths? The ONLY path that has a unique for .values is Categorical. Maybe enumerate the paths and test each for speed. As a side issue, you could add a dropna method to Index as well (in core/base). It's actually pretty easy.

sinhrks · 2014-07-18T14:18:56Z

pandas/tseries/index.py

        Returns
        -------
        result : DatetimeIndex
        """
-        result = Int64Index.unique(self)
+        result = Int64Index.unique(self, dropna=dropna)


NaT will not be excluded by dropna here, whereas value_counts excludes NaT.

lexual · 2014-07-18T14:29:03Z

What use case is this line supporting? When would it get executed?

https://github.com/pydata/pandas/blob/master/pandas/core/base.py#L291

if hasattr(values,'unique'):
    return values.unique()

jreback · 2014-07-18T14:30:35Z

@lexual that causes a dispatch for Categorical as self.values are actually a Categorical (and NOT an ndarray).

lexual · 2014-07-18T14:36:53Z

Isn't a call of the unique method on a Categorical going to call this code, and not run the above code at all?

https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L879

def unique(self):
    return self.levels

jreback · 2014-07-18T14:42:36Z

No, it's a Series that's going to call this, which has a Categorical as its values.

(Sparse types also have this feature, where the .values is NOT an ndarray).

In [1]:  Series(pd.Categorical([1,2,3,4]))
Out[1]: 
0    1
1    2
2    3
3    4
dtype: category
Levels (4, int64): [1 < 2 < 3 < 4]

In [2]:  Series(pd.Categorical([1,2,3,4])).unique()
Out[2]: Int64Index([1, 2, 3, 4], dtype='int64')

In [3]:  Series(pd.Categorical([1,2,3,4])).values
Out[3]: 
 1
 2
 3
 4
Levels (4, int64): [1 < 2 < 3 < 4]

In [4]: type(Series(pd.Categorical([1,2,3,4])).values)
Out[4]: pandas.core.categorical.Categorical
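
The dispatch described above can be sketched with hypothetical stand-in classes (`ArrayLike`, `CategoricalLike`, `SeriesLike`); none of these are the pandas implementations. The assumption is that the outer object asks its `.values` for `unique` when the values object provides one, and otherwise falls back to a generic de-duplication.

```python
class ArrayLike:
    """Hypothetical plain container with no unique() method, like an ndarray."""

    def __init__(self, data):
        self.data = list(data)


class CategoricalLike:
    """Hypothetical container that already knows its distinct levels."""

    def __init__(self, data):
        self.data = list(data)
        self.levels = sorted(set(self.data))

    def unique(self):
        return self.levels


class SeriesLike:
    """Hypothetical wrapper whose .values may be either container."""

    def __init__(self, values):
        self.values = values

    def unique(self):
        values = self.values
        # Look before you leap: dispatch only if the values object can answer
        if hasattr(values, "unique"):
            return values.unique()
        # Generic fallback: order-preserving de-duplication
        return list(dict.fromkeys(values.data))
```

Calling `unique` on the `CategoricalLike` directly never reaches `SeriesLike.unique`; the `hasattr` branch matters only when the call starts on the wrapper.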

lexual · 2014-07-18T14:58:53Z

@sinhrks I'm sorry I don't follow your comment.

With this patch, DatetimeIndex's unique appears to work correctly:

i = pd.DatetimeIndex(['2014-01-01', pd.NaT])
assert len(i.unique(dropna=True)) == 1
assert len(i.unique(dropna=False)) == 2

sinhrks · 2014-07-18T15:24:50Z

@lexual Thanks for confirming. I had misunderstood.

lexual · 2014-07-18T15:51:42Z

@jreback OK, I have reverted to checking hasattr instead of try/except (I was thinking of the whole "better to ask forgiveness than permission" thing), but this probably makes the code a little clearer.

Yes all 4 paths do get exercised.

I would definitely say adding dropna to index would be a good idea. It would speed up that path a lot, as it's currently very slow. And it would mean we could get rid of the 2nd path, as it would be handled by the first one.

Very late here, so I'm offline until tomorrow.

jreback · 2014-07-18T15:59:06Z

Why don't you add a dropna as well then (there is an issue outstanding about it somewhere)

but let's do this in a separate issue

The index ops have a hasnans property now, so this should be straightforward.


lexual · 2014-07-19T00:13:05Z

OK, now have dropna on Index, and this being used by nunique.
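
A rough pure-Python illustration of why counting distinct values directly can be cheaper than tallying every value first. This is a sketch under stated assumptions, not the pandas code: the function names, the `_MISSING` sentinel, and the treatment of `None` and float NaN as missing are all choices made for this example.

```python
import math
from collections import Counter

# Sentinel so that all missing values collapse into one bucket (assumption)
_MISSING = object()


def _normalize(value):
    """Map None and float NaN to a single sentinel; pass other values through."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return _MISSING
    return value


def nunique_via_counts(values, dropna=True):
    """Tally every value, then count the keys (does extra counting work)."""
    counts = Counter(_normalize(v) for v in values)
    if dropna:
        counts.pop(_MISSING, None)
    return len(counts)


def nunique_via_unique(values, dropna=True):
    """Collect the distinct values only, drop missing, and take the length."""
    distinct = {_normalize(v) for v in values}
    if dropna:
        distinct.discard(_MISSING)
    return len(distinct)
```

Both functions return the same answer; the second simply skips maintaining per-value counts that `nunique` never uses.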


jreback · 2014-07-19T14:04:11Z

@lexual going to need a couple of vbenches for this. I think there are existing ones for unique, so add them near there. You can do a shorter one (say 3000 elements) and a longer one (100k elements). Try to create them so they are somewhat stable run-to-run (e.g. don't create them randomly, just string together sub-sequences).
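
The shape of such a benchmark can be sketched with the standard library's `timeit` as a stand-in for vbench. The helper names (`make_values`, `bench`), the sub-sequence length of 100, and the use of `len(set(...))` as the timed statement are assumptions for illustration only.

```python
import timeit


def make_values(n, period=100):
    """Deterministic data: repeat the sub-sequence 0..period-1 up to n items."""
    base = list(range(period))
    repeats, remainder = divmod(n, period)
    return base * repeats + base[:remainder]


# One shorter and one longer input, as suggested above
small = make_values(3000)
large = make_values(100_000)


def bench(values, number=20):
    """Best of three timings for counting the distinct values."""
    timings = timeit.repeat(lambda: len(set(values)), number=number, repeat=3)
    return min(timings)
```

Because the inputs are built without any random source, every run times exactly the same data, which keeps results comparable between commits.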

jreback · 2014-07-19T14:05:18Z

@lexual also going to need some tests for dropna for DatetimeIndex/PeriodIndex. Note that these already have some machinery in place, so you may need to redefine it in (core/base/DatetimeOpsMixin); see hasnans as well.

lexual · 2014-07-21T08:22:35Z

@jreback It might be a little while before I can find time to make these extra amendments:

Never used vbench before, so need to figure how to install/run/write before I even get started.
Crazy busy for probably next week at least.

Cheers,

jreback · 2014-07-21T11:46:58Z

@lexual no problem. Ok let's leave this in place.

@sinhrks do you want to take on adding dropna to base (tests and implementation)?


lexual · 2014-08-03T01:03:42Z

@jreback Added a vbench for nunique. It probably needs someone to look at it, as I'm new to this tool.

jreback · 2014-08-03T01:38:15Z

pandas/core/index.py

+        -------
+        dropped : Index
+        """
+        return self[~isnull(self.values)]


should just be self[~isnull(self)]

tests fail with that change:

raise NotImplementedError("isnull is not defined for MultiIndex")

hmm ok

Going to need a dropna test for each index type (except Int64, of course).
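
A minimal sketch of a mask-based `dropna`, using a hypothetical `IndexLike` class rather than the pandas `Index`. It mirrors the idea in the diff above (keep the entries where the null mask is false) and assumes a `hasnans`-style check can skip the work when nothing is missing.

```python
import math


def _is_null(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


class IndexLike:
    """Hypothetical stand-in for an index holding a flat list of values."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def hasnans(self):
        return any(_is_null(v) for v in self.values)

    def dropna(self):
        # Fast path: nothing missing, so return the same object untouched
        if not self.hasnans:
            return self
        # Build the null mask, then keep only positions where it is False
        mask = [_is_null(v) for v in self.values]
        kept = [v for v, null in zip(self.values, mask) if not null]
        return IndexLike(kept)
```

A per-type test could then build one index with a missing entry and one without, and check both the dropped result and the fast path.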

jreback · 2014-08-04T16:10:28Z

Can you post your vbench results when you have them?

jreback · 2014-08-10T14:56:56Z

@lexual can you rebase? How's this coming?

lexual · 2014-08-11T07:52:30Z

Need to figure out how to run vbench, and how to share results.
Also figure out how to get tests written for DatetimeIndex/PeriodIndex.

jreback · 2015-01-25T23:34:09Z

closing as stale, but issue is now at #9354


jreback reviewed Jul 18, 2014

sinhrks reviewed Jul 18, 2014

lexual added a commit to lexual/pandas that referenced this pull request Jul 19, 2014

dropna method added to Index.

13f1022

ref pandas-dev#6194 ref pandas-dev#7784

lexual mentioned this pull request Jul 19, 2014

dropna method added to Index. #7799

Closed

lexual added a commit to lexual/pandas that referenced this pull request Jul 19, 2014

dropna method added to Index.

e31a057

ref pandas-dev#6194 ref pandas-dev#7784

jreback added Performance labels Jul 19, 2014

jreback added this to the 0.15.0 milestone Jul 21, 2014

lexual added 3 commits August 3, 2014 10:57

dropna method added to Index.

8856b24

ref pandas-dev#6194 ref pandas-dev#7784

dropna added for unique method. Performance improvements for nunique …

0e17681

…method. ref pandas-dev#7771

vbench for nunique

1abfc53

jreback reviewed Aug 3, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 14, 2014

jreback mentioned this pull request Jan 25, 2015

PERF: nunique perf can be improved by using len(unique) rather than value_counts #9354

Closed

jreback closed this Jan 25, 2015

qwhelan pushed a commit to qwhelan/pandas that referenced this pull request Jul 28, 2015

dropna method added to Index.

7ea39cc

ref pandas-dev#6194 ref pandas-dev#7784
