Skip to content

Performance improvements for nunique method. #7784

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

lexual
Copy link
Contributor

@lexual lexual commented Jul 18, 2014

No description provided.

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

I would add the actual argument dropna to unique

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

@jreback I must be doing something idiotic, but when I add dropna keyword-arg to unique method in base.py, and a call to unique(dropna=dropna) in nunique, I get an error like below which I don't understand:

TypeError: unique() got an unexpected keyword argument 'dropna'

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

their is prob a test calling core/categorical/unique which needs this argument too

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

so a dropna parameter on Categorical's unique method doesn't make sense, as it appears that a Categorical objects can't have np.nan as one of it's levels.

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

just put **kwargs

it just needs to be able to accept it not do anything

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

I think DateTimeIndex needs it too.

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

DatetimeIndex call the one in base

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

@jreback Got an implementation of dropna method for you to have a look at ;)

return values.unique()

return unique1d(values)
if dropna:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet it hits the AttributeError every time. values are numpy arrays here, you have to do self.dropna()

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

I can confirm that the test_base.py executes all 4 paths (return statements) in that code, if that answers your query?

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

Nevermind, I think I understand now ;(

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

it just seems complicated. can you trace the paths? the ONLY path that has a unique for .values is Categorical. Maybe enumerate the paths and test for speed. A side issue, could add dropna method to Index as well (in core/base). Its actually pretty easy.

Returns
-------
result : DatetimeIndex
"""
result = Int64Index.unique(self)
result = Int64Index.unique(self, dropna=dropna)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NaT will not be excluded by dropna, otherwise value_counts excludes NaT.

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

What use case is this line supporting? When would it get executed?

https://github.com/pydata/pandas/blob/master/pandas/core/base.py#L291

if hasattr(values,'unique'):
    return values.unique()

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

@lexual that causes a dispatch for Categorical as self.values are actually a Categorical (and NOT an ndarray).

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

Isn't a call of the unique method on a Categorical going to call this code, and not run the above code at all?

https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L879

def unique(self):
    return self.levels

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

no, its a Series that's going to call this, which has a categorical as its values.

(Sparse types also have this feature, where the .values is NOT an ndarray).

In [1]:  Series(pd.Categorical([1,2,3,4]))
Out[1]: 
0    1
1    2
2    3
3    4
dtype: category
Levels (4, int64): [1 < 2 < 3 < 4]

In [2]:  Series(pd.Categorical([1,2,3,4])).unique()
Out[2]: Int64Index([1, 2, 3, 4], dtype='int64')

In [3]:  Series(pd.Categorical([1,2,3,4])).values
Out[3]: 
 1
 2
 3
 4
Levels (4, int64): [1 < 2 < 3 < 4]

In [4]: type(Series(pd.Categorical([1,2,3,4])).values)
Out[4]: pandas.core.categorical.Categorical

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

@sinhrks I'm sorry I don't follow your comment.

With this patch, DatetimeIndex's unique appears to work correctly:

i = pd.DatetimeIndex(['2014-01-01', pd.NaT])
assert len(i.unique(dropna=True)) == 1
assert len(i.unique(dropna=False)) == 2

@sinhrks
Copy link
Member

sinhrks commented Jul 18, 2014

@lexual Thanks to confirm. I've misunderstood.

@lexual
Copy link
Contributor Author

lexual commented Jul 18, 2014

@jreback OK, have reverted to checking hasattr instead of try/except (was thinking the whole, better to ask forgiveness, than permission thing), but this probably makes the code a little clearer.

Yes all 4 paths do get exercised.

I would definitely say adding dropna to index would be a good idea. It would speed up that path a lot, as it's currently very slow. And it would mean we could get rid of the 2nd path, as it would be handled by the first one.

Very late here, so I'm offline until tomorrow.

@jreback
Copy link
Contributor

jreback commented Jul 18, 2014

why don't u add a dropna as we'll then (their is an issue outstanding about it somewhere)

but let's do this in a separate issue

the index ops have a hasnan property now so this should be straightforward

lexual added a commit to lexual/pandas that referenced this pull request Jul 19, 2014
@lexual
Copy link
Contributor Author

lexual commented Jul 19, 2014

OK, now have dropna on Index, and this being used by nunique.

lexual added a commit to lexual/pandas that referenced this pull request Jul 19, 2014
@jreback
Copy link
Contributor

jreback commented Jul 19, 2014

@lexual going to need a couple of vbenches for this. I think their exists ones for unique so add near there. You can do a shorter one (say 3000 elements and a longer one, 100k elements). try to create so they are somewhat stable run-to-run (e.g. don't create them randomly), but just string together sub-seqequences.

@jreback
Copy link
Contributor

jreback commented Jul 19, 2014

@lexual also going to needs some tests for dropna for DatetimeIndex/PeriodIndex. Note that these already have some machinery in place, so you may need to redefine it in (core/base/DatetimeOpsMixin); see hasnans as well

@lexual
Copy link
Contributor Author

lexual commented Jul 21, 2014

@jreback might be a little while before I could find time to make these extra amends:

  • Never used vbench before, so need to figure how to install/run/write before I even get started.
  • Crazy busy for probably next week at least.

Cheers,

@jreback
Copy link
Contributor

jreback commented Jul 21, 2014

@lexual no problem. Ok let's leave this in place.

@sinhrks do you want to take for dropna into base? (test and impl)?

@jreback jreback added this to the 0.15.0 milestone Jul 21, 2014
@lexual
Copy link
Contributor Author

lexual commented Aug 3, 2014

@jreback added vbench for nunique. Probably need someone to look at, new to this tool

-------
dropped : Index
"""
return self[~isnull(self.values)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should just be self[~isnull(self)]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests fail with that change:

raise NotImplementedError("isnull is not defined for MultiIndex")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok

going to need a test for each index type for dropna (except Int64) of course

@jreback
Copy link
Contributor

jreback commented Aug 4, 2014

an you post your vbench results when have them.

@jreback
Copy link
Contributor

jreback commented Aug 10, 2014

@lexual can you rebase, how's this coming?

@lexual
Copy link
Contributor Author

lexual commented Aug 11, 2014

Need to figure out how to run vbench, and how to share results.
Also figure out how to get tests written for DatetimeIndex/PeriodIndex.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2015

closing as stale, but issue is now at #9354

@jreback jreback closed this Jan 25, 2015
qwhelan pushed a commit to qwhelan/pandas that referenced this pull request Jul 28, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants