-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Performance improvements for nunique method. #7784
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I would add the actual argument |
@jreback I must be doing something idiotic, but when I add dropna keyword-arg to unique method in base.py, and a call to unique(dropna=dropna) in nunique, I get an error like below which I don't understand:
|
their is prob a test calling |
so a dropna parameter on Categorical's unique method doesn't make sense, as it appears that a Categorical objects can't have np.nan as one of it's levels. |
just put **kwargs it just needs to be able to accept it not do anything |
I think DateTimeIndex needs it too. |
DatetimeIndex call the one in base |
@jreback Got an implementation of dropna method for you to have a look at ;) |
return values.unique() | ||
|
||
return unique1d(values) | ||
if dropna: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet it hits the AttributeError every time. values
are numpy arrays here, you have to do self.dropna()
I can confirm that the test_base.py executes all 4 paths (return statements) in that code, if that answers your query? |
Nevermind, I think I understand now ;( |
it just seems complicated. can you trace the paths? the ONLY path that has a unique for |
Returns | ||
------- | ||
result : DatetimeIndex | ||
""" | ||
result = Int64Index.unique(self) | ||
result = Int64Index.unique(self, dropna=dropna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NaT
will not be excluded by dropna
, otherwise value_counts
excludes NaT
.
What use case is this line supporting? When would it get executed? https://github.com/pydata/pandas/blob/master/pandas/core/base.py#L291
|
@lexual that causes a dispatch for |
Isn't a call of the unique method on a Categorical going to call this code, and not run the above code at all? https://github.com/pydata/pandas/blob/master/pandas/core/categorical.py#L879
|
no, its a Series that's going to call this, which has a categorical as its values. (Sparse types also have this feature, where the
|
@sinhrks I'm sorry I don't follow your comment. With this patch, DatetimeIndex's unique appears to work correctly:
|
@lexual Thanks to confirm. I've misunderstood. |
@jreback OK, have reverted to checking hasattr instead of try/except (was thinking the whole, better to ask forgiveness, than permission thing), but this probably makes the code a little clearer. Yes all 4 paths do get exercised. I would definitely say adding dropna to index would be a good idea. It would speed up that path a lot, as it's currently very slow. And it would mean we could get rid of the 2nd path, as it would be handled by the first one. Very late here, so I'm offline until tomorrow. |
why don't u add a dropna as we'll then (their is an issue outstanding about it somewhere) but let's do this in a separate issue the index ops have a hasnan property now so this should be straightforward |
OK, now have dropna on Index, and this being used by nunique. |
@lexual going to need a couple of vbenches for this. I think their exists ones for unique so add near there. You can do a shorter one (say 3000 elements and a longer one, 100k elements). try to create so they are somewhat stable run-to-run (e.g. don't create them randomly), but just string together sub-seqequences. |
@lexual also going to needs some tests for |
@jreback might be a little while before I could find time to make these extra amends:
Cheers, |
@jreback added vbench for nunique. Probably need someone to look at, new to this tool |
------- | ||
dropped : Index | ||
""" | ||
return self[~isnull(self.values)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should just be self[~isnull(self)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests fail with that change:
raise NotImplementedError("isnull is not defined for MultiIndex")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm ok
going to need a test for each index type for dropna (except Int64) of course
an you post your vbench results when have them. |
@lexual can you rebase, how's this coming? |
Need to figure out how to run vbench, and how to share results. |
closing as stale, but issue is now at #9354 |
No description provided.