nunique is slower than len(set(x.dropna())) for smaller Series. #7771
See a benchmark here:
So I'm wondering if we can use len(set(x.dropna())) when the Series length is below a certain threshold, e.g. in pseudocode.
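The threshold idea above could be sketched roughly like this. Note this is a sketch, not pandas code: the function name fast_nunique and the SMALL_SERIES_THRESHOLD constant are hypothetical, and the crossover value would have to come from benchmarking (around 3000 according to the gist mentioned later in this thread):

```python
import pandas as pd

# Hypothetical crossover length; the real value would be determined by benchmarks.
SMALL_SERIES_THRESHOLD = 3000

def fast_nunique(s: pd.Series) -> int:
    """Count distinct non-NA values, using a plain set for small Series."""
    if len(s) < SMALL_SERIES_THRESHOLD:
        # For small inputs, building a Python set is cheaper than the
        # machinery behind Series.nunique().
        return len(set(s.dropna()))
    return s.nunique()
```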
actually this looks changed recently. time this with len(unique()). this was replaced by len(value_counts()) IIRC because that handles all dtypes properly (this is in core/base.py)
Yes, len(s.unique()) looks faster across the board. Is that what nunique() is now using?
I think it was, cc @sinhrks?
want me to do a PR with a vbench and a change to using len(unique)?
yep |
Updated my gist benchmark thing:
It's not completely straightforward, due to needing to handle dropna. I think the best approach is probably to add a dropna parameter to unique(). Then nunique becomes: return len(self.unique(dropna=dropna)). Adding dropna to unique isn't completely trivial, though, and I'm not sure of the best way to approach it. I did initially try calling dropna() on the values first, but that errored, I think from a test case that uses an Index instead of a Series: apparently Index doesn't have a dropna method.
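The approach described above might look something like the following standalone sketch. Both function names (unique_values, nunique) are hypothetical stand-ins, not the actual pandas implementation; the point is that NA filtering has to avoid obj.dropna() so the same path works for an Index:

```python
import pandas as pd

def unique_values(obj, dropna: bool = True):
    """Hypothetical unique() with a dropna flag, for Series or Index.

    Index lacked a dropna() method at the time of this issue, so NAs are
    filtered with a boolean mask from pd.isna() instead.
    """
    if dropna:
        obj = obj[~pd.isna(obj)]
    return obj.unique()

def nunique(obj, dropna: bool = True) -> int:
    # nunique then reduces to the length of the unique values.
    return len(unique_values(obj, dropna=dropna))
```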
yes I would pass the
I'm confused; unique doesn't currently accept a dropna parameter.
you can add the parameter, so that unique starts like
I think that will work.
@lexual how's this coming?
This is biting me when applying nunique on some groupby operations.
In one benchmark, which I'll share shortly, nunique() is slower until the Series reaches a length of about 3000.
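A rough timing harness for reproducing that comparison could look like this (a sketch using the stdlib timeit module; the sizes and repeat count are arbitrary choices, and the ~3000 crossover figure comes from the separate gist, not from this snippet):

```python
import timeit

import numpy as np
import pandas as pd

# Compare the set-based count against Series.nunique() at a few lengths.
for n in (100, 1000, 10000):
    s = pd.Series(np.random.randint(0, n, size=n).astype(float))
    t_set = timeit.timeit(lambda: len(set(s.dropna())), number=100)
    t_nunique = timeit.timeit(lambda: s.nunique(), number=100)
    print(f"n={n}: set={t_set:.4f}s nunique={t_nunique:.4f}s")
```

Both expressions being timed must agree on the answer, since nunique() drops NA by default.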