PERF: Series.nunique can compute unique, then remove na #40865

Closed
rhshadrach opened this issue Apr 10, 2021 · 4 comments · Fixed by #41236
Labels
good first issue Performance Memory or execution speed performance Series Series data structure

@rhshadrach
Member

rhshadrach commented Apr 10, 2021

Currently, we first remove NaNs, then call len on the result of Series.unique. Except for Series that are mostly null values, it is more performant to switch the order of these operations, i.e. compute the unique values first and then drop nulls from that result:

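import numpy as np
import pandas as pd

n = 100_000
part_nan = 10  # NaNs per repeated chunk of part_nan + 100 values
ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)

# Current behavior: remove NaNs first, then count the unique values
%timeit ser.nunique()
# Proposed order: compute the unique values first, then discard NaN
%timeit (~np.isnan(ser.unique())).sum()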

gives

104 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67 ms ± 567 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Changing part_nan to 100 gives

126 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
96.5 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On my machine, they are about equal when part_nan is 250 (~70% null values).
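
The reordering described above can be sketched as a standalone helper. This is only an illustration of the idea, not the change that was eventually merged. The name `nunique_unique_first` is hypothetical, and `pd.isna` is used in place of `np.isnan` on the assumption that the helper should also accept non-float dtypes:

```python
import numpy as np
import pandas as pd


def nunique_unique_first(ser: pd.Series, dropna: bool = True) -> int:
    """Hypothetical helper: deduplicate first, then drop missing values."""
    uniques = ser.unique()
    if dropna:
        # The NA mask is built over the array of uniques, which is usually
        # far shorter than the original Series.
        uniques = uniques[~pd.isna(uniques)]
    return len(uniques)


# Smaller version of the Series from the issue: 10 NaNs per 100 distinct values.
ser = pd.Series(1_000 * (10 * [np.nan] + list(range(100)))).astype(float)

# Both orderings should agree on the count, with and without NaN dropped.
assert nunique_unique_first(ser) == ser.nunique()
assert nunique_unique_first(ser, dropna=False) == ser.nunique(dropna=False)
```

The trade-off is that when most values are null, dropping them first shrinks the input to `unique` substantially, which is consistent with the crossover observed at around 70% null values.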

@rhshadrach rhshadrach added Performance Memory or execution speed performance good first issue Series Series data structure labels Apr 10, 2021
@rhshadrach rhshadrach added this to the Contributions Welcome milestone Apr 10, 2021
@taytzehao
Contributor

Take

@rhshadrach
Member Author

@taytzehao - Apologies, I should have indicated this in the issue itself. This was added for a sprint today; would you be okay with finding a different issue?

@KenilMehta
Contributor

Is this issue open for contributions?

@jreback
Contributor

jreback commented Apr 28, 2021

all issues are open for contribution
