PERF: Series.nunique can compute unique, then remove na #40865

rhshadrach · 2021-04-10T12:59:10Z

Currently we first remove NaNs, then call len on the result of Series.unique. Except for Series that are mostly null values, it is faster to switch the order of these operations:

import numpy as np
import pandas as pd

n = 100_000
part_nan = 10  # NaNs per repeated block of 100 distinct values
ser = pd.Series(n * (part_nan * [np.nan] + list(range(100)))).astype(float)

%timeit ser.nunique()
%timeit (~np.isnan(ser.unique())).sum()

gives

104 ms ± 273 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
67 ms ± 567 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Changing part_nan to 100 gives

126 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
96.5 ms ± 431 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

On my machine, they are about equal when part_nan is 250 (~70% null values).
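For readers who want to check that the two orderings agree before timing them, here is a minimal sketch. The helper names `nunique_dropna_first` and `nunique_unique_first` are hypothetical and only illustrate the two orderings described above; they are not the pandas internals. The Series is smaller than the one in the benchmark so that it runs quickly.

```python
import numpy as np
import pandas as pd


def nunique_dropna_first(ser: pd.Series) -> int:
    """Hypothetical helper: drop missing values, then count distinct values."""
    return len(ser.dropna().unique())


def nunique_unique_first(ser: pd.Series) -> int:
    """Hypothetical helper: compute distinct values, then discard missing ones."""
    uniques = ser.unique()
    # pd.isna flags NaN entries in the array of distinct values.
    return int((~pd.isna(uniques)).sum())


# Same shape of data as in the benchmark, but with fewer repetitions.
ser = pd.Series(1_000 * (10 * [np.nan] + list(range(100)))).astype(float)

count_a = nunique_dropna_first(ser)
count_b = nunique_unique_first(ser)
print(count_a, count_b, ser.nunique())
```

The second ordering only has to scan the array of distinct values for NaN, which is typically far shorter than the full Series. That is presumably why it wins unless most of the values are null.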


taytzehao · 2021-04-10T14:04:45Z

Take

rhshadrach · 2021-04-10T14:23:49Z

@taytzehao - Apologies, I should have indicated this in the issue itself. This was added for a sprint today; would you be okay with finding a different issue?

KenilMehta · 2021-04-28T06:49:20Z

is this issue open for contributions?

jreback · 2021-04-28T11:28:55Z

all issues are open for contribution


rhshadrach added the Performance (Memory or execution speed performance), good first issue, and Series (Series data structure) labels Apr 10, 2021

rhshadrach added this to the Contributions Welcome milestone Apr 10, 2021

github-actions bot assigned taytzehao Apr 10, 2021

jorisvandenbossche mentioned this issue Apr 10, 2021

Improve the performance of Series.nunique MarcoGorelli/pyladieslondon-sprints#35

Closed

rhshadrach unassigned taytzehao Apr 10, 2021

KenilMehta pushed a commit to KenilMehta/pandas that referenced this issue Apr 28, 2021

PERF: Optimising Series.nunique() for NaN values pandas-dev#40865

7a13218

KenilMehta mentioned this issue Apr 29, 2021

Optimising Series.nunique for Nan values #40865 #41218

Closed

1 task

KenilMehta pushed a commit to KenilMehta/pandas that referenced this issue Apr 29, 2021

PERF: Optimising Series.nunique() for NaN values pandas-dev#40865

f58048d

KenilMehta mentioned this issue Apr 30, 2021

Optimising Series.nunique for Nan values #40865 #41236

Merged

1 task

jreback modified the milestones: Contributions Welcome, 1.3 May 2, 2021

jreback closed this as completed in #41236 May 3, 2021

jreback pushed a commit that referenced this issue May 3, 2021

Optimising Series.nunique for Nan values #40865 (#41236)

c61e66e

yeshsurya pushed a commit to yeshsurya/pandas that referenced this issue May 6, 2021

Optimising Series.nunique for Nan values pandas-dev#40865 (pandas-dev…

175209d

…#41236)

JulianWgs pushed a commit to JulianWgs/pandas that referenced this issue Jul 3, 2021

Optimising Series.nunique for Nan values pandas-dev#40865 (pandas-dev…

0e44192

…#41236)

TendouArisu mentioned this issue Feb 16, 2025

Potential performance issue: Unreliable performance of pd.Series.nunique in pandas 1.2.4 Priya22/EmotionDynamics#4

Open
