PERF: .median(axis=1) perf issues #16468

jreback · 2017-05-23T23:21:15Z

In [2]: df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

In [3]: result1 = df.median(1)

In [4]: result2 = pd.Series(np.nanmedian(df.values, axis=1), index=df.index)

In [5]: result1.equals(result2)
Out[5]: True

In [6]: %timeit result1 = df.median(1)
241 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit df.median(1)
250 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.77 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: pd.set_option('use_bottleneck', False)

In [10]: result3 = df.median(1)

In [11]: result1.equals(result3)
Out[11]: True

In [12]: %timeit df.median(1)
317 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So, if bottleneck is installed, df.median(1) is very fast (about 250 µs here). However, if it's NOT installed (or not used), we fall back to np.apply_along_axis(our_median_impl). Our median implementation is pretty fast itself, but it only handles 1-D input, so this amounts to a Python-level loop over the rows (about 317 ms here).

To fix this, we can use the np.nanmedian solution if available (it's in numpy >= 1.9; we currently support >= 1.7).
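As a rough illustration of the two code paths described above, here is a minimal sketch that compares a row-by-row fallback against a single vectorized np.nanmedian call. The helper names (median_1d, row_median_slow, row_median_fast) are hypothetical and are not pandas internals; median_1d is only a stand-in for a 1-D, NaN-skipping median.

```python
import numpy as np


def median_1d(values):
    """Hypothetical stand-in for a median that only handles 1-D input."""
    valid = values[~np.isnan(values)]  # skip NaNs, as a NaN-aware median would
    if valid.size == 0:
        return np.nan
    return np.median(valid)


def row_median_slow(arr):
    # apply_along_axis calls median_1d once per row from Python,
    # which is the slow pattern described in this issue.
    return np.apply_along_axis(median_1d, 1, arr)


def row_median_fast(arr):
    # One vectorized call; requires numpy >= 1.9.
    return np.nanmedian(arr, axis=1)


rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 2))
data[::50, 0] = np.nan  # sprinkle in some NaNs (never a whole row)

slow = row_median_slow(data)
fast = row_median_fast(data)
print(np.allclose(slow, fast))
```

Both functions should agree on the result; the difference is only in how many Python-level calls are made per row.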


rohanp · 2017-05-25T19:30:33Z

I get similar results

>>> df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1) 
327 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.83 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
239 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Why isn't bottleneck a dependency of Pandas? I didn't even know I did not have it installed until now. Even when I set pd.set_option('use_bottleneck', True), Pandas did not give me any warning that I did not have it installed.
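Since setting the option gives no warning when bottleneck is missing, one way to check for yourself is to ask Python whether the module can be found. This is only a sketch using the standard library; has_module is a hypothetical helper, not a pandas function.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


# Reports whether bottleneck is importable in the current environment.
print("bottleneck installed:", has_module("bottleneck"))
```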

jreback · 2017-05-25T20:48:22Z

see here: http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies

these could be deps, but pip used to have trouble with these things and they didn't work on all platforms.

There is also #9422, where bottleneck changed behavior in 1.0 (breaking the previous, IMHO correct, API).

jreback · 2017-05-25T20:48:38Z

in any event, this is easily fixed by using np.nanmedian, as I said (which, again, only became available in the last 1-2 years).

rohanp · 2017-05-25T21:22:26Z

okay, working on the fix

rohanp · 2017-05-25T22:11:07Z

done: #16509

jreback added the Difficulty Intermediate and Performance (Memory or execution speed performance) labels May 23, 2017

jreback added this to the Next Major Release milestone May 23, 2017

rohanp mentioned this issue Dec 11, 2017

PERF: optimized median func when bottleneck not present #16509

Merged

jreback modified the milestones: Next Major Release, 0.23.0 Jan 21, 2018

jreback closed this as completed in #16509 Jan 22, 2018
