PERF: .median(axis=1) perf issues #16468

Closed
jreback opened this issue May 23, 2017 · 5 comments · Fixed by #16509
Labels
Performance Memory or execution speed performance

@jreback
Contributor

jreback commented May 23, 2017

In [2]: df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

In [3]: result1 = df.median(1)

In [4]: result2 = pd.Series(np.nanmedian(df.values, axis=1), index=df.index)

In [5]: result1.equals(result2)
Out[5]: True

In [6]: %timeit result1 = df.median(1)
241 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit df.median(1)
250 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.77 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: pd.set_option('use_bottleneck', False)

In [10]: result3 = df.median(1)

In [11]: result1.equals(result3)
Out[11]: True

In [12]: %timeit df.median(1)
317 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So, if bottleneck is installed, df.median(1) is blazingly fast. However, if it's NOT installed (or not used), we fall back to np.apply_along_axis(our_median_impl). Our median impl is pretty fast itself, but it only handles 1d, so this ends up as a Python-level loop over the rows.

To fix this we can use the np.nanmedian solution if available (it's in numpy >= 1.9; currently we support >= 1.7).
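
A rough sketch of the difference between the two paths, using only numpy. The helper names here (median_1d, rowwise_median_loop, rowwise_median_vectorized) are made up for illustration and are not the actual pandas internals; median_1d is just a stand-in for a NaN-skipping median that only accepts 1d input.

```python
import numpy as np


def median_1d(row):
    """Hypothetical stand-in for a NaN-skipping median limited to 1-D input."""
    row = row[~np.isnan(row)]
    if row.size == 0:
        return np.nan
    return np.median(row)


def rowwise_median_loop(values):
    # apply_along_axis calls median_1d once per row from Python,
    # so the cost grows with the number of rows.
    return np.apply_along_axis(median_1d, 1, values)


def rowwise_median_vectorized(values):
    # np.nanmedian handles the whole 2-D array in a single call.
    return np.nanmedian(values, axis=1)


rng = np.random.default_rng(0)
values = rng.standard_normal((10000, 2))
values[::100, 0] = np.nan  # sprinkle in some missing values

slow = rowwise_median_loop(values)
fast = rowwise_median_vectorized(values)
same = np.allclose(slow, fast, equal_nan=True)
print("results match:", same)
```

Both functions should give the same row medians; the point is only that the first one makes one Python call per row while the second makes a single call for the whole array.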

@jreback jreback added Difficulty Intermediate Performance Memory or execution speed performance labels May 23, 2017
@jreback jreback added this to the Next Major Release milestone May 23, 2017
@rohanp
Contributor

rohanp commented May 25, 2017

I get similar results

>>> df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1) 
327 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.83 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
239 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Why isn't bottleneck a dependency of Pandas? I didn't even know I did not have it installed until now. Even when I set pd.set_option('use_bottleneck', True), Pandas did not give me any warning that I did not have it installed.
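
For anyone who wants to check this up front, a minimal sketch using only the standard library. The function name bottleneck_available is made up for illustration, and this only tells you whether the package can be found, not whether pandas will actually use it.

```python
import importlib.util


def bottleneck_available():
    # find_spec returns None when the top-level package cannot be located.
    return importlib.util.find_spec("bottleneck") is not None


has_bottleneck = bottleneck_available()
print("bottleneck importable:", has_bottleneck)
```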

@jreback
Contributor Author

jreback commented May 25, 2017

see here: http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies

these could be deps, but pip used to have trouble with these things and they didn't work on all platforms.

and #9422, which bottleneck changed in 1.0 (breaking the previous, IMHO correct API).

@jreback
Contributor Author

jreback commented May 25, 2017

In any event, this is easily fixed by using np.nanmedian, as I said (which, again, only came about in the last 1-2 years).

@rohanp
Contributor

rohanp commented May 25, 2017

okay, working on the fix

@rohanp
Contributor

rohanp commented May 25, 2017

done: #16509
