
PERF: optimized median func when bottleneck not present #16509


Merged
18 commits merged into pandas-dev:master on Jan 22, 2018

Conversation

rohanp
Contributor

@rohanp rohanp commented May 25, 2017

closes #16468

Uses np.nanmedian to compute the median instead of the internal algos.median when bottleneck is not present or is turned off. This has order-of-magnitude performance benefits:

df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

Before

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
318 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.83 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
239 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
1.89 ms ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
227 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
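
The gist of the change is to delegate to np.nanmedian (vectorized in C) rather than applying a Python-level median once per row/column. A minimal sketch of the idea (simplified; the real nanops.nanmedian also handles masks, dtypes, and the skipna flag, and the helper name here is illustrative):

import numpy as np
import pandas as pd

def median_without_bottleneck(df, axis=1):
    # compute all medians in one vectorized call instead of once per row/column
    result = np.nanmedian(df.values, axis=axis)
    index = df.index if axis == 1 else df.columns
    return pd.Series(result, index=index)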

values, mask, dtype, dtype_max = _get_values(values, skipna)

def get_median(x):
Contributor

and this will break in numpy < 1.9; you need to do this conditionally.

@jreback
Contributor

jreback commented May 25, 2017

needs an asv benchmark (run with and without bottleneck)

@jreback jreback added the Performance (Memory or execution speed performance) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels May 25, 2017
@rohanp
Contributor Author

rohanp commented May 25, 2017

Do I need to run all the tests, or can I get away with a subset?

@rohanp
Contributor Author

rohanp commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

@jreback
Contributor

jreback commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

have 2 tests: 1 with it forced on, 1 with it forced off

Do I need to run all the tests, or can I get away with a subset?

just the median tests :>
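
For example, a pair of tests could look roughly like this (a sketch only; the test name and structure are illustrative, not the actual pandas tests):

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize('use_bn', [True, False])
def test_median_bottleneck_toggle(use_bn):
    # force bottleneck on/off via the same option used in the benchmarks above
    with pd.option_context('use_bottleneck', use_bn):
        df = pd.DataFrame(np.random.randn(100, 4))
        result = df.median(axis=1)
        # no NaNs in the input, so plain np.median is a valid reference
        expected = pd.Series(np.median(df.values, axis=1), index=df.index)
        pd.testing.assert_series_equal(result, expected)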

@jreback
Contributor

jreback commented May 25, 2017

actually I don't think bottleneck is in our asv config (separate issue: it should be)

@rohanp
Contributor Author

rohanp commented May 25, 2017

just the median tests :>

does that just entail test_nanops.py?

@TomAugspurger
Contributor

@rohanp asv continuous -f 1.1 upstream/master HEAD -b median should work. See http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite

@rohanp
Contributor Author

rohanp commented May 27, 2017

I don't understand why the tests are failing. How do I see the test cases that failed?

@jreback
Contributor

jreback commented May 27, 2017

click on the red x

@rohanp
Contributor Author

rohanp commented May 29, 2017

I'm not quite sure what you mean. I figured out how to see the details of the tests that failed; for example, on CircleCI I can see:

self = <pandas.tests.test_nanops.TestnanopsDataFrame object at 0x7f67d2d65c88>

    def test_nanmedian(self):
        with warnings.catch_warnings(record=True):
            self.check_funs(nanops.nanmedian, np.median, allow_complex=False,
                            allow_str=False, allow_date=False,
>                           allow_tdelta=True, allow_obj='convert')

pandas/tests/test_nanops.py:374: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/test_nanops.py:245: in check_funs
    **kwargs)
pandas/tests/test_nanops.py:234: in check_fun
    targarnanval, **kwargs)
pandas/tests/test_nanops.py:193: in check_fun_data
    check_dtype=check_dtype)
pandas/tests/test_nanops.py:146: in check_results
    tm.assert_almost_equal(targ, res, check_dtype=check_dtype)
pandas/util/testing.py:234: in assert_almost_equal
    **kwargs)
pandas/_libs/testing.pyx:59: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:4156)
    cpdef assert_almost_equal(a, b,
pandas/_libs/testing.pyx:173: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:3274)
    raise_assert_detail(obj, msg, lobj, robj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = 'numpy array', message = 'numpy array values are different (20.0 %)'
left = '[[nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan]]'
right = '[[0.0511871111488, -0.458724351947, -0.0208572588354, 0.00471620824214, -0.487450340243], [0.254176310551, -0.0309721..., -0.2722898607, 0.174928469595], [-0.431481894512, 0.245270844279, 0.140055994353, -0.0727601005076, 0.238682362073]]'

But what I do not understand is how to find the input that caused the test to fail. I checked the test_nanops.py file and while it has the tests it does not seem to have the test cases.

@TomAugspurger
Contributor

@rohanp these tests have several layers of indirection (to allow more code reuse). I would try pytest pandas/tests/test_nanops.py -k test_nanmedian --pdb to drop into a debugger on failure so you can inspect what's going on. But the basic idea is test_nanmedian -> check_funs -> check_fun -> check_fun_data -> check_results, which is failing for one of the input arrays.

The actual arrays will be something like self.arr_float or arr_float_nan and are accessed with getattr.
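
For example (commands only; the variable names are taken from check_fun_data's signature, and the exact frame to inspect may differ):

pytest pandas/tests/test_nanops.py -k test_nanmedian --pdb

# once pdb opens at the failing assertion:
(Pdb) u                # move up the stack until you reach the check_fun_data frame
(Pdb) p testarval      # the input array passed to the nanops function under test
(Pdb) p targarval      # the input array passed to the numpy reference function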

@jreback
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@jreback
Contributor

jreback commented Jul 7, 2017

can you rebase and update

@jreback
Contributor

jreback commented Jul 19, 2017

can you rebase / update according to comments

@pep8speaks

pep8speaks commented Aug 8, 2017

Hello @rohanp! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 21, 2018 at 21:03 UTC

@rohanp
Contributor Author

rohanp commented Aug 8, 2017

Hi, sorry for the delay. I think I figured out the bug. Thanks @TomAugspurger for the debugging command. How do I rebase and update?

@jorisvandenbossche
Member

@rohanp you can do the following (assuming that pandas-dev/pandas remote repo is called 'upstream')

git fetch upstream
git checkout rohanp
git merge upstream/master # after this you can fix merge conflicts, and do 'git commit' to continue
git push origin rohanp

@@ -36,6 +37,7 @@ def set_use_bottleneck(v=True):
_USE_BOTTLENECK = v



Member

Can you remove this blank line? (In general, please try to avoid such changes on lines that you are not changing for this PR.)

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)
Member

Is this special case needed? np.nanmedian is almost as fast as np.median, even when there are no NaNs.
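
This is easy to check with a quick benchmark, for example (no timings claimed here):

>>> arr = np.random.randn(100000, 4)
>>> %timeit np.median(arr, axis=1)
>>> %timeit np.nanmedian(arr, axis=1)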

Contributor

this is not needed

Member

@rohanp Can you deal with this comment? (Remove it, or give some explanation of why you think it is needed.)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

Contributor

@jreback jreback left a comment

need a whatsnew note in performance section.
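
For reference, such a note could read roughly like the following (wording illustrative; the issue number is the one this PR closes):

- Improved performance of ``DataFrame.median()`` with ``axis=1`` when bottleneck is not installed (:issue:`16468`)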

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)

values, mask, dtype, dtype_max = _get_values(values, skipna)
Contributor

use _np_version_under1p9, which is already defined

Contributor

then define

if _np_version_under1p9:
    agg1d = get_median
    aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
    agg1d = np.nanmedian
    aggaxis = lambda values, axis: np.nanmedian(values, axis=axis)

then the code is almost the same as before, just replace with agg1d and aggaxis
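
A sketch of how those could then be used inside nanmedian (simplified; the mask/dtype handling from _get_values and the result wrapping are omitted):

# assuming agg1d and aggaxis are defined as above
if values.ndim > 1 and axis is not None:
    result = aggaxis(values, axis)  # one vectorized call along the axis
else:
    result = agg1d(values.ravel()) if values.size else np.nan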

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)
Contributor

this is not needed

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Aug 9, 2017
Member

@jorisvandenbossche jorisvandenbossche left a comment

update looks good, please have a look at the remaining comment

Further:

  • can you add a whatsnew note? (in performance improvements section)

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)
Member

@rohanp Can you deal with this comment? (Remove it, or give some explanation of why you think it is needed.)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

@@ -189,6 +189,10 @@ def check_fun_data(self, testfunc, targfunc, testarval, targarval,
targ = targfunc(targartempval, axis=axis, **kwargs)
res = testfunc(testarval, axis=axis, skipna=skipna,
**kwargs)

#if not np.allclose(targ, res):
# import pdb; pdb.set_trace()
Member

leftover to remove

@rohanp
Contributor Author

rohanp commented Aug 9, 2017

Hi @jorisvandenbossche, my changes are causing the tests to fail. I'm still looking into why. I'll remove that once I get to the bottom of what the problem is.

@jorisvandenbossche
Member

my changes are causing the tests to fail. I'm still looking into why

see my comment above: isnull was recently renamed to isna in this file, and that is the reason all tests are failing now (but we also asked you to remove the part that uses isnull, or explain why it is needed)

Contributor

@jreback jreback left a comment

couple of things to remove. pls add a whatsnew note (in perf). ping on green.

@@ -342,6 +343,9 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

#if False and not skipna and isna(np.sum(values)):
Contributor

remove

@rohanp
Contributor Author

rohanp commented Aug 12, 2017

Hi, I had put that in because the tests were failing and I thought that would fix it (but apparently not). The tests are still failing without it.

@rohanp
Contributor Author

rohanp commented Aug 19, 2017

@jorisvandenbossche thank you for that catch! I've now double checked that nothing in the diff has changed except for the contents of nanmedian, and the tests are passing on my local computer, but some of the same errors are popping up on the CI tests.

pandas/util/testing.py:1105: AssertionError
___________________________ TestPanel4d.test_median ____________________________

self = <pandas.tests.test_panel4d.TestPanel4d object at 0x7f3358785e50>

def test_median(self):
    def wrapper(x):
        if isna(x).any():
            return np.nan
        return np.median(x)
  self._check_stat_op('median', wrapper)

pandas/tests/test_panel4d.py:54:

@@ -344,12 +344,22 @@ def nanmedian(values, axis=None, skipna=True):

values, mask, dtype, dtype_max = _get_values(values, skipna)

if not skipna:
Contributor

why are you adding this?

Contributor Author

The tests compare nanmedian to numpy's median function. If this special case were not there, this function would return a non-null result for input with null values and skipna turned off. Here is a debug log demonstrating this: https://gist.github.com/rohanp/1082ad5a5e199f6b1cb8ade965c4519f

Contributor

so fix the test to do the correct comparison

Contributor Author

Isn't the desired behavior for nanmedian to be like median when skipna is turned off though? If not, then what is the skipna flag supposed to do?

Contributor

yes, so use the skipna flag as a switch when choosing which function to use.
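
A minimal sketch of such a switch (illustrative; the helper name is made up and this is not the exact code that was merged):

import numpy as np

def _median_func(skipna):
    # skipna=True: ignore NaNs (np.nanmedian)
    # skipna=False: let any NaN propagate to the result, like np.median
    return np.nanmedian if skipna else np.median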

@rohanp
Contributor Author

rohanp commented Aug 22, 2017

It looks like the tests are passing when I run pytest pandas locally as well as on AppVeyor, but failing on Travis CI and CircleCI.

Could someone please explain how I can replicate the environment for CircleCI on my local computer so that I can debug? I tried looking in the ci folder but am not sure what I should be looking for (a conda environment.yml?)

@gfyoung
Member

gfyoung commented Aug 22, 2017

The Travis failures are style-checking errors, which you can see here:

https://travis-ci.org/pandas-dev/pandas/jobs/267057585#L1937

Address those lint issues, and you should get a green-light from Travis.

The Circle failures are coming from Python 2.7. Are you using Python 3.x to test your changes? If so, then you should create a Python 2.x environment (either with virtualenv or conda) in which you should install your changes and test.

@jorisvandenbossche
Member

@rohanp The skipna_switch feels very complex to me. Is it not possible to achieve the same by doing some switch within the nanmedian function?

You still have failing median tests on Circle CI. Are you able to reproduce that locally?

@jreback
Contributor

jreback commented Aug 23, 2017

@rohanp yeah I would go back to the previous iteration. This needs to use the version detection as well (you had previously)

@jorisvandenbossche
Member

This needs to use the version detection as well (you had previously)

why? it was for numpy < 1.9, support for which we just dropped?

@rohanp
Contributor Author

rohanp commented Aug 23, 2017

Okay, I will undo skipna_switch.

I am not able to reproduce the Circle CI bugs locally, even after making a Python 2.7 environment (all the tests pass locally). Here is the conda environment I am using:

apipkg                    1.4                       <pip>
dateutil                  2.4.1                    py27_0  
execnet                   1.4.1                     <pip>
joblib                    0.11                      <pip>
lxml                      3.8.0                     <pip>
mkl                       2017.0.3                      0  
numpy                     1.13.1                   py27_0  
openssl                   1.0.2k                        2  
pip                       9.0.1                    py27_1  
py                        1.4.34                    <pip>
pytest                    3.2.1                     <pip>
pytest-forked             0.2                       <pip>
pytest-xdist              1.20.0                    <pip>
python                    2.7.13                        0  
pytz                      2017.2                   py27_0  
readline                  6.2                           2  
setuptools                27.2.0                   py27_0  
six                       1.10.0                   py27_0  
sqlite                    3.13.0                        0  
tk                        8.5.18                        0  
wheel                     0.29.0                   py27_0  
zlib                      1.2.8                         3 

@jreback
Contributor

jreback commented Sep 22, 2017

can you rebase / update

@jreback
Contributor

jreback commented Oct 28, 2017

pls rebase

@jreback
Contributor

jreback commented Nov 12, 2017

can you rebase

@jreback
Contributor

jreback commented Dec 11, 2017

master

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
2.86 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
21.7 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@jreback jreback added this to the 0.23.0 milestone Jan 21, 2018
@jreback jreback merged commit 3702492 into pandas-dev:master Jan 22, 2018
Labels
Numeric Operations (Arithmetic, Comparison, and Logical operations), Performance (Memory or execution speed performance)
Development

Successfully merging this pull request may close these issues.

PERF: .median(axis=1) perf issues
6 participants