
PERF: optimized median func when bottleneck not present #16509


Merged
18 commits merged into pandas-dev:master on Jan 22, 2018

Conversation

rohanp
Contributor

@rohanp rohanp commented May 25, 2017

closes #16468

Uses np.nanmedian to compute the median instead of the internal algos.median when bottleneck is not present or is turned off. This has order-of-magnitude performance benefits:

df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

Before

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
318 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.83 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
239 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
1.89 ms ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
227 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
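
The gist of the change is to delegate to np.nanmedian (vectorized in C) rather than applying a Python-level median once per row/column. A minimal sketch of the idea (simplified; the real nanops.nanmedian also handles masks, dtypes, and the skipna flag, and the helper name here is illustrative):

import numpy as np
import pandas as pd

def median_without_bottleneck(df, axis=1):
    # compute all medians in one vectorized call instead of once per row/column
    result = np.nanmedian(df.values, axis=axis)
    index = df.index if axis == 1 else df.columns
    return pd.Series(result, index=index)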

values, mask, dtype, dtype_max = _get_values(values, skipna)

def get_median(x):
Contributor

and this will break in numpy < 1.9; you need to do this conditionally.

@jreback
Contributor

jreback commented May 25, 2017

needs an asv benchmark (run with and without bottleneck)

@jreback jreback added the Performance (Memory or execution speed performance) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels May 25, 2017
@rohanp
Contributor Author

rohanp commented May 25, 2017

Do I need to run all the tests, or can I get away with a subset?

@rohanp
Contributor Author

rohanp commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

@jreback
Contributor

jreback commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

have 2 tests: 1 with it forced on, 1 with it forced off

Do I need to run all the tests, or can I get away with a subset?

just the median tests :>
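
For example, a pair of tests could look roughly like this (a sketch only; the test name and structure are illustrative, not the actual pandas tests):

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize('use_bn', [True, False])
def test_median_bottleneck_toggle(use_bn):
    # force bottleneck on/off via the same option used in the benchmarks above
    with pd.option_context('use_bottleneck', use_bn):
        df = pd.DataFrame(np.random.randn(100, 4))
        result = df.median(axis=1)
        # no NaNs in the input, so plain np.median is a valid reference
        expected = pd.Series(np.median(df.values, axis=1), index=df.index)
        pd.testing.assert_series_equal(result, expected)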

@jreback
Contributor

jreback commented May 25, 2017

actually I don't think bottleneck is in our asv config (separate issue: it should be)

@rohanp
Contributor Author

rohanp commented May 25, 2017

just the median tests :>

does that just entail test_nanops.py?

@TomAugspurger
Contributor

@rohanp asv continuous -f 1.1 upstream/master HEAD -b median should work. See http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite

@rohanp
Contributor Author

rohanp commented May 27, 2017

I don't understand why the tests are failing. How do I see the test cases that failed?

@jreback
Contributor

jreback commented May 27, 2017

click on the red x

@rohanp
Contributor Author

rohanp commented May 29, 2017

I'm not quite sure what you mean. I figured out how to see the details of the tests that failed; for example, on CircleCI I can see:

self = <pandas.tests.test_nanops.TestnanopsDataFrame object at 0x7f67d2d65c88>

    def test_nanmedian(self):
        with warnings.catch_warnings(record=True):
            self.check_funs(nanops.nanmedian, np.median, allow_complex=False,
                            allow_str=False, allow_date=False,
>                           allow_tdelta=True, allow_obj='convert')

pandas/tests/test_nanops.py:374: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/test_nanops.py:245: in check_funs
    **kwargs)
pandas/tests/test_nanops.py:234: in check_fun
    targarnanval, **kwargs)
pandas/tests/test_nanops.py:193: in check_fun_data
    check_dtype=check_dtype)
pandas/tests/test_nanops.py:146: in check_results
    tm.assert_almost_equal(targ, res, check_dtype=check_dtype)
pandas/util/testing.py:234: in assert_almost_equal
    **kwargs)
pandas/_libs/testing.pyx:59: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:4156)
    cpdef assert_almost_equal(a, b,
pandas/_libs/testing.pyx:173: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:3274)
    raise_assert_detail(obj, msg, lobj, robj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = 'numpy array', message = 'numpy array values are different (20.0 %)'
left = '[[nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan]]'
right = '[[0.0511871111488, -0.458724351947, -0.0208572588354, 0.00471620824214, -0.487450340243], [0.254176310551, -0.0309721..., -0.2722898607, 0.174928469595], [-0.431481894512, 0.245270844279, 0.140055994353, -0.0727601005076, 0.238682362073]]'

But what I do not understand is how to find the input that caused the test to fail. I checked the test_nanops.py file and while it has the tests it does not seem to have the test cases.

@TomAugspurger
Contributor

@rohanp these tests have several layers of indirection (to allow more code reuse). I would try pytest pandas/tests/test_nanops.py -k test_nanmedian --pdb to drop into a debugger on failure so you can inspect what's going on. But the basic idea is test_nanmedian -> check_funs -> check_fun -> check_fun_data -> check_results, which is failing for one of the input arrays.

The actual arrays will be something like self.arr_float or arr_float_nan and are accessed with getattr.
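
For example (commands only; the variable names are taken from check_fun_data's signature, and the exact frame to inspect may differ):

pytest pandas/tests/test_nanops.py -k test_nanmedian --pdb

# once pdb opens at the failing assertion:
(Pdb) u                # move up the stack until you reach the check_fun_data frame
(Pdb) p testarval      # the input array passed to the nanops function under test
(Pdb) p targarval      # the input array passed to the numpy reference function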

@jreback
Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@jreback
Contributor

jreback commented Jul 7, 2017

can you rebase and update

@jreback
Contributor

jreback commented Jul 19, 2017

can you rebase / update according to comments

@pep8speaks

pep8speaks commented Aug 8, 2017

Hello @rohanp! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 21, 2018 at 21:03 UTC

@rohanp
Contributor Author

rohanp commented Aug 8, 2017

Hi, sorry for the delay. I think I figured out the bug. Thanks @TomAugspurger for the debugging command. How do I rebase and update?

@jorisvandenbossche
Member

@rohanp you can do the following (assuming that pandas-dev/pandas remote repo is called 'upstream')

git fetch upstream
git checkout rohanp
git merge upstream/master # after this you can fix merge conflicts, and do 'git commit' to continue
git push origin rohanp

@@ -36,6 +37,7 @@ def set_use_bottleneck(v=True):
_USE_BOTTLENECK = v



Member

Can you remove this blank line? (In general, please try to avoid such changes on lines that you are not changing for this PR.)

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)
Member

Is this special case needed? np.nanmedian is almost as fast as np.median, even when there are no NaNs.
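
This is easy to check with a quick benchmark, for example (no timings claimed here):

>>> arr = np.random.randn(100000, 4)
>>> %timeit np.median(arr, axis=1)
>>> %timeit np.nanmedian(arr, axis=1)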

Contributor

this is not needed

Member

@rohanp Can you deal with this comment? (Remove it, or give some explanation of why you think it is needed.)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

Contributor

@jreback jreback left a comment

need a whatsnew note in performance section.
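
For reference, such a note could read roughly like the following (wording illustrative; the issue number is the one this PR closes):

- Improved performance of ``DataFrame.median()`` with ``axis=1`` when bottleneck is not installed (:issue:`16468`)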

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)

values, mask, dtype, dtype_max = _get_values(values, skipna)
Contributor

use _np_version_under1p9, which is already defined

Contributor

then define

if _np_version_under1p9:
    agg1d = get_median
    aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
    agg1d = np.nanmedian
    aggaxis = lambda values, axis: np.nanmedian(values, axis=axis)

then the code is almost the same as before, just replace with agg1d and aggaxis
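
A sketch of how those could then be used inside nanmedian (simplified; the mask/dtype handling from _get_values and the result wrapping are omitted):

# assuming agg1d and aggaxis are defined as above
if values.ndim > 1 and axis is not None:
    result = aggaxis(values, axis)  # one vectorized call along the axis
else:
    result = agg1d(values.ravel()) if values.size else np.nan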

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)
Contributor

this is not needed

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Aug 9, 2017
Member

@jorisvandenbossche jorisvandenbossche left a comment

update looks good, please have a look at the remaining comment

Further:

  • can you add a whatsnew note? (in performance improvements section)

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)
Member

@rohanp Can you deal with this comment? (Remove it, or give some explanation of why you think it is needed.)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

@@ -189,6 +189,10 @@ def check_fun_data(self, testfunc, targfunc, testarval, targarval,
targ = targfunc(targartempval, axis=axis, **kwargs)
res = testfunc(testarval, axis=axis, skipna=skipna,
**kwargs)

#if not np.allclose(targ, res):
# import pdb; pdb.set_trace()
Member

leftover to remove

@rohanp
Contributor Author

rohanp commented Aug 9, 2017

Hi @jorisvandenbossche, my changes are causing the tests to fail. I'm still looking into why. I'll remove that once I get to the bottom of what the problem is.

@jorisvandenbossche
Member

my changes are causing the tests to fail. I'm still looking into why

see my comment above: isnull was recently renamed to isna in this file, and that is the reason all tests are failing now (but we also asked you to remove the part that uses isnull, or explain why it is needed)

Contributor

@jreback jreback left a comment

couple of things to remove. pls add a whatsnew note (in perf). ping on green.

@@ -342,6 +343,9 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

#if False and not skipna and isna(np.sum(values)):
Contributor

remove

@rohanp
Contributor Author

rohanp commented Aug 12, 2017

Hi, I had put that in because the tests were failing and I thought that would fix it (but apparently not). The tests are still failing without it.

@rohanp
Contributor Author

rohanp commented Aug 19, 2017

@jorisvandenbossche thank you for that catch! I've now double checked that nothing in the diff has changed except for the contents of nanmedian, and the tests are passing on my local computer, but some of the same errors are popping up on the CI tests.

pandas/util/testing.py:1105: AssertionError
___________________________ TestPanel4d.test_median ____________________________

self = <pandas.tests.test_panel4d.TestPanel4d object at 0x7f3358785e50>

def test_median(self):
    def wrapper(x):
        if isna(x).any():
            return np.nan
        return np.median(x)
  self._check_stat_op('median', wrapper)

pandas/tests/test_panel4d.py:54:

@@ -344,12 +344,22 @@ def nanmedian(values, axis=None, skipna=True):

values, mask, dtype, dtype_max = _get_values(values, skipna)

if not skipna:
Contributor

why are you adding this?

Contributor Author

The tests compare nanmedian to numpy's median function. If this special case were not there, this function would return a non-null result for input with null values and skipna turned off. Here is a debug log demonstrating this: https://gist.github.com/rohanp/1082ad5a5e199f6b1cb8ade965c4519f

Contributor

so fix the test to do the correct comparison

Contributor Author

Isn't the desired behavior for nanmedian to be like median when skipna is turned off though? If not, then what is the skipna flag supposed to do?

Contributor

yes, so use the skipna flag as a switch when choosing which function to use.
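
A minimal sketch of such a switch (illustrative; the helper name is made up and this is not the exact code that was merged):

import numpy as np

def _median_func(skipna):
    # skipna=True: ignore NaNs (np.nanmedian)
    # skipna=False: let any NaN propagate to the result, like np.median
    return np.nanmedian if skipna else np.median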

@rohanp
Contributor Author

rohanp commented Aug 22, 2017

It looks like the tests are passing when I run pytest pandas locally as well as on AppVeyor, but failing on Travis CI and CircleCI.

Could someone please explain how I can replicate the environment for CircleCI on my local computer so that I can debug? I tried looking in the ci folder but am not sure what I should be looking for (a conda environment.yml?)

@gfyoung
Member

gfyoung commented Aug 22, 2017

The Travis failures are style-checking errors, which you can see here:

https://travis-ci.org/pandas-dev/pandas/jobs/267057585#L1937

Address those lint issues, and you should get a green-light from Travis.

The Circle failures are coming from Python 2.7. Are you using Python 3.x to test your changes? If so, then you should create a Python 2.x environment (either with virtualenv or conda) in which you should install your changes and test.

@jorisvandenbossche
Member

@rohanp The skipna_switch feels very complex to me. Is it not possible to achieve the same by doing some switch within the nanmedian function?

You still have failing median tests on Circle CI. Are you able to reproduce that locally?

@jreback
Contributor

jreback commented Aug 23, 2017

@rohanp yeah I would go back to the previous iteration. This needs to use the version detection as well (you had previously)

@jorisvandenbossche
Member

This needs to use the version detection as well (you had previously)

why? it was for numpy < 1.9, support for which we just dropped?

@rohanp
Contributor Author

rohanp commented Aug 23, 2017

Okay, I will undo skipna_switch.

I am not able to reproduce the Circle CI bugs locally, even after making a Python 2.7 environment (all the tests pass locally). Here is the conda environment I am using:

apipkg                    1.4                       <pip>
dateutil                  2.4.1                    py27_0  
execnet                   1.4.1                     <pip>
joblib                    0.11                      <pip>
lxml                      3.8.0                     <pip>
mkl                       2017.0.3                      0  
numpy                     1.13.1                   py27_0  
openssl                   1.0.2k                        2  
pip                       9.0.1                    py27_1  
py                        1.4.34                    <pip>
pytest                    3.2.1                     <pip>
pytest-forked             0.2                       <pip>
pytest-xdist              1.20.0                    <pip>
python                    2.7.13                        0  
pytz                      2017.2                   py27_0  
readline                  6.2                           2  
setuptools                27.2.0                   py27_0  
six                       1.10.0                   py27_0  
sqlite                    3.13.0                        0  
tk                        8.5.18                        0  
wheel                     0.29.0                   py27_0  
zlib                      1.2.8                         3 

@jreback
Contributor

jreback commented Sep 22, 2017

can you rebase / update

@jreback
Contributor

jreback commented Oct 28, 2017

pls rebase

@jreback
Contributor

jreback commented Nov 12, 2017

can you rebase

@jreback
Contributor

jreback commented Dec 11, 2017

master

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
2.86 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
21.7 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@jreback jreback added this to the 0.23.0 milestone Jan 21, 2018
@jreback jreback merged commit 3702492 into pandas-dev:master Jan 22, 2018
Labels
Numeric Operations (Arithmetic, Comparison, and Logical operations), Performance (Memory or execution speed performance)
Development

Successfully merging this pull request may close these issues.

PERF: .median(axis=1) perf issues
6 participants