-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: optimized median func when bottleneck not present #16509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/nanops.py
Outdated
values, mask, dtype, dtype_max = _get_values(values, skipna) | ||
|
||
def get_median(x): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and this will break in numpy < 1.9, you need to do this conditionally.
needs an asv (run with and w/o bottleneck) |
Do i need to run all the tests or can i get away with a subset? |
Also do I have to manually turn bottleneck on and off for the tests? |
have 2 tests, 1 with it forced on, 1 with forced off
just the median tests :> |
actually I don't think bottleneck is in our asv config. (separate issue is it should be) |
does that just entail |
@rohanp |
I don't understand why the tests are failing. How do I see the test cases that failed? |
click on the red x |
I'm not quite sure what you mean. I figured out how to see the details of the tests that failed; for example on circleci I can see
But what I do not understand is how to find the input that caused the test to fail. I checked the |
@rohanp these tests have several layers of indirection (to allow more code reuse). I would try The actual arrays will be something like |
can you rebase and update? |
can you rebase and update |
can you rebase / update according to comments |
Hello @rohanp! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 21, 2018 at 21:03 Hours UTC |
Hi, sorry for the delay. I think I figured out the bug. Thanks @TomAugspurger for the debugging command. How do I rebase and update? |
@rohanp you can do the following (assuming that pandas-dev/pandas remote repo is called 'upstream')
|
pandas/core/nanops.py
Outdated
@@ -36,6 +37,7 @@ def set_use_bottleneck(v=True): | |||
_USE_BOTTLENECK = v | |||
|
|||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can remove this blank line? (in general, please try to avoid such changes on lines that you are not changing for this PR)
pandas/core/nanops.py
Outdated
@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True): | |||
@bottleneck_switch() | |||
def nanmedian(values, axis=None, skipna=True): | |||
|
|||
if not skipna and isnull(np.sum(values)): | |||
return np.median(values, axis=axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this special case needed? As np.nanmedian
is almost as fast as np.median
, also when there are no NaNs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohanp Can you deal with this comment? (remove or give some explanation on why you think it is needed)
BTW, isnull
was recently be renamed to isna
in this file, that is the reason all tests are failing now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a whatsnew note in performance section.
pandas/core/nanops.py
Outdated
@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True): | |||
@bottleneck_switch() | |||
def nanmedian(values, axis=None, skipna=True): | |||
|
|||
if not skipna and isnull(np.sum(values)): | |||
return np.median(values, axis=axis) | |||
|
|||
values, mask, dtype, dtype_max = _get_values(values, skipna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use _np_version_under1p19
which are already defined
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then define
if _np_under_version1p19:
agg1d = get_median
aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
agg1d = np.nanmedian
aggaxis = lambda values, axis: np.nanmedian(values, axis)
then the code is almost the same as before, just replace with agg1d
and aggaxis
pandas/core/nanops.py
Outdated
@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True): | |||
@bottleneck_switch() | |||
def nanmedian(values, axis=None, skipna=True): | |||
|
|||
if not skipna and isnull(np.sum(values)): | |||
return np.median(values, axis=axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update looks good, please have a look at the remaining comment
Further:
- can you add a whatsnew note? (in performance improvements section)
pandas/core/nanops.py
Outdated
@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True): | |||
@bottleneck_switch() | |||
def nanmedian(values, axis=None, skipna=True): | |||
|
|||
if not skipna and isnull(np.sum(values)): | |||
return np.median(values, axis=axis) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohanp Can you deal with this comment? (remove or give some explanation on why you think it is needed)
BTW, isnull
was recently be renamed to isna
in this file, that is the reason all tests are failing now.
pandas/tests/test_nanops.py
Outdated
@@ -189,6 +189,10 @@ def check_fun_data(self, testfunc, targfunc, testarval, targarval, | |||
targ = targfunc(targartempval, axis=axis, **kwargs) | |||
res = testfunc(testarval, axis=axis, skipna=skipna, | |||
**kwargs) | |||
|
|||
#if not np.allclose(targ, res): | |||
# import pdb; pdb.set_trace() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leftover to remove
Hi @jorisvandenbossche, my changes are causing the tests to fail. I'm still looking into why. I'll remove that once I get to the bottom of what the problem is. |
see my comment above, isnull was recently be renamed to isna in this file, that is the reason all tests are failing now (but, we also asked to remove the part that uses isnull (or explain why it is needed)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of things to remove. pls add a whatsnew note (in perf). ping on green.
pandas/core/nanops.py
Outdated
@@ -342,6 +343,9 @@ def nanmean(values, axis=None, skipna=True): | |||
@bottleneck_switch() | |||
def nanmedian(values, axis=None, skipna=True): | |||
|
|||
#if False and not skipna and isna(np.sum(values)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
Hi, I had put that in because the tests were failing and I thought that would fix it (but apparently not). The tests are still failing without it. |
@jorisvandenbossche thank you for that catch! I've now double checked that nothing in the diff has changed except for the contents of pandas/util/testing.py:1105: AssertionError self = <pandas.tests.test_panel4d.TestPanel4d object at 0x7f3358785e50>
pandas/tests/test_panel4d.py:54: |
pandas/core/nanops.py
Outdated
@@ -344,12 +344,22 @@ def nanmedian(values, axis=None, skipna=True): | |||
|
|||
values, mask, dtype, dtype_max = _get_values(values, skipna) | |||
|
|||
if not skipna: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you adding this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests compare nanmedian to numpy's median function. If this special case were not there, this function would return a non-null result for input with null values and skipna turned off. Here is a debug log demonstrating this: https://gist.github.com/rohanp/1082ad5a5e199f6b1cb8ade965c4519f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so fix the test to do the correct comparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the desired behavior for nanmedian to be like median when skipna is turned off though? If not, then what is the skipna flag supposed to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, so use the skipna
flag as a switch when generating which function to use.
It looks like the tests are passing on when I run Could someone please explain how I can replicate the environment for CircleCI on my local computer so that I can debug? I tried looking in the ci folder but am not sure what I should be looking for (a conda environment.yml?) |
The Travis failures are style-checking errors, which you can see here: https://travis-ci.org/pandas-dev/pandas/jobs/267057585#L1937 Address those lint issues, and you should get a green-light from Travis. The Circle failures are coming from Python 2.7. Are you using Python 3.x to test your changes? If so, then you should create a Python 2.x environment (either with |
@rohanp The You still have failing median tests on Circle CI. Are you able to reproduce that locally? |
@rohanp yeah I would go back to the previous iteration. This needs to use the version detection as well (you had previously) |
why? it was for numpy < 1.9, which we just removed? |
Okay, I will undo I am not able to reproduce the Circle CI bugs locally, even after making a Python 2.7 environment (all the tests pass locally). Here is the conda environment I am using:
|
can you rebase / update |
pls rebase |
can you rebase |
master
PR
|
closes #16468
Uses np.nanmedian to compute median instead of internal algos.median when bottleneck is not present/turned off. Has order of magnitude performance benefits
df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))
Before
After