BUG: Binned groupby median function calculates median on empty bins and outputs random numbers #13629

Closed
Khris777 opened this issue Jul 12, 2016 · 3 comments · Fixed by #14225
Labels
Bug, Categorical (Categorical Data Type), Groupby

Comments

@Khris777

Code Sample, a copy-pastable example if possible

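import pandas as pd
# Sample values; several of the 5-wide bins below contain no data
d = pd.DataFrame([1,2,5,6,9,3,6,5,9,7,11,36,4,7,8,25,8,24])
b = [0,5,10,15,20,25,30,35,40,45,50,55]
# Group the rows by the bin that column 0 falls into
g = d.groupby(pd.cut(d[0],b))
# Python 2 print statements, matching the traceback below
print g.mean()
print g.median()
# Look up one populated bin and one empty bin by their string keys
print g.get_group('(0, 5]').median()
print g.get_group('(40, 45]').median()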

Expected Output

                  0
0                  
(0, 5]     3.333333
(5, 10]    7.500000
(10, 15]  11.000000
(15, 20]        NaN
(20, 25]  24.500000
(25, 30]        NaN
(30, 35]        NaN
(35, 40]  36.000000
(40, 45]        NaN
(45, 50]        NaN
(50, 55]        NaN
             0
0             
(0, 5]     3.5
(5, 10]    7.5
(10, 15]  11.0
(15, 20]  18.0
(20, 25]  24.5
(25, 30]  30.5
(30, 35]  30.5
(35, 40]  36.0
(40, 45]  18.0
(45, 50]  18.0
(50, 55]  18.0
0    3.5
dtype: float64
Traceback (most recent call last):

  File "<ipython-input-9-0663486889da>", line 1, in <module>
    runfile('C:/PythonDir/test04.py', wdir='C:/PythonDir')

  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/PythonDir/test04.py", line 20, in <module>
    print g.get_group('(40, 45]').median()

  File "C:\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 587, in get_group
    raise KeyError(name)

KeyError: '(40, 45]'

This example shows that the groupby object's median() returns an arbitrary number for an empty bin instead of NaN, as mean() does. Calling get_group with the key of an empty bin raises a KeyError because that group doesn't exist, yet the full median output suggests that it does exist and that its value might even be meaningful (as in the (15, 20] or (30, 35] bins). The wrong numbers can change from run to run. Another possible output from the same code looks like this:

(0, 5]     3.500000e+00
(5, 10]    7.500000e+00
(10, 15]   1.100000e+01
(15, 20]   1.800000e+01
(20, 25]   2.450000e+01
(25, 30]   3.050000e+01
(30, 35]   3.050000e+01
(35, 40]   3.600000e+01
(40, 45]  4.927210e+165
(45, 50]  4.927210e+165
(50, 55]  4.927210e+165
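
A minimal sketch of how one might check for this symptom, written in Python 3 syntax rather than the Python 2 of the report. It assumes a pandas version that accepts the `observed` argument to `groupby` (0.23 or later); the variable names `values`, `bin_edges`, and `suspicious` are illustrative and not part of the original report. It counts the rows in each bin and lists any bin that has no rows but still reports a median.

```python
import pandas as pd

# Same data and bin edges as in the report above.
values = pd.Series([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
bin_edges = list(range(0, 60, 5))  # 0, 5, ..., 55

# observed=False keeps the empty bins in the result.
grouped = values.groupby(pd.cut(values, bin_edges), observed=False)

medians = grouped.median()
counts = grouped.count()

# Bins with no rows that still report a non-NaN median.
# On a pandas version without this bug, the selection should be empty.
suspicious = medians[counts == 0].dropna()
print(suspicious)
```

If `suspicious` is non-empty, the installed version is likely still affected by the behaviour described here.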

output of pd.show_versions()

pandas: 0.18.1

@Khris777 Khris777 changed the title Bug: Binned groupby median function calculates median on empty bins and outputs random numbers BUG: Binned groupby median function calculates median on empty bins and outputs random numbers Jul 12, 2016
@jorisvandenbossche
Member

@Khris777 Thanks for reporting!

As a workaround for now, you can do:

In [11]: g.agg(lambda x: x.median())
Out[11]:
             0
0
(0, 5]     3.5
(5, 10]    7.5
(10, 15]  11.0
(15, 20]   NaN
(20, 25]  24.5
(25, 30]   NaN
(30, 35]   NaN
(35, 40]  36.0
(40, 45]   NaN
(45, 50]   NaN
(50, 55]   NaN
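
The workaround above can be sketched in Python 3 syntax as follows. This is only an illustration on a Series rather than the DataFrame used in the report, and it assumes a pandas version that accepts `observed=False` (0.23 or later); the names `values`, `bin_edges`, `workaround`, and `builtin` are made up for the example.

```python
import pandas as pd

values = pd.Series([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
bin_edges = list(range(0, 60, 5))  # 0, 5, ..., 55

# observed=False keeps the empty bins in the result.
grouped = values.groupby(pd.cut(values, bin_edges), observed=False)

# The workaround: compute the median through a Python-level callable
# instead of calling grouped.median() directly.
workaround = grouped.agg(lambda x: x.median())

# The built-in median, for comparison.
builtin = grouped.median()

print(workaround)
print(builtin)
```

On a pandas version that includes the fix, both results should show NaN for the empty bins; on an affected version only `workaround` would be expected to.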

@jorisvandenbossche jorisvandenbossche added this to the 0.19.0 milestone Jul 12, 2016
@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, 0.19.0 Aug 21, 2016
@paul-mannino
Contributor

paul-mannino commented Sep 13, 2016

First time contributor, thought I'd take a look into this one. Do you think there's a more logical response to g.get_group('(40, 45]') than raising a KeyError?

get_group with no additional arguments is supposed to return the subset of the original dataframe whose values fall within the specified interval. If the original dataframe has no values in the interval (40, 45], there's no way to slice it into a sensible response. An empty dataframe?

@jreback
Contributor

jreback commented Sep 13, 2016

ATM, interval types are actual string reprs (and not a distinct dtype), so yes, g.get_group('(40, 45]') should be a KeyError, just like any other indexing operation.
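
For callers who would prefer an empty result over a KeyError, a small wrapper is one option. The sketch below is hypothetical: `get_group_or_empty` is not a pandas function, and it is not what the maintainers propose here. It also assumes a recent pandas, where the bins produced by `pd.cut` are `pd.Interval` objects rather than the string reprs discussed in this thread, so the keys are written as `pd.Interval(40, 45)` instead of `'(40, 45]'`.

```python
import pandas as pd


def get_group_or_empty(grouped, key, source):
    """Hypothetical helper: return the group for `key`, or an empty
    slice of `source` when the group does not exist."""
    try:
        return grouped.get_group(key)
    except KeyError:
        # iloc[0:0] keeps the dtype (and columns, for a DataFrame) with no rows.
        return source.iloc[0:0]


values = pd.Series([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
bin_edges = list(range(0, 60, 5))  # 0, 5, ..., 55
grouped = values.groupby(pd.cut(values, bin_edges), observed=False)

populated = get_group_or_empty(grouped, pd.Interval(0, 5), values)
missing = get_group_or_empty(grouped, pd.Interval(40, 45), values)

print(populated.median())
print(len(missing))
```

Keeping the KeyError in pandas itself and handling it at the call site leaves get_group consistent with other indexing operations, which is the point made above.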
