import pandas as pd
d = pd.DataFrame([1,2,5,6,9,3,6,5,9,7,11,36,4,7,8,25,8,24])
b = [0,5,10,15,20,25,30,35,40,45,50,55]
g = d.groupby(pd.cut(d[0],b))
print g.mean()
print g.median()
print g.get_group('(0, 5]').median()
print g.get_group('(40, 45]').median()
Expected Output
0
0
(0, 5] 3.333333
(5, 10] 7.500000
(10, 15] 11.000000
(15, 20] NaN
(20, 25] 24.500000
(25, 30] NaN
(30, 35] NaN
(35, 40] 36.000000
(40, 45] NaN
(45, 50] NaN
(50, 55] NaN
0
0
(0, 5] 3.5
(5, 10] 7.5
(10, 15] 11.0
(15, 20] 18.0
(20, 25] 24.5
(25, 30] 30.5
(30, 35] 30.5
(35, 40] 36.0
(40, 45] 18.0
(45, 50] 18.0
(50, 55] 18.0
0 3.5
dtype: float64
Traceback (most recent call last):
File "<ipython-input-9-0663486889da>", line 1, in <module>
runfile('C:/PythonDir/test04.py', wdir='C:/PythonDir')
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
execfile(filename, namespace)
File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/PythonDir/test04.py", line 20, in <module>
print g.get_group('(40, 45]').median()
File "C:\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 587, in get_group
raise KeyError(name)
KeyError: '(40, 45]'
This example shows how the median function of the groupby object outputs an arbitrary number for an empty bin instead of NaN, as the mean function does. Directly accessing such a bin by its key raises an error because the group doesn't exist, yet the full median output suggests that it does exist and that the value might even be meaningful (as in the (15, 20] bin or the (30, 35] bin). The wrong numbers that are returned can change randomly between runs of the same code.
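One possible way to guard against this is to mask the median wherever the bin holds no values. The following is only a sketch, not a fix from the pandas maintainers: it is written in Python 3 syntax and assumes a newer pandas than the 0.18.1 reported here, since the `observed` argument to `groupby` does not exist in that release. The names `values`, `edges`, `counts` and `safe_medians` are illustrative.

```python
import pandas as pd

# Same data and bin edges as in the report above.
values = pd.Series([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
edges = list(range(0, 60, 5))  # 0, 5, 10, ..., 55

bins = pd.cut(values, edges)

# Assumption: observed=False is available (newer pandas); it keeps the
# empty bins in the result instead of dropping them.
grouped = values.groupby(bins, observed=False)

counts = grouped.count()    # number of values per bin, 0 for empty bins
medians = grouped.median()

# Keep a median only where the bin actually contains data; NaN elsewhere.
safe_medians = medians.where(counts > 0)
print(safe_medians)
```

Checking `counts` first makes the empty bins explicit, so any number reported for them can be recognised as meaningless regardless of what `median()` returns.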
Khris777 changed the title from "Bug: Binned groupby median function calculates median on empty bins and outputs random numbers" to "BUG: Binned groupby median function calculates median on empty bins and outputs random numbers" on Jul 12, 2016
First-time contributor, thought I'd take a look into this one. Do you think there's a more logical response to g.get_group('(40, 45]') than raising a KeyError?
get_group with no additional arguments is supposed to return a subset of the original dataframe with values that fall within the specified interval. If there are no values in the interval (40,45] in the original dataframe, there's no way to slice that up into a sensible response. Empty dataframe?
ATM, interval types are actual string reprs (and not a distinct dtype), so yes, g.get_group('(40, 45]') should be a KeyError, just like any other indexing operation.
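For callers who would rather get an empty DataFrame than a KeyError, the rows of a bin can be selected with a boolean mask instead of `get_group`. This is a minimal sketch in Python 3 syntax; `rows_in_bin` is a hypothetical helper, not part of pandas, and it assumes right-closed intervals as produced by `pd.cut` with its default settings.

```python
import pandas as pd

d = pd.DataFrame([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])


def rows_in_bin(frame, column, left, right):
    """Hypothetical helper: rows with left < value <= right.

    Returns an empty DataFrame (no KeyError) when nothing falls in the bin.
    """
    mask = (frame[column] > left) & (frame[column] <= right)
    return frame[mask]


populated = rows_in_bin(d, 0, 0, 5)   # rows falling in (0, 5]
empty = rows_in_bin(d, 0, 40, 45)     # no rows fall in (40, 45]
print(len(populated), empty.empty)
```

This sidesteps the question of what `get_group` should return: the lookup never depends on whether a group key exists.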
Output of pd.show_versions()
pandas: 0.18.1