Inconsistent behaviour of empty groups when grouping with one vs. many #23865

Closed · mojones opened this issue Nov 23, 2018 · 8 comments · Fixed by #35022
Comments
@mojones
Contributor

mojones commented Nov 23, 2018

Not sure if this is a bug or expected behaviour, but it's something that catches me out constantly.

import pandas as pd
# Dummy data. Each row is a day and I'm recording where I worked, 
# what percentage of my time was spent coding, 
# and how many cups of coffee I drank
df = pd.DataFrame({
    'place': ['office', 'cafe', 'office', 'cafe', 'office', 'cafe', 'cafe', 'cafe', 'office', 'office'],
    'percentage': [5, 15, 72, 24, 34, 35, 37, 47, 78, 67],
    'coffee': [1, 2, 0, 0, 1, 2, 0, 0, 1, 1]})

# this includes zero counts
df.groupby(pd.cut(df['percentage'], range(0, 100, 10))).size()

# this doesn't
df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)), 'place']
).size()

Problem description

I want to look at the distribution of my percentage column, so I group by the output of pd.cut so that I can use sensible bins. This is the first groupby in the code example. The output is as expected:

percentage
(0, 10]     1
(10, 20]    1
(20, 30]    1
(30, 40]    3
(40, 50]    1
(50, 60]    0
(60, 70]    1
(70, 80]    2
(80, 90]    0
dtype: int64

This includes the bins for which there were zero observations (50-60% and 80-90%).

Now I also want to group by place; this is the second groupby in the example. This time the empty groups disappear:

percentage  place 
(0, 10]     office    1
(10, 20]    cafe      1
(20, 30]    cafe      1
(30, 40]    cafe      2
            office    1
(40, 50]    cafe      1
(60, 70]    office    1
(70, 80]    office    2
dtype: int64

When I unstack this to make a summary table the empty percentage bins are missing:

place       cafe  office
percentage              
(0, 10]      NaN     1.0
(10, 20]     1.0     NaN
(20, 30]     1.0     NaN
(30, 40]     2.0     1.0
(40, 50]     1.0     NaN
(60, 70]     NaN     1.0
(70, 80]     NaN     2.0

This will cause a problem when I go to plot this data, as the spacing on my 'percentage' axis will not be consistent. If I'm just counting the size of the bins then I can get round the problem by reindexing with the original categories:

df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)), 'place']
).size().unstack().reindex(
    pd.cut(df['percentage'], range(0, 100, 10)).cat.categories
).fillna(0).stack()

percentage  place 
(0, 10]     cafe      0.0
            office    1.0
(10, 20]    cafe      1.0
            office    0.0
(20, 30]    cafe      1.0
            office    0.0
(30, 40]    cafe      2.0
            office    1.0
(40, 50]    cafe      1.0
            office    0.0
(50, 60]    cafe      0.0
            office    0.0
(60, 70]    cafe      0.0
            office    1.0
(70, 80]    cafe      0.0
            office    2.0
(80, 90]    cafe      0.0
            office    0.0
dtype: float64

but this seems like a complicated solution to what should be a relatively common problem. And if I want to do some other type of aggregation, e.g. calculate the mean number of cups of coffee for each group, I can't figure out how to get the missing groups back at all:

df.groupby(
    [pd.cut(df['percentage'], range(0, 100, 10)), 'place']
)['coffee'].mean()

percentage  place 
(0, 10]     office    1.0
(10, 20]    cafe      2.0
(20, 30]    cafe      0.0
(30, 40]    cafe      1.0
            office    1.0
(40, 50]    cafe      0.0
(60, 70]    office    1.0
(70, 80]    office    0.5

I have missing groups, but I can't fix it with unstack/fillna/stack as I don't want to fill in a value; I want to leave it as missing data, but still have the group appear.
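
The closest I can get is a manual reindex against the full product of bins and places. Here is a sketch (df is the frame from the top of this issue; names like bins, means and full_index are just illustrative), but I'd rather groupby did this for me:

bins = pd.cut(df['percentage'], range(0, 100, 10))
means = df.groupby([bins, 'place'])['coffee'].mean()

# Build every (bin, place) combination and reindex against it, so groups
# with no observations reappear as NaN instead of being dropped.
full_index = pd.MultiIndex.from_product(
    [bins.cat.categories, sorted(df['place'].unique())],
    names=means.index.names)
means = means.reindex(full_index)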

Reading through previous issues, it sounds like I am describing this:

#8138

but the thread says it's fixed, so I can't figure out what is different in my situation.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.8.1
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@WillAyd
Member

WillAyd commented Nov 23, 2018

Can you produce a more minimal example to reproduce the error? I'm assuming pd.cut isn't related to the issue at all here, but it's tough to tell from the example(s) provided.

@WillAyd added the Groupby and Needs Info (Clarification about behavior needed to assess issue) labels Nov 23, 2018
@mojones
Contributor Author

mojones commented Nov 23, 2018

Here's the most minimal example I think I can come up with:

df = pd.DataFrame({'foo' : [1,1,3,3], 'bar' : ['A', 'B', 'A', 'B']})

# gives zero count for the (1, 2] bin as expected
df.groupby(pd.cut(df['foo'], bins=[0,1,2,3])).size()

# the (1,2] bin is not shown
df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar']).size()

Does this make sense? I can't think of a meaningful example that doesn't use pd.cut, because that's the only way that we get index labels that aren't directly derived from the data. I.e. if we just group on two columns, then every label is in the data at least once by definition, so it will appear at least once in the index.
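
For contrast, here is a quick sketch of grouping on two plain columns, where every group key necessarily comes from an actual row, so no empty group can appear:

# Every (foo, bar) key occurs in at least one row, so no group is empty.
df = pd.DataFrame({'foo': [1, 1, 3, 3], 'bar': ['A', 'B', 'A', 'B']})
df.groupby(['foo', 'bar']).size()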

@jreback
Contributor

jreback commented Nov 23, 2018

I think this is specific to the logic in .size(). Investigation welcome.

In [8]: df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar']).count()
Out[8]: 
            foo
foo    bar     
(0, 1] A    1.0
       B    1.0
(1, 2] A    NaN
       B    NaN
(2, 3] A    1.0
       B    1.0

In [9]: df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar']).size()
Out[9]: 
foo     bar
(0, 1]  A      1
        B      1
(2, 3]  A      1
        B      1
dtype: int64

@WillAyd added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) label and removed the Needs Info (Clarification about behavior needed to assess issue) label Nov 23, 2018
@mojones
Contributor Author

mojones commented Sep 27, 2019

Coming back to this issue, as I recently ran into another example in real-life data. Another way of looking at the problem is as a difference between calling count on the groups and then taking a single column, which gives the expected result:

df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar']).count()['foo']

# output
foo     bar
(0, 1]  A      1.0
        B      1.0
(1, 2]  A      NaN
        B      NaN
(2, 3]  A      1.0
        B      1.0
Name: foo, dtype: float64

and taking the column first and then calling count:

df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar'])['foo'].count()

# output 
foo     bar
(0, 1]  A      1
        B      1
(2, 3]  A      1
        B      1
Name: foo, dtype: int64

The latter gives output where the (1, 2] bin is missing. So maybe it makes sense to think of this as a difference in behaviour between the count methods of DataFrameGroupBy and SeriesGroupBy objects?

I notice that groupby creates the same set of groups in each case:

[x for x,y in df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar'])]

# output
[(Interval(0, 1, closed='right'), 'A'),
 (Interval(0, 1, closed='right'), 'B'),
 (Interval(2, 3, closed='right'), 'A'),
 (Interval(2, 3, closed='right'), 'B')]

[x for x,y in df.groupby([pd.cut(df['foo'], bins=[0,1,2,3]), 'bar'])['foo']]

# output
[(Interval(0, 1, closed='right'), 'A'),
 (Interval(0, 1, closed='right'), 'B'),
 (Interval(2, 3, closed='right'), 'A'),
 (Interval(2, 3, closed='right'), 'B')]

In both cases the empty groups are missing. For my use case, the ideal solution might be an include_empty_groups boolean argument to groupby, which would presumably take care of all three scenarios (calling size or count on either a DataFrameGroupBy or a SeriesGroupBy, or iterating over the groups explicitly). Although that's based on absolutely no knowledge of the code base :-) If anyone has a quick pointer to where in the code this is implemented, I wouldn't mind taking a look.
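
As an aside, groupby already takes an observed keyword for categorical groupers, which sounds like it should control exactly this. Assuming it behaves as documented, the minimal example would look something like the sketch below, although size() doesn't seem to honour it at the moment:

# Assumes observed=False keeps unobserved categorical groups;
# observed=True would drop them.
df = pd.DataFrame({'foo': [1, 1, 3, 3], 'bar': ['A', 'B', 'A', 'B']})
df.groupby([pd.cut(df['foo'], bins=[0, 1, 2, 3]), 'bar'], observed=False).size()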

@mojones
Contributor Author

mojones commented Mar 20, 2020

This seems to have been fixed between 0.25.0 and 1.0.0. I'm not sure whether it's appropriate to close this in case it regresses later, but please do so if it is.

@simonjayhawkins
Member

This seems to have been fixed between 0.25.0 and 1.0.0.

master includes the (1, 2] bin:

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1272.g3d4f9dc19'
>>>
>>> df = pd.DataFrame({"foo": [1, 1, 3, 3], "bar": ["A", "B", "A", "B"]})
>>>
>>> # gives zero count for the (1, 2] bin as expected
>>> df.groupby(pd.cut(df["foo"], bins=[0, 1, 2, 3])).size()
foo
(0, 1]    2
(1, 2]    0
(2, 3]    2
dtype: int64
>>>
>>> # the (1,2] bin is not shown
>>> df.groupby([pd.cut(df["foo"], bins=[0, 1, 2, 3]), "bar"]).size()
foo     bar
(0, 1]  A      1
        B      1
(1, 2]  A      0
        B      0
(2, 3]  A      1
        B      1
dtype: int64
>>>

@simonjayhawkins added the good first issue and Needs Tests (Unit test(s) needed to prevent regressions) labels and removed the Groupby and Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) labels Apr 16, 2020
@simonjayhawkins
Member

Can confirm (using the code sample in #23865 (comment)) that this issue was resolved in #29690.

@smithto1
Member

take
