BUG: categorical column dropped from groupby agg result when as_index=False #8770

Closed
aimboden opened this issue Nov 10, 2014 · 7 comments
Labels
Bug Categorical Categorical Data Type Groupby

Comments

@aimboden

Hello everyone, updating to pandas 0.15.1 breaks my code because of the following behaviour. When performing an aggregation on a groupby object whose groupers include a categorical, with as_index=False, the categorical column is now dropped from the aggregation result. This may be an unintended consequence of #8585, and it certainly looks like a bug to me.

import pandas as pd
import numpy as np

d = {'Foo': [1, 2, 3, 4, 5, 6, 11],
     'Bar': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB',
             'CC'],
     'Baz': [10, 20, 30, 40, 50, 60, 70]}

df = pd.DataFrame(d)
ls = np.linspace(0, 30, 7)
groups = df.groupby([pd.cut(df['Foo'], ls), 'Bar'], as_index=False)

df2 = groups['Baz'].agg({'Mean Baz': 'mean'})

produces:

>>> pd.__version__
'0.15.1'
>>> df2
  Bar  Mean Baz
0  AA        20
1  BB        45
2  BB        60
3  CC        70

, whereas I expect the following:

>>> pd.__version__
'0.15.0'
        Foo Bar  Mean Baz
0    (0, 5]  AA        20
1    (0, 5]  BB        45
2   (5, 10]  BB        60
3  (10, 15]  CC        70
@aimboden aimboden changed the title BUG: Categorical column dropped from groupby agg result when as_index=False BUG: categorical column dropped from groupby agg result when as_index=False Nov 10, 2014
@jreback jreback added Bug Categorical Categorical Data Type Groupby labels Nov 10, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 10, 2014
@jreback
Contributor

jreback commented Nov 10, 2014

cc @behzadnouri

I marked this as a bug. But on second thought I think this is correct and the prior behavior was wrong.

You are grouping with 2 anonymous arrays (well, a Series and an array), neither of which exists in the grouped object. By definition, groupby makes the groupers into the result index. But since you are dropping that index (with as_index=False), they are removed.

You can add these groupers as columns to preserve them, or use as_index=True (and then reset_index).

@behzadnouri
Contributor

yes, this was changed by #8585 in order to have

as_index=False is effectively “SQL-style” grouped output

As @jreback mentioned, to get the old behaviour you can either use as_index=True followed by reset_index, or add the groupers as new columns to the frame and group by those columns:

In [19]: groups = df.groupby([pd.cut(df['Foo'], ls), 'Bar'], as_index=True)

In [20]: groups['Baz'].agg({'Mean Baz': 'mean'}).reset_index()
Out[20]: 
        Foo Bar  Mean Baz
0    (0, 5]  AA        20
1    (0, 5]  BB        45
2   (5, 10]  BB        60
3  (10, 15]  CC        70

or,

In [21]: df['Foo.1'] = pd.cut(df['Foo'], ls)

In [22]: groups = df.groupby(['Foo.1', 'Bar'], as_index=False)

In [23]: groups['Baz'].agg({'Mean Baz': 'mean'})
Out[23]: 
      Foo.1 Bar  Mean Baz
0    (0, 5]  AA        20
1    (0, 5]  BB        45
2   (5, 10]  BB        60
3  (10, 15]  CC        70
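
On later pandas releases the dict-renaming form of agg used above is no longer accepted, so the first workaround may need a slightly different spelling. The following is only a minimal sketch under that assumption: it relies on the `observed` argument to groupby (available from pandas 0.23), and the names `binned` and `result` are illustrative, not from this thread. Interval labels may print differently (e.g. as floats) depending on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Foo': [1, 2, 3, 4, 5, 6, 11],
    'Bar': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'CC'],
    'Baz': [10, 20, 30, 40, 50, 60, 70],
})
ls = np.linspace(0, 30, 7)

# Categorical grouper; pd.cut on a Series keeps the Series name ('Foo').
binned = pd.cut(df['Foo'], ls)

# Keep the groupers in the index (the default as_index=True), aggregate,
# then move them back into ordinary columns with reset_index().
# observed=True restricts the result to bin/Bar combinations that occur.
result = (
    df.groupby([binned, 'Bar'], observed=True)['Baz']
      .mean()
      .rename('Mean Baz')
      .reset_index()
)

print(result)
```

Because the groupers travel through the index rather than relying on as_index=False, both 'Foo' and 'Bar' should end up as columns regardless of how a given version treats groupers that are not columns of the frame.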

@aimboden
Author

@jreback

If you think the previous behavior was wrong, I can live with that using reset_index.

However, I still think that it is inconsistent. Why would the categorical that we are grouping with be dropped, while the Series that is also used as a grouper ('Bar') is kept?

If we follow through with your logic, then shouldn't the result simply be

Mean Baz
0 20
1 45
2 60
3 70

, since both groupers would then be dropped from the resulting DataFrame?

@jreback
Contributor

jreback commented Nov 10, 2014

@Gimli510 no, the 'Bar' is added back, as a named grouper is not excluded. I could see that this is confusing. @behzadnouri ?

@aimboden
Author

@behzadnouri
Thanks for explaining the rationale behind this change, which makes sense regarding the issues encountered. Maybe there is no better way? I still feel that this should be documented.

Maybe the API changes section of the release notes should be updated?
It states that

groupby with as_index=False will not add erroneous extra columns to result

I couldn't find any hint that the addition of the grouper to the resulting DataFrame was an erroneous extra column. The "SQL-Style" grouped output of the documentation is also not very telling to the mere user that I am :/

@aimboden
Author

@jreback but the grouper does have a name in this case, doesn't it?

>>> grouper = pd.cut(df['Foo'], ls)
>>> grouper
0      (0, 5]
1      (0, 5]
2      (0, 5]
3      (0, 5]
4      (0, 5]
5     (5, 10]
6    (10, 15]
Name: Foo, dtype: category
Categories (6, object): [(0, 5] < (5, 10] < (10, 15] < (15, 20] < (20, 25] < (25, 30]]

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Dec 4, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@wesm
Member

wesm commented Jul 6, 2018

Closing. I welcome anyone who wishes to revisit this with pandas >= 0.23.2

@wesm wesm closed this as completed Jul 6, 2018
@wesm wesm added the Won't Fix label Jul 6, 2018