DataFrame groupby with categoricals and aggregation with pd.DataFrame.sum with skipna leads to wrong column name #28787

kasparthommen · 2019-10-04T13:32:37Z

Problem description

Consider the following data frame:

import pandas as pd

df = pd.DataFrame(data=(('Bob', 2), ('Greg', None), ('Greg', 6)), columns=['Name', 'Items'])

   Name  Items
0   Bob    2.0
1  Greg    NaN
2  Greg    6.0

Now I want to group by Name and sum the Items, but I want the sum to be NaN if there are NaN elements. Due to a bug in pandas (#20824) I cannot simply do

df.groupby('Name', observed=True).sum(skipna=False).reset_index()

because that results in:

   Name  Items
0   Bob    2.0
1  Greg    6.0

which is wrong because it's skipping the NaN for Greg even though it shouldn't (hence the bug). Thus I'm using the following workaround to get the correct result:

df.groupby('Name', observed=True).agg(pd.DataFrame.sum, skipna=False).reset_index()

which results in the expected:

   Name  Items
0   Bob    2.0
1  Greg    NaN
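
For reference, a per-group sum that propagates NaN can also be sketched without `agg(pd.DataFrame.sum, ...)`, by calling `Series.sum(skipna=False)` on each group. This is only an illustration under my own assumptions: the helper name `group_sum_keep_nan` is made up here, and I have not checked how this path behaves on 0.25.1 with a categorical key.

```python
import pandas as pd

def group_sum_keep_nan(frame, key, value):
    """Hypothetical helper: sum `value` per `key`, giving NaN if any member is NaN."""
    summed = (
        frame.groupby(key, observed=True)[value]
        # Series.sum(skipna=False) returns NaN as soon as the group contains a NaN
        .apply(lambda s: s.sum(skipna=False))
    )
    # The group keys live in the index; move them back into a regular column
    return summed.reset_index()

df = pd.DataFrame(data=(('Bob', 2), ('Greg', None), ('Greg', 6)), columns=['Name', 'Items'])
result = group_sum_keep_nan(df, 'Name', 'Items')
print(result)
```

Going through a single column with `apply` keeps the NaN handling in plain `Series.sum`, at the cost of being slower than the built-in groupby reductions on large frames.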

However, if we change the Name column to categorical then the resulting column names are wrong:

df_cat = df.copy()
df_cat['Name'] = df_cat['Name'].astype('category')
df_cat.groupby('Name', observed=True).agg(pd.DataFrame.sum, skipna=False).reset_index()

which prints:

  index  Items
0   Bob    2.0
1  Greg    NaN

As you can see, the column that should be labelled Name is now called index.
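
A possible stopgap is to restore the index name explicitly before calling `reset_index()`. The sketch below assumes the aggregated frame simply comes back with `index.name` set to `None`; `reset_index_named` is a hypothetical helper, and the `lost` frame only imitates the buggy result rather than reproducing it.

```python
import pandas as pd

def reset_index_named(aggregated, name):
    """Hypothetical helper: reset the index, restoring its name first if it is missing."""
    if aggregated.index.name is None:
        # rename_axis sets the index name, so reset_index creates a column called `name`
        aggregated = aggregated.rename_axis(name)
    return aggregated.reset_index()

# Imitation of the buggy aggregation result: a categorical index without a name
lost = pd.DataFrame({'Items': [2.0, float('nan')]},
                    index=pd.CategoricalIndex(['Bob', 'Greg']))
fixed = reset_index_named(lost, 'Name')

print(list(lost.reset_index().columns))   # column names without the stopgap
print(list(fixed.columns))                # column names with the stopgap
```

Because the helper only renames when the name is missing, it should be harmless on versions where the name is already preserved.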

Expected Output

The same as the non-categorical version, i.e.:

   Name  Items
0   Bob    2.0
1  Greg    NaN

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 7
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.3
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.2.5
html5lib : None
pymysql : None
psycopg2 : 2.8.2 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.2.5
matplotlib : 3.0.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.3
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None

topper-123 · 2019-10-05T16:13:20Z

Yeah, this is a bug. Notice that it's the index that should inherit the name, but doesn't:

>>> df_cat.groupby('Name', observed=True).sum(skipna=False)
      Items
Name  # this is missing
Bob     2.0
Greg    6.0
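
One way to check this at the index level is to compare the index name of the grouped result for an object key and a categorical key. This is a rough sketch, not the test from the fix: `nan_sum` is a name invented here, it goes through the single-column `agg` path rather than the exact call from the report, and on a release that includes the fix I would expect both names to be `Name`.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Bob', 'Greg', 'Greg'], 'Items': [2.0, None, 6.0]})
df_cat = df.assign(Name=df['Name'].astype('category'))

def nan_sum(s):
    # Hypothetical aggregation function: NaN-propagating sum of one group
    return s.sum(skipna=False)

plain = df.groupby('Name', observed=True)['Items'].agg(nan_sum)
cat = df_cat.groupby('Name', observed=True)['Items'].agg(nan_sum)

# The grouping key should become the name of the result's index in both cases
index_names = (plain.index.name, cat.index.name)
print(index_names)
```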

kasparthommen · 2019-10-07T07:51:44Z

Wow, that was quick! Thanks for fixing :-)

kasparthommen · 2019-10-07T07:52:46Z

Fixed by #28798

jreback · 2019-10-07T08:00:56Z

@kasparthommen this has not been merged yet; will be auto closed at that time

kasparthommen · 2019-10-07T10:07:52Z

@jreback oops, of course, sorry!

dsaxton mentioned this issue Oct 5, 2019

BUG: Keep categorical name in groupby #28798

Merged

jreback added the Bug, Categorical, and Groupby labels Oct 5, 2019

jreback added this to the 1.0 milestone Oct 5, 2019

kasparthommen closed this as completed Oct 7, 2019

jreback reopened this Oct 7, 2019

topper-123 closed this as completed in #28798 Oct 7, 2019
