REGR: 1.3 invalid exclusion of nuisance columns with groupby aggregation #43380

joseph-wakeling-frequenz · 2021-09-03T11:57:04Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

Adapted from "Automatic exclusion of nuisance columns" in the User Guide "Group by" docs:

from decimal import Decimal

import pandas as pd


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

print('\ndf_dec.groupby(["id"])[["dec_column"]].sum()')
print('According to docs this should sum correctly')
print('It works for pandas 1.2.x but generates an empty dataframe for 1.3.x')
print(df_dec.groupby(["id"])[["dec_column"]].sum())

print('\ndf_dec.groupby(["id"])[["int_column", "dec_column"]].sum()')
print('This drops `dec_column` as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"])[["int_column", "dec_column"]].sum())

print('\ndf_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})')
print('This aggregates everything correctly as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"}))

Problem description

The User Guide "Group by" docs provide a code example showing when nuisance columns are excluded from aggregation. According to these docs, the case:

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce a valid aggregation, but for pandas >= 1.3.0 it results in an empty dataframe.

The impact of the regression can even be seen in the published docs. If we look at an archive.org 2021-02-25 snapshot of the "Automatic exclusion of nuisance columns" section, we can see that the example produces correct output (see Out[170] in the code example):
https://web.archive.org/web/20210225195813/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

By contrast the 2021-08-24 snapshot displays an empty dataframe for the Out[170] example:
https://web.archive.org/web/20210824151314/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

However, note that the docs in both cases indicate that the example should produce a correct aggregation.
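For anyone who needs to detect the problem or sidestep it on an affected release, here is a minimal sketch based only on the behaviour reported above. The names `regressed` and `workaround` are illustrative and not part of pandas; the workaround is simply the explicit `agg` mapping from the code sample, restricted to `dec_column`, which was reported to aggregate correctly on both 1.2.x and 1.3.x.

```python
from decimal import Decimal

import pandas as pd

df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

# The call from the User Guide. On affected releases the Decimal column is
# silently dropped, leaving a frame with no columns.
result = df_dec.groupby(["id"])[["dec_column"]].sum()
regressed = "dec_column" not in result.columns
print("dec_column dropped:", regressed)

# Assumed workaround: name the column and the aggregation explicitly.
workaround = df_dec.groupby(["id"]).agg({"dec_column": "sum"})
print(workaround)
```

Whether `regressed` is true depends on the installed pandas version, so the sketch only reports it rather than asserting a value.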

Expected Output

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce the result:

   dec_column
id           
1        0.75
2        0.55

For pandas 1.2.x it does so, as expected. For pandas >= 1.3.0 it instead produces the incorrect output:

Empty DataFrame
Columns: []
Index: [1, 2]
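The expected values can be cross-checked without pandas. The following is a small standard-library sketch that groups the same `id` and `dec_column` data by hand; the variable names are illustrative only.

```python
from collections import defaultdict
from decimal import Decimal

ids = [1, 2, 1, 2]
dec_column = [
    Decimal("0.50"),
    Decimal("0.15"),
    Decimal("0.25"),
    Decimal("0.40"),
]

# Decimal() is Decimal("0"), so each group starts from an exact zero.
expected = defaultdict(Decimal)
for key, value in zip(ids, dec_column):
    expected[key] += value

print(dict(expected))  # {1: Decimal('0.75'), 2: Decimal('0.55')}
```

These are the same per-group sums shown in the expected output: 0.75 for `id` 1 and 0.55 for `id` 2.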


simonjayhawkins · 2021-09-03T12:54:24Z

Thanks @joseph-wakeling-frequenz for the report. duplicate of #42395?

joseph-wakeling-frequenz · 2021-09-03T12:57:48Z

Looks like it could be -- sorry for not spotting that. However I think highlighting the discrepancy in the User Guide "Group by" docs may be new, and should be resolved as part of that work?

simonjayhawkins · 2021-09-03T13:00:27Z

> However I think highlighting the discrepancy in the User Guide "Group by" docs may be new, and should be resolved as part of that work?

Thanks for spotting that.

The docs are automatically generated so once the regression is fixed, the doc issue should also be fixed.

simonjayhawkins · 2021-09-03T13:03:03Z

I'll leave this open til the docs are fixed.

simonjayhawkins · 2021-09-11T11:02:28Z

Fixed in #43154, and the docs now render as before: https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html#other-useful-features (the stable docs will be updated with the 1.3.3 release).

joseph-wakeling-frequenz · 2021-09-13T12:31:25Z

Fantastic! Thank you @simonjayhawkins and to the contributors and reviewers involved.

joseph-wakeling-frequenz added the Bug and Needs Triage labels Sep 3, 2021

joseph-wakeling-frequenz changed the title ~~REGR: Invalid exclusion of nuisance columns with groupby aggregation~~ REGR: 1.3 invalid exclusion of nuisance columns with groupby aggregation Sep 3, 2021

simonjayhawkins added this to the 1.3.3 milestone Sep 3, 2021

simonjayhawkins added the Docs and Regression labels Sep 3, 2021

simonjayhawkins closed this as completed Sep 11, 2021
