REGR: 1.3 invalid exclusion of nuisance columns with groupby aggregation #43380

Closed
2 of 3 tasks
joseph-wakeling-frequenz opened this issue Sep 3, 2021 · 6 comments
Labels
Bug, Docs, Duplicate Report, Groupby, Nuisance Columns, Reduction Operations, Regression

@joseph-wakeling-frequenz
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Adapted from "Automatic exclusion of nuisance columns" in the User Guide "Group by" docs:

from decimal import Decimal

import pandas as pd


df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

print('\ndf_dec.groupby(["id"])[["dec_column"]].sum()')
print('According to docs this should sum correctly')
print('It works for pandas 1.2.x but generates an empty dataframe for 1.3.x')
print(df_dec.groupby(["id"])[["dec_column"]].sum())

print('\ndf_dec.groupby(["id"])[["int_column", "dec_column"]].sum()')
print('This drops `dec_column` as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"])[["int_column", "dec_column"]].sum())

print('\ndf_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})')
print('This aggregates everything correctly as expected for both 1.2.x and 1.3.x')
print(df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"}))

Problem description

The User Guide "Group by" docs provide a code example showing when nuisance columns are excluded from aggregation. According to those docs, the case:

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce a valid aggregation, but for pandas >= 1.3.0 it results in an empty dataframe.

The impact of the regression can even be seen in the published docs. If we look at an archive.org 2021-02-25 snapshot of the "Automatic exclusion of nuisance columns" section, we can see that the example produces correct output (see Out[170] in the code example):
https://web.archive.org/web/20210225195813/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

By contrast the 2021-08-24 snapshot displays an empty dataframe for the Out[170] example:
https://web.archive.org/web/20210824151314/https://pandas.pydata.org/docs/user_guide/groupby.html#automatic-exclusion-of-nuisance-columns

However, note that the docs in both cases indicate that the example should produce a correct aggregation.

Expected Output

df_dec.groupby(["id"])[["dec_column"]].sum()

should produce the result:

   dec_column
id           
1        0.75
2        0.55

For pandas 1.2.x it does so as expected. For pandas >= 1.3.0 it instead produces the incorrect result:

Empty DataFrame
Columns: []
Index: [1, 2]
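
As a possible stopgap on affected versions, the third case from the code sample above (naming each column explicitly in `agg`) is the form reported here to aggregate correctly on both 1.2.x and 1.3.x. A minimal sketch, reusing the same data; it is an illustration of that case, not an officially recommended workaround:

```python
from decimal import Decimal

import pandas as pd

df_dec = pd.DataFrame(
    {
        "id": [1, 2, 1, 2],
        "int_column": [1, 2, 3, 4],
        "dec_column": [
            Decimal("0.50"),
            Decimal("0.15"),
            Decimal("0.25"),
            Decimal("0.40"),
        ],
    }
)

# Request the aggregation per column instead of relying on the
# column-selection form that returns an empty frame on 1.3.0-1.3.2.
result = df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})

# Sanity check: the Decimal column should survive the aggregation.
if "dec_column" not in result.columns:
    raise RuntimeError("dec_column was dropped from the aggregation")

print(result)
```

The explicit check at the end is only there so that a silently dropped column fails loudly rather than producing an empty frame downstream.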
@joseph-wakeling-frequenz added the Bug and Needs Triage labels on Sep 3, 2021
@joseph-wakeling-frequenz changed the title from "REGR: Invalid exclusion of nuisance columns with groupby aggregation" to "REGR: 1.3 invalid exclusion of nuisance columns with groupby aggregation" on Sep 3, 2021
@simonjayhawkins
Member

Thanks @joseph-wakeling-frequenz for the report. duplicate of #42395?

@simonjayhawkins added the Duplicate Report, Groupby, Nuisance Columns, and Reduction Operations labels and removed the Needs Triage label on Sep 3, 2021
@joseph-wakeling-frequenz
Author

Looks like it could be -- sorry for not spotting that. However I think highlighting the discrepancy in the User Guide "Group by" docs may be new, and should be resolved as part of that work?

@simonjayhawkins
Member

However I think highlighting the discrepancy in the User Guide "Group by" docs may be new, and should be resolved as part of that work?

Thanks for spotting that.

The docs are automatically generated so once the regression is fixed, the doc issue should also be fixed.

@simonjayhawkins added this to the 1.3.3 milestone on Sep 3, 2021
@simonjayhawkins added the Docs and Regression labels on Sep 3, 2021
@simonjayhawkins
Member

I'll leave this open til the docs are fixed.

@simonjayhawkins
Member

Fixed in #43154, and the docs now render as before: https://pandas.pydata.org/pandas-docs/dev/user_guide/groupby.html#other-useful-features (the stable docs will be updated with the 1.3.3 release).
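
For code that has to run across several pandas versions, a small guard can flag the releases this thread describes as affected. The sketch below assumes the regression is present from 1.3.0 and absent from 1.3.3 onward, based only on the milestone set on this issue; `parse_release` and `in_affected_range` are hypothetical helper names, not pandas APIs.

```python
def parse_release(version: str) -> tuple:
    """Return the leading numeric (major, minor, patch) parts of a version string."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as "rc1"
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)  # pad short versions like "1.3"
    return tuple(parts)


def in_affected_range(version: str) -> bool:
    """True for versions assumed to carry the regression.

    Assumption: broken from 1.3.0, fixed in 1.3.3 (the milestone on this issue).
    """
    return (1, 3, 0) <= parse_release(version) < (1, 3, 3)
```

It would typically be called as `in_affected_range(pd.__version__)`, falling back to an explicit per-column `agg` when it returns `True`.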

@joseph-wakeling-frequenz
Author

Fantastic! Thank you @simonjayhawkins and to the contributors and reviewers involved.
