API/DOC: clean up DataFrame.groupby.apply #22545

Open
@h-vetinari

I work with df.groupby.apply() very often, and many aspects of its output are confusing (and sometimes wrong), particularly what happens to the index of the result. v0.23 cleaned up large parts of the apply API, but a lot remains...

Ideally, the documentation would contain a sort of matrix (not necessarily in the following form), and the API would implement it, along the following lines:

For as_index=True:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |    Series     |    groupby-columns    |      n/a       |  none
Series            |   DataFrame   |    groupby-columns    |     dropped    |  index (union) of Series
DataFrame         |   DataFrame   |   gb-cols + df.index  |     dropped    |  columns (union) of DFs
np.ndarray 1-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
np.ndarray 2-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
Index             |  MultiIndex?  |   gb-cols + output    |      n/a       |  n/a
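
To make the first three rows concrete, here is a minimal sketch of calls that return a scalar, a Series and a DataFrame per group. The frame and the names `key`, `val`, `total`, `n` and `lo_hi` are invented for illustration, and the exact output can differ between pandas versions (recent versions may also warn about how the grouping column is handled), so treat this as a probe rather than a specification:

```python
import pandas as pd

# Hypothetical toy frame; "key" and "val" are illustrative names only.
df = pd.DataFrame({"key": ["x", "y", "x", "y"], "val": [1, 2, 3, 4]})
gb = df.groupby("key")  # as_index=True is the default

# Scalar per group: the result is a Series indexed by the groupby column.
scalar_out = gb.apply(lambda g: g["val"].sum())

# Series per group, same index each time: the Series index becomes the columns.
series_out = gb.apply(lambda g: pd.Series({"total": g["val"].sum(), "n": len(g)}))

# DataFrame per group with its own index: groupby column + per-group index.
frame_out = gb.apply(
    lambda g: pd.DataFrame({"lo_hi": [g["val"].min(), g["val"].max()]})
)

for label, out in [("scalar", scalar_out), ("Series", series_out), ("DataFrame", frame_out)]:
    cols = list(out.columns) if isinstance(out, pd.DataFrame) else None
    print(label, type(out).__name__, list(out.index.names), cols)
```

The printed triples (result type, index level names, columns) correspond to the "result type", "(multi-)index levels" and "columns" headers of the table.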

For as_index=False:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |   DataFrame?  |      RangeIndex       |      n/a       |  gb-cols + output?
Series            |   DataFrame   |      RangeIndex       |      kept      |  gb-cols + index of Series?
DataFrame         |   DataFrame   |  to discuss / raise?  |      kept      |  gb-cols + columns of DFs
np.ndarray 1-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
np.ndarray 2-dim  |   DataFrame   |  to discuss / raise?  |      n/a       |  to discuss / raise?
Index             |    Series?    |  to discuss / raise?  |      n/a       |  n/a

Currently, the behaviour is much more complicated, inconsistent and in places wrong. I'm trying to fill in the corresponding tables with the current behaviour and some issue xrefs, but they are far from complete:

For as_index=True:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |    Series     |    groupby-columns    |      n/a       |  none
Series (same idx) |   DataFrame   |    groupby-columns    |     kept?!     |  index of Series
Series (diff idx) |    Series?!   |  gb-cols + output.idx |      n/a       |  none?!
group as-is       |   DataFrame   |    original index?!   |     kept?!     |  original columns
group selection   |   DataFrame   |  gb-cols + output.idx |     kept?!     |  original columns
DataFrame         |   DataFrame   |  gb-cols + output.idx |      n/a       |  columns (union) of DFs
np.ndarray 1-dim  |    Series?!   |   groupby-columns     |      n/a       |  none
np.ndarray 2-dim  |    Series?!   |   groupby-columns     |      n/a       |  none
Index             |    Series?!   |   groupby-columns     |      n/a       |  none #22541

For as_index=False:

function output   |  result type  |  (multi-)index levels |  groupby-cols  |  columns
--------------------------------------------------------------------------------------------
scalar            |    Series     |      RangeIndex       |      n/a       |  none
Series (same idx) |   DataFrame   |      RangeIndex       |     kept       |  index of Series
Series (diff idx) |    Series?!   | RngIdx + output.idx?! |      n/a       |  none?!
group as-is       |   DataFrame   |    original index?!   |     kept       |  original columns
group selection   |   DataFrame   | RngIdx + output.idx?! |     kept       |  original columns
DataFrame         |   DataFrame   | RngIdx + output.idx?! |      n/a       |  columns (union) of DFs
np.ndarray 1-dim  |    Series?!   |      RangeIndex       |      n/a       |  none
np.ndarray 2-dim  |    Series?!   |      RangeIndex       |      n/a       |  none
Index             |    Series?!   |      RangeIndex       |      n/a       |  none #22541
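
Since these tables depend on the pandas version, one way to regenerate them is a small probe like the sketch below. The helper `describe`, the toy frame and the selection of functions are my own illustration and not part of pandas; any case that raises is recorded by exception type instead of aborting the run:

```python
import pandas as pd

def describe(result):
    """Summarise the aspects tracked in the tables above (hypothetical helper)."""
    return {
        "type": type(result).__name__,
        "index": type(result.index).__name__,
        "index_names": list(result.index.names),
        "columns": list(result.columns) if isinstance(result, pd.DataFrame) else None,
    }

# Hypothetical toy frame; "key" and "val" are illustrative names only.
df = pd.DataFrame({"key": ["x", "y", "x", "y"], "val": [1, 2, 3, 4]})

# One function per table row being probed (a subset of the rows above).
funcs = {
    "scalar": lambda g: g["val"].sum(),
    "Series (same idx)": lambda g: pd.Series({"lo": g["val"].min(), "hi": g["val"].max()}),
    "group as-is": lambda g: g,
    "np.ndarray 1-dim": lambda g: g["val"].to_numpy(),
    "Index": lambda g: pd.Index(g["val"]),
}

report = {}
for as_index in (True, False):
    for label, func in funcs.items():
        try:
            out = df.groupby("key", as_index=as_index).apply(func)
            report[(as_index, label)] = describe(out)
        except Exception as exc:  # record the failure instead of stopping
            report[(as_index, label)] = {"error": type(exc).__name__}

for (as_index, label), summary in report.items():
    print(f"as_index={as_index!s:5} | {label:18} | {summary}")
```

Running this on two pandas versions and diffing the output shows which cells of the tables have changed.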

Some xrefs: #20420, #22541, #22542, #22546
