I'm very often working with df.groupby.apply(), and there are many confusing (sometimes wrong) aspects about the behaviour of the output, particularly regarding what happens with the index of the output. v.0.23 cleaned up big parts of the apply API, but there's still a lot left...
Ideally, there would be a sort of matrix (not necessarily in the following form) in the documentation, and implemented by the API, along the following lines.
For as_index=True:
function output  | result type | (multi-)index levels | groupby-cols | columns
--------------------------------------------------------------------------------------------
scalar           | Series      | groupby-columns      | n/a          | none
Series           | DataFrame   | groupby-columns      | dropped      | index (union) of Series
DataFrame        | DataFrame   | gb-cols + df.index   | dropped      | columns (union) of DFs
np.ndarray 1-dim | DataFrame   | to discuss / raise?  | n/a          | to discuss / raise?
np.ndarray 2-dim | DataFrame   | to discuss / raise?  | n/a          | to discuss / raise?
Index            | MultiIndex? | gb-cols + output     | n/a          | n/a
For as_index=False:
function output  | result type | (multi-)index levels | groupby-cols | columns
--------------------------------------------------------------------------------------------
scalar           | DataFrame?  | RangeIndex           | n/a          | gb-cols + output?
Series           | DataFrame   | RangeIndex           | kept         | gb-cols + index of Series?
DataFrame        | DataFrame   | to discuss / raise?  | kept         | gb-cols + columns of DFs
np.ndarray 1-dim | DataFrame   | to discuss / raise?  | n/a          | to discuss / raise?
np.ndarray 2-dim | DataFrame   | to discuss / raise?  | n/a          | to discuss / raise?
Index            | Series?     | to discuss / raise?  | n/a          | n/a
Currently, the behaviour is much, much more complicated / inconsistent / wrong. I'm trying to fill in corresponding tables with the current behaviour and some issue cross-references, but they are far from complete:
For as_index=True:
function output   | result type | (multi-)index levels | groupby-cols | columns
--------------------------------------------------------------------------------------------
scalar            | Series      | groupby-columns      | n/a          | none
Series (same idx) | DataFrame   | groupby-columns      | kept?!       | index of Series
Series (diff idx) | Series?!    | gb-cols + output.idx | n/a          | none?!
group as-is       | DataFrame   | original index?!     | kept?!       | original columns
group selection   | DataFrame   | gb-cols + output.idx | kept?!       | original columns
DataFrame         | DataFrame   | gb-cols + output.idx | n/a          | columns (union) of DFs
np.ndarray 1-dim  | Series?!    | groupby-columns      | n/a          | none
np.ndarray 2-dim  | Series?!    | groupby-columns      | n/a          | none
Index             | Series?!    | groupby-columns      | n/a          | none #22541
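A few of these rows can be checked against whatever pandas version is installed. The following is only a minimal sketch: the toy frame and its column names (key, val) are made up for illustration, it covers just the scalar, Series (same idx) and DataFrame rows, and the details (including any deprecation warnings about the grouping column) may differ between releases.

```python
import pandas as pd

# Toy data; the column names "key" and "val" are hypothetical.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
gb = df.groupby("key")  # as_index=True is the default

# Function returns a scalar per group.
scalar_out = gb.apply(lambda g: g["val"].sum())

# Function returns a Series with the same index for every group.
series_out = gb.apply(
    lambda g: pd.Series({"total": g["val"].sum(), "count": len(g)})
)

# Function returns a DataFrame whose index differs from the group's own index.
frame_out = gb.apply(
    lambda g: pd.DataFrame({"lo": [g["val"].min()], "hi": [g["val"].max()]})
)

# Print one line per case: result type, index level names, columns (if any).
for label, out in [("scalar", scalar_out),
                   ("Series (same idx)", series_out),
                   ("DataFrame", frame_out)]:
    columns = list(out.columns) if isinstance(out, pd.DataFrame) else None
    print(label, "|", type(out).__name__, "|", list(out.index.names), "|", columns)
```

The printed lines can be compared directly with the result type, index levels and columns entries of the table.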
For as_index=False:
function output   | result type | (multi-)index levels  | groupby-cols | columns
--------------------------------------------------------------------------------------------
scalar            | Series      | RangeIndex            | n/a          | none
Series (same idx) | DataFrame   | RangeIndex            | kept         | index of Series
Series (diff idx) | Series?!    | RngIdx + output.idx?! | n/a          | none?!
group as-is       | DataFrame   | original index?!      | kept         | original columns
group selection   | DataFrame   | RngIdx + output.idx?! | kept         | original columns
DataFrame         | DataFrame   | RngIdx + output.idx?! | n/a          | columns (union) of DFs
np.ndarray 1-dim  | Series?!    | RangeIndex            | n/a          | none
np.ndarray 2-dim  | Series?!    | RangeIndex            | n/a          | none
Index             | Series?!    | RangeIndex            | n/a          | none #22541
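Since the as_index=False results in particular may vary by release, one way to regenerate such a table is to loop over one function per row and record what comes back. This is a rough sketch under assumptions: the toy frame, the column names (key, val), the per-row functions and the summarize helper are all hypothetical, and a row that raises on a given version is recorded as such rather than treated as an error.

```python
import numpy as np
import pandas as pd

# Toy data; "key" and "val" are made-up column names.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})

# One illustrative function per table row (hypothetical choices).
cases = {
    "scalar": lambda g: g["val"].sum(),
    "Series (same idx)": lambda g: pd.Series({"total": g["val"].sum()}),
    "DataFrame": lambda g: pd.DataFrame({"total": [g["val"].sum()]}),
    "np.ndarray 1-dim": lambda g: g["val"].values,
    "np.ndarray 2-dim": lambda g: g[["val"]].values,
    "Index": lambda g: pd.Index(g["val"]),
}


def summarize(result):
    """Return (result type, index type, number of index levels, columns)."""
    columns = list(result.columns) if isinstance(result, pd.DataFrame) else None
    return (type(result).__name__, type(result.index).__name__,
            result.index.nlevels, columns)


rows = {}
for label, func in cases.items():
    try:
        rows[label] = summarize(df.groupby("key", as_index=False).apply(func))
    except Exception as exc:  # some return types may raise on some versions
        rows[label] = ("raised " + type(exc).__name__,)

for label, row in rows.items():
    print(label, "|", " | ".join(str(item) for item in row))
```

Passing as_index=True instead gives the corresponding rows for the other table; the sketch makes no claim about what any particular version prints.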