BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases #42749

oldrichsmejkal · 2021-07-27T14:37:54Z

[ x ] I have checked that this issue has not already been reported.
[ x ] I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

# Prepare data, df_multi is multiple symbols
df_multi = pd.DataFrame(
    {'time': [1,2,1,2],
     'symbol': ['a','a','b','b'],
     'value1': [10, 12, 20, 21],
     'value2': [20, 22, 40, 41]
    }
).set_index(['time', 'symbol'])

df_single = df_multi[df_multi.index.get_level_values('symbol') == 'a'].copy()


res_multi = df_multi.groupby(level = 'symbol').apply(lambda x: x['value1']) # we are applying trivial function here, but goal is here can be complex calculation
print(res_multi)

res_single = df_single.groupby(level = 'symbol').apply(lambda x: x['value1'])
print(res_single)

Problem description

res_multi and res_single have inconsistent shapes.
Output in res_multi is correct, output in res_single have index level[0] in columns instead on in index.
Output seems to be consistent, after calling something like (but I am not sure how general it is..): res_single.swapaxis(level = 1).unstack()
So something like this is possible fix.
(I believe it should be handled in pandas, because otherwise all downstream code have to handle single group case separately.)

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : c7f7443
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.5-xanmod1
Version : #0~git20210725.ec5a2ac SMP PREEMPT Sun Jul 25 17:24:47 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: 0.10.0
bs4 : None
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.19
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

The text was updated successfully, but these errors were encountered:

rhshadrach · 2022-08-02T23:07:51Z

This seems to me to be an example where apply's inference cannot succeed in all cases. In such a case, I would typically suggest using groupby.transform instead as it no longer needs to infer; however we can't here because transform fails.

In this case, the fast_path of DataFrameGroupBy._choose_path succeeds but the slowpath fails. In such a case, we could go forward with the fast_path instead of raising like we currently do. This is somewhat safe as it would only allow code that previously raised an error to now work.

However, it seems to me the fast_path / slow_path makes pandas prone to inconsistencies, depending on path chosen, and hard for users to use as what object their transformer gets depends on how it behaves. Instead, could we allow user control of this by adding an argument, e.g.:

 df_multi.groupby(level = 'symbol').transform(lambda x: x['value1'], by='group')  # "fast-path"

vs

 df_multi.groupby(level = 'symbol').transform(lambda x: x['value1'], by='column')  # "slow-path"

This would at least give control, and the user would know whether the UDF provided should accept a DataFrame or Series.

bherwerth · 2022-08-05T19:17:14Z

I ran into this problem as well these days.

On the original approach with apply, I investigated in a debugger. I think the two cases take different paths after this line:

pandas/pandas/core/groupby/generic.py

Line 1109 in 7214f85

if not all_indexed_same:

The variable all_indexed_same checks whether all series produced by applying the function on the groups have the same index. When there is only one group and thus only one series, this variable is True, otherwise it is False. And when all_indexed_same is True, then groupby arranges the series in rows of the output data frame.

I doubt a fix is possible without introducing an inconsistency for the use case where all_indexed_same is intentionally True. Maybe one could check whether the single series produced on the one group has the same index as the original data frame and then continue as for not all_indexed_same. But then there could be edge cases where this is the case by accident, which would introduce another inconsistency...

On the approach of using transform, my understanding was that transform expects a function that does not modify the shape or is broadcastable to the original shape of the group's data frame. In this issue's case, the function takes a data frame and returns a series. In this case, would one then have to switch off broadcasting?

rhshadrach · 2022-08-06T13:02:58Z

my understanding was that transform expects a function that does not modify the shape or is broadcastable to the original shape of the group's data frame...

This is currently true. Also, the function must be able to act column-by-column. What I'm discussing is removing these restrictions. To me, even something like the following makes sense as a transform, and currently works with the fast-path (have to also disable reindexing the columns of the result):

def foo(x):
    return pd.concat([x['b'], x['b'].rename('y') + 2, x['c']], axis=1)

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [7, 8, 9]})
print(df.groupby('a').transform(foo))

#    b  y  c
# 0  3  5  7
# 1  4  6  8
# 2  5  7  9

In this case, would one then have to switch off broadcasting?

I don't follow why this would need to be done - can you expand on this? The key feature needed about the result from the supplied UDF is that it needs to be able to be either broadcasted or aligned to the input group's index. However thought needs to be put into the logic of whether to broadcast or align.

bherwerth · 2022-08-07T14:31:14Z

What you describe seems like a useful extension of transform to me.

I don't follow why this would need to be done - can you expand on this? The key feature needed about the result from the supplied UDF is that it needs to be able to be either broadcasted or aligned to the input group's index.

Broadcasting in the sense of aligning to the group's index totally makes sense to me. I was more thinking in the logic of the current implementation, which requires to also preserve the columns: If the group is a data frame of shape (m, n), the UDF returns a series of length m, then the series could be coerced to the original shape by interpreting it as shape (m, 1) and broadcasting to (m, n).

However, with the extension that you describe, only the index would need to be preserved and it would be okay to have different columns. Then my comment about broadcasting would not apply.

oldrichsmejkal added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 27, 2021

oldrichsmejkal changed the title ~~BUG: groupby.apply result doesnt't have consistent shape for single vs multi group cases~~ BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases Jul 27, 2021

rhshadrach added Apply Apply, Aggregate, Transform, Map Groupby labels Jul 28, 2021

mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 21, 2021

simonjayhawkins mentioned this issue Jun 10, 2022

BUG: groupby apply on single group transposes series output #45187

Closed

3 tasks

rhshadrach mentioned this issue May 12, 2023

BUG: DataFrame.groupby().apply() with single group results in wrong output shape #53163

Closed

3 tasks

This was referenced Aug 29, 2023

BUG: df.groupby().apply() behaves inconsistently when there's only one group #54723

Closed

BUG: pandas groupby.apply return type changes depending on number of unique groupkeys for same custom-function #54992

Open

rhshadrach mentioned this issue Dec 19, 2023

BUG: DataFrame groupby apply produces a wide DataFrame instead of Series if there happens to be only a single group #56562

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases #42749

BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases #42749

oldrichsmejkal commented Jul 27, 2021

INSTALLED VERSIONS

rhshadrach commented Aug 2, 2022

bherwerth commented Aug 5, 2022

rhshadrach commented Aug 6, 2022 •

edited

Loading

bherwerth commented Aug 7, 2022

BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases #42749

BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases #42749

Comments

oldrichsmejkal commented Jul 27, 2021

Code Sample, a copy-pastable example

Problem description

Output of pd.show_versions()

INSTALLED VERSIONS

rhshadrach commented Aug 2, 2022

bherwerth commented Aug 5, 2022

rhshadrach commented Aug 6, 2022 • edited Loading

bherwerth commented Aug 7, 2022

Output of `pd.show_versions()`

rhshadrach commented Aug 6, 2022 •

edited

Loading