Skip to content

BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases #42749

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
1 task
oldrichsmejkal opened this issue Jul 27, 2021 · 4 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby

Comments

@oldrichsmejkal
Copy link

  • [ x ] I have checked that this issue has not already been reported.

  • [ x ] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd

# Prepare data, df_multi is multiple symbols
df_multi = pd.DataFrame(
    {'time': [1,2,1,2],
     'symbol': ['a','a','b','b'],
     'value1': [10, 12, 20, 21],
     'value2': [20, 22, 40, 41]
    }
).set_index(['time', 'symbol'])

df_single = df_multi[df_multi.index.get_level_values('symbol') == 'a'].copy()


res_multi = df_multi.groupby(level = 'symbol').apply(lambda x: x['value1']) # we are applying trivial function here, but goal is here can be complex calculation
print(res_multi)

res_single = df_single.groupby(level = 'symbol').apply(lambda x: x['value1'])
print(res_single)

Problem description

res_multi and res_single have inconsistent shapes.
Output in res_multi is correct, output in res_single have index level[0] in columns instead on in index.
Output seems to be consistent, after calling something like (but I am not sure how general it is..): res_single.swapaxis(level = 1).unstack()
So something like this is possible fix.
(I believe it should be handled in pandas, because otherwise all downstream code have to handle single group case separately.)

Output of pd.show_versions()

INSTALLED VERSIONS

commit : c7f7443
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.13.5-xanmod1
Version : #0~git20210725.ec5a2ac SMP PREEMPT Sun Jul 25 17:24:47 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.1
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: 0.10.0
bs4 : None
bottleneck : 1.3.2
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.19
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : 0.53.1

@oldrichsmejkal oldrichsmejkal added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 27, 2021
@oldrichsmejkal oldrichsmejkal changed the title BUG: groupby.apply result doesnt't have consistent shape for single vs multi group cases BUG: groupby.apply result doesn't have consistent shape for single vs multiple group cases Jul 27, 2021
@rhshadrach rhshadrach added Apply Apply, Aggregate, Transform, Map Groupby labels Jul 28, 2021
@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 21, 2021
@rhshadrach
Copy link
Member

This seems to me to be an example where apply's inference cannot succeed in all cases. In such a case, I would typically suggest using groupby.transform instead as it no longer needs to infer; however we can't here because transform fails.

In this case, the fast_path of DataFrameGroupBy._choose_path succeeds but the slowpath fails. In such a case, we could go forward with the fast_path instead of raising like we currently do. This is somewhat safe as it would only allow code that previously raised an error to now work.

However, it seems to me the fast_path / slow_path makes pandas prone to inconsistencies, depending on path chosen, and hard for users to use as what object their transformer gets depends on how it behaves. Instead, could we allow user control of this by adding an argument, e.g.:

 df_multi.groupby(level = 'symbol').transform(lambda x: x['value1'], by='group')  # "fast-path"

vs

 df_multi.groupby(level = 'symbol').transform(lambda x: x['value1'], by='column')  # "slow-path"

This would at least give control, and the user would know whether the UDF provided should accept a DataFrame or Series.

@bherwerth
Copy link
Contributor

I ran into this problem as well these days.

On the original approach with apply, I investigated in a debugger. I think the two cases take different paths after this line:

if not all_indexed_same:

The variable all_indexed_same checks whether all series produced by applying the function on the groups have the same index. When there is only one group and thus only one series, this variable is True, otherwise it is False. And when all_indexed_same is True, then groupby arranges the series in rows of the output data frame.

I doubt a fix is possible without introducing an inconsistency for the use case where all_indexed_same is intentionally True. Maybe one could check whether the single series produced on the one group has the same index as the original data frame and then continue as for not all_indexed_same. But then there could be edge cases where this is the case by accident, which would introduce another inconsistency...

On the approach of using transform, my understanding was that transform expects a function that does not modify the shape or is broadcastable to the original shape of the group's data frame. In this issue's case, the function takes a data frame and returns a series. In this case, would one then have to switch off broadcasting?

@rhshadrach
Copy link
Member

rhshadrach commented Aug 6, 2022

my understanding was that transform expects a function that does not modify the shape or is broadcastable to the original shape of the group's data frame...

This is currently true. Also, the function must be able to act column-by-column. What I'm discussing is removing these restrictions. To me, even something like the following makes sense as a transform, and currently works with the fast-path (have to also disable reindexing the columns of the result):

def foo(x):
    return pd.concat([x['b'], x['b'].rename('y') + 2, x['c']], axis=1)

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5], 'c': [7, 8, 9]})
print(df.groupby('a').transform(foo))

#    b  y  c
# 0  3  5  7
# 1  4  6  8
# 2  5  7  9

In this case, would one then have to switch off broadcasting?

I don't follow why this would need to be done - can you expand on this? The key feature needed about the result from the supplied UDF is that it needs to be able to be either broadcasted or aligned to the input group's index. However thought needs to be put into the logic of whether to broadcast or align.

@bherwerth
Copy link
Contributor

What you describe seems like a useful extension of transform to me.

I don't follow why this would need to be done - can you expand on this? The key feature needed about the result from the supplied UDF is that it needs to be able to be either broadcasted or aligned to the input group's index.

Broadcasting in the sense of aligning to the group's index totally makes sense to me. I was more thinking in the logic of the current implementation, which requires to also preserve the columns: If the group is a data frame of shape (m, n), the UDF returns a series of length m, then the series could be coerced to the original shape by interpreting it as shape (m, 1) and broadcasting to (m, n).

However, with the extension that you describe, only the index would need to be preserved and it would be okay to have different columns. Then my comment about broadcasting would not apply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby
Projects
None yet
Development

No branches or pull requests

4 participants