
DataFrame.GroupBy.apply has unexpected returns in some cases #31063


Open
xin-jin opened this issue Jan 16, 2020 · 6 comments
Labels: Apply, Bug, Groupby

Comments

@xin-jin commented Jan 16, 2020

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

x = pd.DataFrame({'A': np.arange(10), 'B': [1] * 10,
                  'C': np.random.rand(10), 'D': np.random.rand(10)}).set_index(['A', 'B'])

x.groupby('B').apply(lambda g: g.C + g.D)

Problem description

This returns a DataFrame of shape (1, 10).

This seems to occur when (i) the groupby key has only a single unique value, and (ii) the function passed to apply takes a DataFrame and returns a Series.

Expected Output

I expect it to return a Series of shape (10, 1).

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: 0.8.1
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

@dsaxton (Member) commented Jan 19, 2020

It looks like there's a check for this type of case here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L1268, which sends the code down separate paths depending on the number of unique values. In the case where there is only one, the result is explicitly unstacked here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L1306.
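A rough sketch of that branching (a hypothetical simplification for illustration, not the actual pandas source): when every per-group result shares the same index, the stacked result is unstacked, which is what yields the one-row frame reported above.

import pandas as pd

def wrap_applied_output_sketch(values, keys):
    # `values`: list of per-group Series returned by the applied function;
    # `keys`: the corresponding group keys
    all_indexed_same = all(v.index.equals(values[0].index) for v in values)
    stacked = pd.concat(values, keys=keys)
    if all_indexed_same:
        # all groups produced the same index -> unstack into one row per
        # group; with a single group this yields the (1, 10) DataFrame
        return stacked.unstack()
    # indexes differ -> keep the long stacked Series
    return stacked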

I'm not sure if this is by design with some other situation in mind, but I'd agree that the shape of the output shouldn't depend on the cardinality of the thing you're grouping on.

@MarcoGorelli (Member)

Agree that it seems strange:

In [3]: x = pd.DataFrame({'A': np.arange(10), 'B': [1] * 10, 'C': np.random.rand(10), 'D': np.random.rand(10)}).set_index(['A', 'B']) 
   ...:  
   ...: x.groupby('B').apply(lambda g: g.C + g.D)                                                                                                                                                      
Out[3]: 
A         0         1         2         3        4         5         6         7         8         9
B         1         1         1         1        1         1         1         1         1         1
B                                                                                                   
1  1.564408  1.249584  1.567012  0.979613  0.34113  1.051774  0.370114  1.775035  1.073833  1.651014

In [4]: x = pd.DataFrame({'A': np.arange(10), 'B': [1] * 9 + [2], 'C': np.random.rand(10), 'D': np.random.rand(10)}).set_index(['A', 'B']) 
   ...:  
   ...: x.groupby('B').apply(lambda g: g.C + g.D)                                                                                                                                                      
Out[4]: 
B  A  B
1  0  1    0.237817
   1  1    0.312057
   2  1    1.938954
   3  1    1.452997
   4  1    1.379014
   5  1    0.161306
   6  1    1.108117
   7  1    1.597151
   8  1    0.993974
2  9  2    1.191945
dtype: float64

@MarcoGorelli (Member)

I expect it to return a Series of shape (10, 1)

Though I think here it should be (10,), for consistency with the case where the groupby key takes more than one unique value.

@MennovDijk

Is there currently any workaround for this issue that doesn't require modifying the internals of pandas? I never want my output formatted as (1, 10), always as (10, 1). I could modify the code below to force all_indexed_same to always be False, but that seems rather intrusive. Thanks!

def _wrap_applied_output_series(
    self,
    keys,
    values: List[Series],
    not_indexed_same: bool,
    first_not_none,
    key_index,
) -> FrameOrSeriesUnion:
    # this is to silence a DeprecationWarning
    # TODO: Remove when default dtype of empty Series is object
    kwargs = first_not_none._construct_axes_dict()
    backup = create_series_with_explicit_dtype(dtype_if_empty=object, **kwargs)
    values = [x if (x is not None) else backup for x in values]
    all_indexed_same = all_indexes_same(x.index for x in values)

@krlng commented Jan 6, 2021

My (bad) workaround right now is a simple loop, but I really don't like it: it's less Pythonic and much slower when there are many groups. However, modifying the internals of pandas is not an option for me. In my view, this makes apply simply unusable for production data pipelines.
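A minimal sketch of that kind of loop-based workaround (an illustration, not krlng's actual code; it assumes the frame x from the reproducer above): iterate over the groups yourself and concatenate the per-group Series, which sidesteps apply's shape-dependent wrapping.

import pandas as pd

# build the per-group results manually instead of relying on apply's
# return-shape inference; concat with keys restores the group level
pieces = {key: g.C + g.D for key, g in x.groupby('B')}
result = pd.concat(pieces, names=['B'])  # a Series of shape (10,) either way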

@pspeter commented Apr 7, 2021

I also ran into this issue. My workaround was to append a call to pipe to force the output to be a Series:

df.groupby(...).apply(...).pipe(
    # if apply collapsed the result into a one-row DataFrame,
    # take that row back out as a Series; otherwise pass it through
    lambda df_or_series: df_or_series
    if isinstance(df_or_series, pd.Series)
    else df_or_series.iloc[0, :]
)
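A shorter variant of the same idea (a suggestion, under the same assumption that the degenerate case is a one-row DataFrame) is squeeze, which collapses a length-1 row axis and passes a longer Series through unchanged:

x.groupby('B').apply(lambda g: g.C + g.D).squeeze(axis=0)

One caveat: if the result is a Series of length 1, squeeze will reduce it to a scalar.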


8 participants