
DataFrame.GroupBy.apply has unexpected returns in some cases #31063


Open
xin-jin opened this issue Jan 16, 2020 · 6 comments
Labels: Apply, Bug, Groupby

Comments

@xin-jin commented Jan 16, 2020

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

x = pd.DataFrame({'A': np.arange(10), 'B': [1] * 10,
                  'C': np.random.rand(10), 'D': np.random.rand(10)}).set_index(['A', 'B'])

x.groupby('B').apply(lambda g: g.C + g.D)

Problem description

This returns a DataFrame of shape (1, 10).

This seems to occur when (i) the groupby key has only a single unique value, and (ii) the function passed to apply takes a DataFrame and returns a Series.

Expected Output

I expect it to return a Series of shape (10, 1).

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.17.2
pytz : 2019.3
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.4.0
Cython : 0.29.13
pytest : 5.2.1
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : 1.2.1
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.8.0
pandas_datareader: 0.8.1
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.0
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.9
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.1

@dsaxton (Member) commented Jan 19, 2020

It looks like there's a check for this type of case here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L1268, which sends the code down separate paths depending on the number of unique values. In the case where there is only one, the result is explicitly unstacked here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/generic.py#L1306.
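A rough sketch of that branching (a hypothetical simplification for illustration, not the actual pandas source): when every per-group result shares the same index, the stacked result is unstacked, which is what yields the one-row frame reported above.

import pandas as pd

def wrap_applied_output_sketch(values, keys):
    # `values`: list of per-group Series returned by the applied function;
    # `keys`: the corresponding group keys
    all_indexed_same = all(v.index.equals(values[0].index) for v in values)
    stacked = pd.concat(values, keys=keys)
    if all_indexed_same:
        # all groups produced the same index -> unstack into one row per
        # group; with a single group this yields the (1, 10) DataFrame
        return stacked.unstack()
    # indexes differ -> keep the long stacked Series
    return stacked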

I'm not sure if this is by design with some other situation in mind, but I'd agree that the shape of the output shouldn't depend on the cardinality of the thing you're grouping on.

@MarcoGorelli (Member)

Agree that it seems strange:

In [3]: x = pd.DataFrame({'A': np.arange(10), 'B': [1] * 10, 'C': np.random.rand(10), 'D': np.random.rand(10)}).set_index(['A', 'B']) 
   ...:  
   ...: x.groupby('B').apply(lambda g: g.C + g.D)                                                                                                                                                      
Out[3]: 
A         0         1         2         3        4         5         6         7         8         9
B         1         1         1         1        1         1         1         1         1         1
B                                                                                                   
1  1.564408  1.249584  1.567012  0.979613  0.34113  1.051774  0.370114  1.775035  1.073833  1.651014

In [4]: x = pd.DataFrame({'A': np.arange(10), 'B': [1] * 9 + [2], 'C': np.random.rand(10), 'D': np.random.rand(10)}).set_index(['A', 'B']) 
   ...:  
   ...: x.groupby('B').apply(lambda g: g.C + g.D)                                                                                                                                                      
Out[4]: 
B  A  B
1  0  1    0.237817
   1  1    0.312057
   2  1    1.938954
   3  1    1.452997
   4  1    1.379014
   5  1    0.161306
   6  1    1.108117
   7  1    1.597151
   8  1    0.993974
2  9  2    1.191945
dtype: float64

@MarcoGorelli (Member)

I expect it to return a Series of shape (10, 1)

Though I think here it should be (10,), for consistency with the case where the groupby key takes more than one unique value.

@MennovDijk

Is there currently any workaround for this issue that doesn't require modifying the internals of pandas? I never want my output formatted as (1, 10), always as (10, 1). I could modify the code below to force all_indexed_same to always be False, but that seems rather intrusive. Thanks!

def _wrap_applied_output_series(
    self,
    keys,
    values: List[Series],
    not_indexed_same: bool,
    first_not_none,
    key_index,
) -> FrameOrSeriesUnion:
    # this is to silence a DeprecationWarning
    # TODO: Remove when default dtype of empty Series is object
    kwargs = first_not_none._construct_axes_dict()
    backup = create_series_with_explicit_dtype(dtype_if_empty=object, **kwargs)
    values = [x if (x is not None) else backup for x in values]
    all_indexed_same = all_indexes_same(x.index for x in values)

@krlng commented Jan 6, 2021

My (bad) workaround right now is a simple loop, but I really don't like it: it's less Pythonic and much slower when there are many groups. However, modifying the internals of pandas is not an option for me. In my view, this makes apply simply unusable for production data pipelines.
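A minimal sketch of that kind of loop-based workaround (an illustration, not krlng's actual code; it assumes the frame x from the reproducer above): iterate over the groups yourself and concatenate the per-group Series, which sidesteps apply's shape-dependent wrapping.

import pandas as pd

# build the per-group results manually instead of relying on apply's
# return-shape inference; concat with keys restores the group level
pieces = {key: g.C + g.D for key, g in x.groupby('B')}
result = pd.concat(pieces, names=['B'])  # a Series of shape (10,) either way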

@pspeter commented Apr 7, 2021

I also ran into this issue. My workaround was to append a call to pipe to force the output to be a Series:

df.groupby(...).apply(...).pipe(
    # if apply collapsed the result into a one-row DataFrame,
    # take that row back out as a Series; otherwise pass it through
    lambda df_or_series: df_or_series
    if isinstance(df_or_series, pd.Series)
    else df_or_series.iloc[0, :]
)
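A shorter variant of the same idea (a suggestion, under the same assumption that the degenerate case is a one-row DataFrame) is squeeze, which collapses a length-1 row axis and passes a longer Series through unchanged:

x.groupby('B').apply(lambda g: g.C + g.D).squeeze(axis=0)

One caveat: if the result is a Series of length 1, squeeze will reduce it to a scalar.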


8 participants