BUG: `DataFrame.groupby.agg` has inconsistent behaviour depending on `DataFrame.groupby` `by`'s Iterable length and use of `DataFrame.groupby.agg`'s `*args`/`**kwargs` #47092

sondalex · 2022-05-22T16:58:15Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

One element in `by`:

import pandas as pd
import numpy as np

data = {
    "group_1": [1, 1, 2, 2, 3, 4],
    "group_2": ["A", "B", "C", "C", "D", "D"],
    "x": np.random.uniform(-10, 10, 6),
    "y": np.random.uniform(0, 10, 6),
}
df = pd.DataFrame(data)

# The UDF prints whatever object pandas passes to it.
df.groupby(["group_1"]).agg(lambda x: print(x))

The snippet above prints Series as intermediate objects.

0    A
1    B
Name: group_2, dtype: object
2    C
3    C
Name: group_2, dtype: object
4    D
Name: group_2, dtype: object
5    D
Name: group_2, dtype: object
0   -7.935897
1   -3.747197
Name: x, dtype: float64
2   -6.889313
3   -5.122488
Name: x, dtype: float64
4   -8.112173
Name: x, dtype: float64
5   -7.337516
Name: x, dtype: float64
0    8.209067
1    7.556609
Name: y, dtype: float64
2    3.595947
3    3.697471
Name: y, dtype: float64
4    2.58461
Name: y, dtype: float64
5    4.620535
Name: y, dtype: float64

When `**kwargs` are used, the intermediate objects are DataFrames.

df.groupby(["group_1"]).agg(lambda x, y: print(x), y=1)

   group_1 group_2         x         y
0        1       A -7.935897  8.209067
1        1       B -3.747197  7.556609
   group_1 group_2         x         y
2        2       C -6.889313  3.595947
3        2       C -5.122488  3.697471
   group_1 group_2         x        y
4        3       D -8.112173  2.58461
   group_1 group_2         x         y
5        4       D -7.337516  4.620535

More than one element in `by`:

Here the intermediate objects are Series whether or not `*args`/`**kwargs` are passed.

df.groupby(["group_1", "group_2"]).agg(lambda x: print(x))

0    7.130014
Name: x, dtype: float64
1    8.12832
Name: x, dtype: float64
2   -7.127394
3   -8.320946
Name: x, dtype: float64
4   -4.383301
Name: x, dtype: float64
5    2.152988
Name: x, dtype: float64
0    3.789269
Name: y, dtype: float64
1    0.040574
Name: y, dtype: float64
2    2.372136
3    9.548322
Name: y, dtype: float64
4    3.798372
Name: y, dtype: float64
5    6.701758
Name: y, dtype: float64

df.groupby(["group_1", "group_2"]).agg(lambda x, y: print(x), y=1)

0    7.130014
Name: x, dtype: float64
1    8.12832
Name: x, dtype: float64
2   -7.127394
3   -8.320946
Name: x, dtype: float64
4   -4.383301
Name: x, dtype: float64
5    2.152988
Name: x, dtype: float64
0    3.789269
Name: y, dtype: float64
1    0.040574
Name: y, dtype: float64
2    2.372136
3    9.548322
Name: y, dtype: float64
4    3.798372
Name: y, dtype: float64
5    6.701758
Name: y, dtype: float64

Issue Description

The type of the intermediate objects is inconsistent. When the ITERABLE passed in `df.groupby(by=<ITERABLE>).agg(func=lambda x, y: print(x), y=1)` has length one, the function receives a DataFrame per group; when it has length greater than one, the function receives Series.

Expected Behavior

When ITERABLE has length one and `*args`/`**kwargs` are passed, the intermediate objects should also be instances of Series.
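Until the two cases agree, a UDF can defend itself against either input type. The following is only a minimal sketch: `per_column` and `scaled_range` are hypothetical names introduced for illustration, and the wrapper is exercised directly on a Series and a DataFrame rather than through `groupby(...).agg(...)`, since how pandas assembles the returned values on that path is not verified here.

```python
import functools

import pandas as pd


def per_column(func):
    """Hypothetical decorator: make sure `func` always sees a Series."""

    @functools.wraps(func)
    def wrapper(obj, *args, **kwargs):
        if isinstance(obj, pd.DataFrame):
            # Apply the UDF to each column; the result is a Series
            # indexed by column name.
            return obj.apply(lambda col: func(col, *args, **kwargs))
        return func(obj, *args, **kwargs)

    return wrapper


@per_column
def scaled_range(values, scale):
    # Written with a Series in mind: one scalar per call.
    return (values.max() - values.min()) * scale


frame = pd.DataFrame({"x": [1.0, 4.0], "y": [10.0, 16.0]})
from_series = scaled_range(frame["x"], scale=2)
from_frame = scaled_range(frame, scale=2)
print(from_series)
print(from_frame)
```

The design choice is to normalise the input inside the UDF instead of relying on which internal code path `agg` happens to take.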

Installed Version

commit : 4bfe3d0
python : 3.10.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.10-300.fc35.x86_64
Version : #1 SMP Thu Oct 7 20:48:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 22.1.1
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None


simonjayhawkins · 2022-05-25T14:25:33Z

Thanks @sondalex for the report.

This does look like a bug. The different code path taken for a single groupby key when args/kwargs are specified is here:

pandas/pandas/core/groupby/generic.py

Lines 885 to 892 in 7c054d6

    
        # grouper specific aggregations
        if self.grouper.nkeys > 1:
            # test_groupby_as_index_series_scalar gets here with 'not self.as_index'
            return self._python_agg_general(func, *args, **kwargs)
        elif args or kwargs:
            # test_pass_args_kwargs gets here (with and without as_index)
            # can't return early
            result = self._aggregate_frame(func, *args, **kwargs)

Here `_aggregate_frame` passes each group's DataFrame to the function instead of a Series.
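Given that branch, one possible user-side workaround is to bind the extra arguments with `functools.partial`, so that `agg` itself receives no `*args`/`**kwargs`. This is a sketch under the assumption that the quoted `elif args or kwargs` check is what selects the DataFrame path; `record_type` and `seen` are hypothetical helpers for illustration, and the behaviour may differ between pandas versions.

```python
from functools import partial

import pandas as pd

df = pd.DataFrame(
    {
        "group_1": [1, 1, 2, 2, 3, 4],
        "x": [0.5, 1.5, 2.0, 4.0, 3.0, 7.0],
    }
)

seen = set()  # type names of the objects handed to the UDF


def record_type(obj, y):
    """Hypothetical UDF: remember what pandas passed in, return a scalar."""
    seen.add(type(obj).__name__)
    return len(obj) * y


# The extra argument is bound up front, so agg() sees no args/kwargs
# and should not enter the `elif args or kwargs` branch.
result = df.groupby(["group_1"]).agg(partial(record_type, y=1))
print(seen)
print(result)
```

Inspecting `seen` after the call shows which type the UDF actually received on the installed pandas version.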

cc @rhshadrach

rhshadrach · 2022-05-26T21:11:30Z

Thanks for the report @sondalex. This is similar to (but more thorough than) #39169.

I agree that this is problematic and if there are to be different things passed to the UDF, the user should control this. The hard problem is how to go about changing behavior, partially because we don't know the UDFs users are using and partially because of how many ways this code can be hit (e.g. I believe df.resample(...).sum() and df.groupby(...).resample(...).sum() both hit it). Because of this, I'm of the opinion we should add an experimental option to enable new implementations of these methods where we can clean up this and a whole range of other things (#41112).

simonjayhawkins · 2022-05-27T09:38:16Z

Thanks @rhshadrach. Yes, I now see that you covered this in #39169 (comment)

Will close as duplicate.

This is similar to (but more thorough than)

As a simpler reproducer based on the above (it could still be simplified further):

import pandas as pd
import numpy as np

data = {
    "group_1": [1, 1, 2, 2, 3, 4],
    "group_2": ["A", "B", "C", "C", "D", "D"],
    "x": np.random.uniform(-10, 10, 6),
    "y": np.random.uniform(0, 10, 6),
}
df = pd.DataFrame(data)

# Each UDF returns the type name of the object it receives.
func = lambda x: type(x).__name__
func_with_kwarg = lambda x, y: type(x).__name__

grp = df.groupby(["group_1"])
print(grp.agg(func))
print(grp.agg(func_with_kwarg, y=1))

grp = df.groupby(["group_1", "group_2"])
print(grp.agg(func))
print(grp.agg(func_with_kwarg, y=1))

        group_2       x       y
group_1                        
1        Series  Series  Series
2        Series  Series  Series
3        Series  Series  Series
4        Series  Series  Series
           group_2          x          y
group_1                                 
1        DataFrame  DataFrame  DataFrame
2        DataFrame  DataFrame  DataFrame
3        DataFrame  DataFrame  DataFrame
4        DataFrame  DataFrame  DataFrame
                      x       y
group_1 group_2                
1       A        Series  Series
        B        Series  Series
2       C        Series  Series
3       D        Series  Series
4       D        Series  Series
                      x       y
group_1 group_2                
1       A        Series  Series
        B        Series  Series
2       C        Series  Series
3       D        Series  Series
4       D        Series  Series

simonjayhawkins added the Bug, Groupby, and Apply (Apply, Aggregate, Transform, Map) labels May 25, 2022

simonjayhawkins added this to the Contributions Welcome milestone May 25, 2022

simonjayhawkins closed this as completed May 27, 2022

simonjayhawkins added the Duplicate Report (Duplicate issue or pull request) label May 27, 2022

simonjayhawkins removed this from the Contributions Welcome milestone May 27, 2022
