
BUG: DataFrame.groupby.agg has inconsistent behaviour depending on DataFrame.groupby by's Iterable length and use of DataFrame.groupby.agg's *args/**kwargs #47092

Closed
3 tasks done
sondalex opened this issue May 22, 2022 · 3 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug Duplicate Report Duplicate issue or pull request Groupby

Comments

@sondalex

Pandas version checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

One element in by:

import pandas as pd
import numpy as np

data = {
    "group_1": [1, 1, 2, 2, 3, 4],
    "group_2": ["A", "B", "C", "C", "D", "D"],
    "x": np.random.uniform(-10, 10, 6),
    "y": np.random.uniform(0, 10, 6),
}
df = pd.DataFrame(data)


df.groupby(["group_1"]).agg(lambda x : print(x))

The snippet above shows that the intermediate objects passed to the function are Series, one per column and group:

0    A
1    B
Name: group_2, dtype: object
2    C
3    C
Name: group_2, dtype: object
4    D
Name: group_2, dtype: object
5    D
Name: group_2, dtype: object
0   -7.935897
1   -3.747197
Name: x, dtype: float64
2   -6.889313
3   -5.122488
Name: x, dtype: float64
4   -8.112173
Name: x, dtype: float64
5   -7.337516
Name: x, dtype: float64
0    8.209067
1    7.556609
Name: y, dtype: float64
2    3.595947
3    3.697471
Name: y, dtype: float64
4    2.58461
Name: y, dtype: float64
5    4.620535
Name: y, dtype: float64

When **kwargs are passed, the intermediate objects are DataFrames instead:

df.groupby(["group_1"]).agg(lambda x,y: print(x), y=1)
   group_1 group_2         x         y
0        1       A -7.935897  8.209067
1        1       B -3.747197  7.556609
   group_1 group_2         x         y
2        2       C -6.889313  3.595947
3        2       C -5.122488  3.697471
   group_1 group_2         x        y
4        3       D -8.112173  2.58461
   group_1 group_2         x         y
5        4       D -7.337516  4.620535

More than one element in by:

Here, the intermediate objects are Series whether or not extra arguments are passed:

df.groupby(["group_1", "group_2"]).agg(lambda x : print(x))
0    7.130014
Name: x, dtype: float64
1    8.12832
Name: x, dtype: float64
2   -7.127394
3   -8.320946
Name: x, dtype: float64
4   -4.383301
Name: x, dtype: float64
5    2.152988
Name: x, dtype: float64
0    3.789269
Name: y, dtype: float64
1    0.040574
Name: y, dtype: float64
2    2.372136
3    9.548322
Name: y, dtype: float64
4    3.798372
Name: y, dtype: float64
5    6.701758
Name: y, dtype: float64
df.groupby(["group_1", "group_2"]).agg(lambda x,y: print(x), y=1)
0    7.130014
Name: x, dtype: float64
1    8.12832
Name: x, dtype: float64
2   -7.127394
3   -8.320946
Name: x, dtype: float64
4   -4.383301
Name: x, dtype: float64
5    2.152988
Name: x, dtype: float64
0    3.789269
Name: y, dtype: float64
1    0.040574
Name: y, dtype: float64
2    2.372136
3    9.548322
Name: y, dtype: float64
4    3.798372
Name: y, dtype: float64
5    6.701758
Name: y, dtype: float64

Issue Description

The type of the intermediate objects passed to func is inconsistent. With df.groupby(by=<ITERABLE>).agg(func=lambda x, y: print(x), y=1), func receives a DataFrame when ITERABLE has length one, but a Series when ITERABLE has length greater than one.

Expected Behavior

When ITERABLE has length one and *args/**kwargs are passed, the intermediate objects should also be instances of Series.
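A possible workaround, sketched under the assumption that the DataFrame path is triggered only when extra arguments are passed to agg itself (as the outputs above suggest): bind the extra arguments with functools.partial first, so that agg receives a plain one-argument callable. The names type_name and bound are illustrative only, and the behaviour may differ between pandas versions.

```python
from functools import partial

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "group_1": [1, 1, 2, 2, 3, 4],
        "group_2": ["A", "B", "C", "C", "D", "D"],
        "x": np.random.uniform(-10, 10, 6),
        "y": np.random.uniform(0, 10, 6),
    }
)


def type_name(obj, y):
    """Report the type of the object that agg hands to the UDF."""
    return type(obj).__name__


# Bind y up front so that agg itself receives no extra *args/**kwargs.
bound = partial(type_name, y=1)

single_key = df.groupby(["group_1"]).agg(bound)
multi_key = df.groupby(["group_1", "group_2"]).agg(bound)
print(single_key)
print(multi_key)
```

If the assumption holds, every cell of both results should read Series, regardless of the number of grouping keys.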

Installed Version

commit : 4bfe3d0
python : 3.10.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.14.10-300.fc35.x86_64
Version : #1 SMP Thu Oct 7 20:48:44 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 22.1.1
setuptools : 57.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@sondalex sondalex changed the title BUG: DataFrame.groupby.agg has inconsistent behaviour depending on DataFrame.groupby by Iterable length and use of DataFrame.groupby.agg's *args/**kwargs BUG: DataFrame.groupby.agg has inconsistent behaviour depending on DataFrame.groupby by's Iterable length and use of DataFrame.groupby.agg's *args/**kwargs May 22, 2022
@simonjayhawkins
Member

Thanks @sondalex for the report.

This does look like a bug. The different code path taken for a single groupby key when args/kwargs are specified is here:

# grouper specific aggregations
if self.grouper.nkeys > 1:
    # test_groupby_as_index_series_scalar gets here with 'not self.as_index'
    return self._python_agg_general(func, *args, **kwargs)
elif args or kwargs:
    # test_pass_args_kwargs gets here (with and without as_index)
    # can't return early
    result = self._aggregate_frame(func, *args, **kwargs)

with _aggregate_frame passing the DataFrame instead of the Series.
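The branching quoted above can be summarised with a toy model. This is not pandas code: udf_input_kind is a hypothetical helper, and its final branch (one key, no extra arguments) is inferred from the outputs reported in this issue rather than from the snippet itself.

```python
def udf_input_kind(nkeys, args=(), kwargs=None):
    """Toy model of which object type the UDF receives; not pandas code."""
    kwargs = kwargs or {}
    if nkeys > 1:
        # Several grouping keys: column-by-column path, the UDF sees a Series.
        return "Series"
    elif args or kwargs:
        # One key plus extra arguments: whole-group path, the UDF sees a DataFrame.
        return "DataFrame"
    # One key, no extra arguments: column-by-column again (inferred from the report).
    return "Series"


for nkeys in (1, 2):
    for kwargs in ({}, {"y": 1}):
        print(nkeys, kwargs, udf_input_kind(nkeys, kwargs=kwargs))
```

Only the combination of a single key with extra arguments yields DataFrame, which matches the four cases in the report.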

cc @rhshadrach

@simonjayhawkins simonjayhawkins added Bug Groupby Apply Apply, Aggregate, Transform, Map labels May 25, 2022
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone May 25, 2022
@rhshadrach
Member

Thanks for the report @sondalex. This is similar to (but more thorough than) #39169.

I agree that this is problematic and if there are to be different things passed to the UDF, the user should control this. The hard problem is how to go about changing behavior, partially because we don't know the UDFs users are using and partially because of how many ways this code can be hit (e.g. I believe df.resample(...).sum() and df.groupby(...).resample(...).sum() both hit it). Because of this, I'm of the opinion we should add an experimental option to enable new implementations of these methods where we can clean up this and a whole range of other things (#41112).

@simonjayhawkins
Member

Thanks @rhshadrach. Yes, I now see that you covered this in #39169 (comment)

Will close as a duplicate.

This is similar to (but more thorough than)

A simpler reproducer based on the above (it could still be simplified further):

import pandas as pd
import numpy as np

data = {
    "group_1": [1, 1, 2, 2, 3, 4],
    "group_2": ["A", "B", "C", "C", "D", "D"],
    "x": np.random.uniform(-10, 10, 6),
    "y": np.random.uniform(0, 10, 6),
}
df = pd.DataFrame(data)

func = lambda x: type(x).__name__
func_with_kwarg = lambda x, y: type(x).__name__

grp = df.groupby(["group_1"])
print(grp.agg(func))
print(grp.agg(func_with_kwarg, y=1))

grp = df.groupby(["group_1", "group_2"])
print(grp.agg(func))
print(grp.agg(func_with_kwarg, y=1))
        group_2       x       y
group_1                        
1        Series  Series  Series
2        Series  Series  Series
3        Series  Series  Series
4        Series  Series  Series
           group_2          x          y
group_1                                 
1        DataFrame  DataFrame  DataFrame
2        DataFrame  DataFrame  DataFrame
3        DataFrame  DataFrame  DataFrame
4        DataFrame  DataFrame  DataFrame
                      x       y
group_1 group_2                
1       A        Series  Series
        B        Series  Series
2       C        Series  Series
3       D        Series  Series
4       D        Series  Series
                      x       y
group_1 group_2                
1       A        Series  Series
        B        Series  Series
2       C        Series  Series
3       D        Series  Series
4       D        Series  Series

@simonjayhawkins simonjayhawkins added the Duplicate Report Duplicate issue or pull request label May 27, 2022
@simonjayhawkins simonjayhawkins removed this from the Contributions Welcome milestone May 27, 2022