Inconsistent behavior when using GroupBy and pandas.Series.mode #25581

jointfull · 2019-03-07T02:30:31Z

Code Sample

import pandas

# Works great
df1 = pandas.DataFrame([[20,'A'],[20,'B'],[10,'C']])
gb1 = df1.groupby(0).agg(pandas.Series.mode)
display(gb1)

# Exception: Must produce aggregated value
df2 = pandas.DataFrame([[20,'A'],[20,'B'],[30,'C']])
gb2 = df2.groupby(0).agg(pandas.Series.mode)
display(gb2)

Problem Description

As it seems, the above code works great for df1, returning the following result:

(where C is a str, and [A, B] is a numpy.ndarray)

However, it doesn't work for df2, throwing the following exception:

...
C:\ProgramData\Anaconda3\envs\py36\lib\site-packages\pandas\core\groupby\generic.py in _aggregate_named(self, func, *args, **kwargs)
    907             output = func(group, *args, **kwargs)
    908             if isinstance(output, (Series, Index, np.ndarray)):
--> 909                 raise Exception('Must produce aggregated value')
    910             result[name] = self._try_cast(output, group)
    911 

Exception: Must produce aggregated value

It looks like the order in which the agg function processes the groups affects pandas' error checking: if the first result (GroupBy row) is a numpy.ndarray, the above exception is thrown, but if the first result is a str/scalar, processing continues and no further such exceptions are raised.

In my opinion, this may be related to #2656 or one of its root causes ("fast apply vs. old pathway", as they put it), and to #24016 (which fails to accept numpy.ndarray as a return value). However, in our case, where pandas.Series.mode sometimes returns a scalar and sometimes a numpy.ndarray, the behavior is more elusive and more confusing, and therefore inconsistent (I spent ~2 hours debugging this, trying to understand why so many of my agg function calls work but only one doesn't).

Expected Output

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

WillAyd · 2019-03-07T02:45:05Z

Yea I see where this is confusing but generally we don't support agg reducing to anything but a scalar, so I would think the first example returning something is pure coincidence and if anything the second example would be what we should be doing here.

Others may disagree though so let's see. FWIW apply is a more general function so you'd be safer to do something like:

df2.groupby(0)[1].apply(lambda x: list(x.mode()))
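If a true scalar reduction is acceptable, one option is to pass agg a function that always returns a single value. The sketch below is only an illustration: the helper name `first_mode` is hypothetical, and it assumes that on ties it is fine to keep the first of the sorted modes.

```python
import pandas as pd


def first_mode(values: pd.Series):
    """Reduce a group to one scalar: the first (sorted) mode, or None if empty."""
    modes = values.mode()  # Series.mode returns the modes as a sorted Series
    # Assumption: on ties, keeping only the first mode is acceptable.
    return modes.iloc[0] if len(modes) else None


df2 = pd.DataFrame([[20, "A"], [20, "B"], [30, "C"]])

# Every group yields a scalar, so the result does not depend on group order.
result = df2.groupby(0)[1].agg(first_mode)
```

This drops information when a group has several modes, so it is not a replacement for the list-returning apply call above.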

j-musial · 2019-03-15T15:24:36Z

As explained in #19254, df2.groupby(0)[1].apply(lambda x: list(x.mode())) is really slow, so it would be beneficial to add a separate groupby.mode() function implemented in Cython. Such functionality is very often used for categorical data, so I am surprised that it still has not been implemented in pandas.
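Until such a function exists, the modes can be computed without calling a Python function per group by counting (group, value) pairs and keeping the pairs that reach each group's maximum count. This is a rough sketch, not a benchmarked replacement; the column names `group`, `value`, and `n` are chosen here for illustration, and the result is in long format (one row per mode) rather than a list per group.

```python
import pandas as pd

df = pd.DataFrame({
    "value": [1, 1, 2, 3],
    "group": ["a", "a", "b", "b"],
})

# Count how often each value occurs within each group.
counts = df.groupby(["group", "value"]).size().reset_index(name="n")

# Keep the rows whose count equals the maximum count of their group.
is_mode = counts["n"] == counts.groupby("group")["n"].transform("max")
modes = counts.loc[is_mode, ["group", "value"]].reset_index(drop=True)
```

Groups with tied modes simply contribute several rows, so no step has to return an array from a reducer.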

mroeschke · 2021-06-27T19:42:05Z

This looks to work on master now (albeit agreed that agg should just be a reducer). Could use a test

In [9]: df2 = pandas.DataFrame([[20,'A'],[20,'B'],[30,'C']])
   ...: gb2 = df2.groupby(0).agg(pandas.Series.mode)

In [10]: gb2
Out[10]:
         1
0
20  [A, B]
30       C

jointfull · 2021-07-11T08:24:44Z

..so did someone fix this, or is it just another 'accidental' example of something that works, but given a different example it may fail again?

jacobdadams · 2022-03-11T00:26:37Z

I am still seeing this behavior in 1.4.1.

If the first groupby result returns a single mode, a subsequent groupby result that contains multiple modes will insert a list of the multiple modes. However, if the first groupby result would return multiple modes, it raises ValueError: Must produce aggregated value, similar to the original exception.

In [2]: test_df = pd.DataFrame({
           'value': [1, 1, 2, 3],
           'group': ['a', 'a', 'b', 'b'],
       })
In [4]: gb = test_df.groupby('group')
In [5]: mode_series = gb['value'].agg(pd.Series.mode)
In [6]: mode_series
Out[6]:
group
a         1
b    [2, 3]
Name: value, dtype: object

In [7]: test2_df = pd.DataFrame({
           'value': [1, 2, 3, 3,],
           'group': ['a', 'a', 'b', 'b'],
       })
In [8]: gb2 = test2_df.groupby('group')
In [9]: mode_series2 = gb2['value'].agg(pd.Series.mode)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 mode_series2 = gb2['value'].agg(pd.Series.mode)
# <snip>
File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\core\groupby\ops.py:1010, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
   1006 res = libreduction.extract_result(res)
   1008 if not initialized:
   1009     # We only do this validation on the first iteration
-> 1010     libreduction.check_result_array(res, group.dtype)
   1011     initialized = True
   1013 counts[i] = group.shape[0]

File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\_libs\reduction.pyx:13, in pandas._libs.reduction.check_result_array()

File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\_libs\reduction.pyx:21, in pandas._libs.reduction.check_result_array()

ValueError: Must produce aggregated value
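The comment in the traceback ("We only do this validation on the first iteration") would explain the order dependence. The following is a pure-Python sketch of that pattern, not pandas' actual code: `mode_like` and `agg_checking_first_only` are hypothetical names, and a plain list stands in for the array that Series.mode can return.

```python
from collections import Counter


def mode_like(values):
    """Return a scalar for a unique mode, or a sorted list when modes tie."""
    counts = Counter(values)
    top = max(counts.values())
    modes = sorted(v for v, n in counts.items() if n == top)
    return modes[0] if len(modes) == 1 else modes


def agg_checking_first_only(groups, func):
    """Hypothetical stand-in for the loop in the traceback; not pandas code."""
    results = {}
    initialized = False
    for name, values in groups.items():
        res = func(values)
        if not initialized:
            # Only the first group's result is validated.
            if isinstance(res, list):
                raise ValueError("Must produce aggregated value")
            initialized = True
        results[name] = res
    return results


# First group has a unique mode: the later list slips through unchecked.
ok = agg_checking_first_only({"a": [1, 1], "b": [2, 3]}, mode_like)

# First group has tied modes: the check fires immediately.
try:
    agg_checking_first_only({"a": [1, 2], "b": [3, 3]}, mode_like)
    failed = False
except ValueError:
    failed = True
```

Under this model the two inputs mirror test_df and test2_df above: the same function succeeds or fails depending only on which group comes first.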

pd.show_versions()

INSTALLED VERSIONS

commit : 06d2301
python : 3.10.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19044
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 1.4.1
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.1.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

jreback · 2022-03-11T00:32:50Z

this is an open issue

community patches are how pandas is updated

nbarone1 · 2022-03-28T17:38:37Z

take

nbarone1 · 2022-04-24T17:18:09Z

What is the structure supposed to be for the result?

srotondo · 2022-06-30T19:51:26Z

take

* TST: Test aggregate with list values #25581
* TST: Fixed PEP 8 warnings #25581
* TST: Fixed Flake8 Error #25581
* TST: Added more test cases #25581
* TST: Changed broken test to xfail #25581

Co-authored-by: Steven Rotondo <[email protected]>

amanlai · 2023-04-13T04:48:54Z

FWIW, this issue still persists as of pandas 1.5.3. Here's an example that reproduces the issue:

df1 = pd.DataFrame({'grouper': ['a', 'a', 'a', 'b', 'b'], 'value': [1, 0, 0, 0, 1]})
df2 = pd.DataFrame({'grouper': ['b', 'b', 'b', 'a', 'a'], 'value': [1, 0, 0, 0, 1]})


print(df1.assign(grouper=df1['grouper'].map({'a':'b', 'b':'a'})).equals(df2))  # True

print(df1.groupby('grouper')['value'].agg(pd.Series.mode))  # OK
print(df2.groupby('grouper')['value'].agg(pd.Series.mode))  # ValueError: Must produce aggregated value

It works fine with groupby.apply() however.
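For reference, a self-contained version of the apply variant on the same two frames might look like the sketch below. Wrapping the modes in a list is a choice made here so that every group returns the same kind of object.

```python
import pandas as pd

df1 = pd.DataFrame({"grouper": ["a", "a", "a", "b", "b"], "value": [1, 0, 0, 0, 1]})
df2 = pd.DataFrame({"grouper": ["b", "b", "b", "a", "a"], "value": [1, 0, 0, 0, 1]})


def modes_as_list(values: pd.Series) -> list:
    """Return all modes of a group as a plain Python list."""
    return list(values.mode())


# Each group maps to a list, whether it has one mode or several.
res1 = df1.groupby("grouper")["value"].apply(modes_as_list)
res2 = df2.groupby("grouper")["value"].apply(modes_as_list)
```

Both calls return an object-dtype Series of lists indexed by group, regardless of which group has the tie.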

WillAyd added Groupby Needs Discussion Requires discussion from core team before further action labels Mar 7, 2019

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Needs Discussion Requires discussion from core team before further action labels Jun 27, 2021

github-actions bot assigned nbarone1 Mar 28, 2022

nbarone1 removed their assignment Apr 27, 2022

github-actions bot assigned srotondo Jun 30, 2022

srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 1, 2022

TST: Test aggregate with list values pandas-dev#25581

b3dc515

srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 2, 2022

TST: Fixed PEP 8 warnings pandas-dev#25581

a0f1dcb

srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 2, 2022

TST: Fixed Flake8 Error pandas-dev#25581

1fd0a93

jreback mentioned this issue Jul 3, 2022

TST: Test aggregate with list values #25581 #47559

Merged

5 tasks

jreback added this to the 1.5 milestone Jul 3, 2022

jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jul 3, 2022

srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 4, 2022

TST: Added more test cases pandas-dev#25581

0e208c8

srotondo pushed a commit to srotondo/pandas that referenced this issue Jul 5, 2022

TST: Changed broken test to xfail pandas-dev#25581

3d3953d

mroeschke closed this as completed in #47559 Jul 5, 2022
