BUG: Series mode crashes when there is more than one mode #38534

Open
JulioV opened this issue Dec 17, 2020 · 10 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Bug, Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)

Comments

JulioV commented Dec 17, 2020

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd
pd.DataFrame({"data": [1, 1], "value": [0, 1]}).groupby("data").agg({"value": pd.Series.mode})

Problem description

This line of code crashes with ValueError: Must produce aggregated value

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 261, in aggregate
    func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1085, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 658, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 717, in _aggregate_series_pure_python
    raise ValueError("Function does not reduce")
ValueError: Function does not reduce

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 951, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/base.py", line 416, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/base.py", line 383, in _agg
    result[fname] = func(fname, agg_how)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/base.py", line 367, in _agg_1dim
    return colg.aggregate(how)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 267, in aggregate
    result = self._aggregate_named(func, *args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 482, in _aggregate_named
    raise ValueError("Must produce aggregated value")
ValueError: Must produce aggregated value

Expected Output

[0,1]
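As an aside (not part of the original report): one possible workaround, under the analysis later in this thread, is to have the aggregation function return a plain Python list instead of a Series, since the reduction check only rejects ndarray results:

```python
import pandas as pd

df = pd.DataFrame({"data": [1, 1], "value": [0, 1]})

# Hypothetical workaround: wrap the modes in a Python list so the result
# is never converted to an ndarray by the reducer, sidestepping the check.
result = df.groupby("data")["value"].agg(lambda s: s.mode().tolist())
print(result)
```

This produces one list-valued entry per group rather than crashing, at the cost of an object-dtype result column.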

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.7.9.final.0
python-bits : 64
OS : Darwin
OS-release : 20.1.0
Version : Darwin Kernel Version 20.1.0: Sat Oct 31 00:07:11 PDT 2020; root:xnu-7195.50.7~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.19.2
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.0.0.post20201207
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@JulioV JulioV added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 17, 2020
@simonjayhawkins simonjayhawkins added Apply Apply, Aggregate, Transform, Map Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 18, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Dec 18, 2020
@mzeitlin11
Member

Problem seems to arise here:

output = func(group, *args, **kwargs)
output = libreduction.extract_result(output)
if not initialized:
    # We only do this validation on the first iteration
    libreduction.check_result_array(output, 0)
    initialized = True

pd.Series.mode always returns a Series; if its length is > 1, it gets converted to an ndarray, which raises a ValueError in check_result_array. Interestingly, this check only happens on the first group, so

pd.DataFrame({"data": [1, 1, 2, 2], "value": [1, 2, 0, 0]}).groupby("data").agg({"value": pd.Series.mode})

raises, but ensuring the single mode group comes first

pd.DataFrame({"data": [1, 1, 2, 2], "value": [0, 0, 1, 2]}).groupby("data").agg({"value": pd.Series.mode})

gives the output

       value
data        
1          0
2     [1, 2]
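The asymmetry traces back to pd.Series.mode itself, which always returns a Series whose length depends on how many modes there are:

```python
import pandas as pd

# A single mode gives a length-1 Series, which the reducer unwraps to a
# scalar; multiple modes give a longer Series, which becomes an ndarray
# and fails the first-group reduction check.
single = pd.Series([0, 0, 1]).mode()  # one mode: 0
multi = pd.Series([0, 1]).mode()      # two modes: 0 and 1
print(len(single), len(multi))
```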

@skvrahul
Contributor

Is the validation on only the first group a time-saving optimization or intended behaviour for any reason?

@skvrahul
Contributor

@mzeitlin11 I would like to take this up if you haven't already started work on this.

@mzeitlin11
Member

> @mzeitlin11 I would like to take this up if you haven't already started work on this.

I haven't, go ahead!

@skvrahul
Contributor

take

@chahak13

chahak13 commented Jul 5, 2021

Hello,

Are there any updates on this? I recently came across the same issue. I also hit it with the mode function, but on further checking I found that the problem is not limited to mode: it affects essentially any function that can return a multi-valued result. For instance, a contrived example using sort_values() in place of mode also raises the same error for me.

@chahak13

chahak13 commented Jul 5, 2021

I would like to help if I can, but I don't have any experience deep-diving into pandas. @skvrahul @mzeitlin11 Please let me know if there's anything I can do on this issue.

@mzeitlin11
Member

This is a somewhat tricky issue because nested data can be problematic; see other issues like #39436. I'm not sure of the best way to fix this, but investigations are certainly welcome if you're interested, @chahak13!

@skvrahul
Contributor

skvrahul commented Jul 6, 2021

I apologize for not having gotten to this earlier.

After doing some digging, it looks like the code is written in such a way that it expects the dtype of the aggregation result to equal that of the Series being aggregated.

This is where the error is thrown:

res = self.f(cached_series)
res = extract_result(res)
if not initialized:
    # On the first pass, we check the output shape to see
    # if this looks like a reduction.
    initialized = True
    check_result_array(res, cached_series.dtype)

cpdef check_result_array(object obj, object dtype):
    # Our operation is supposed to be an aggregation/reduction. If
    # it returns an ndarray, this likely means an invalid operation has
    # been passed. See test_apply_without_aggregation, test_agg_must_agg
    if is_array(obj):
        if dtype != _dtype_obj:
            # If it is object dtype, the function can be a reduction/aggregation
            # and still return an ndarray e.g. test_agg_over_numpy_arrays
            raise ValueError("Must produce aggregated value")
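A rough pure-Python paraphrase of that check (an illustration only, not the actual Cython implementation):

```python
import numpy as np

_dtype_obj = np.dtype(object)

def check_result_array(obj, dtype):
    # Reductions should return scalars. An ndarray result is rejected
    # unless the input column was object dtype, where array-valued
    # entries can be legitimate values rather than a failed reduction.
    if isinstance(obj, np.ndarray) and dtype != _dtype_obj:
        raise ValueError("Must produce aggregated value")
```

Under this reading, an ndarray result is accepted for object-dtype input and rejected for everything else, which matches the behaviour described above.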

The code snippet below doesn't throw an error, since the aggregation function returns the same dtype as the Series being aggregated.

import pandas as pd

def sum_list(s):
    return [sum(x) for x in s]

df = pd.DataFrame(
    {
        "data": [1, 1, 1],
        "value": [[1, 2, 3], [4, 0, 8], [11, 7, 9]],
    }
)

print(df)

'''
    data       value
0     1   [1, 2, 3]
1     1   [4, 0, 8]
2     1  [11, 7, 9]
'''


print(df.value.dtype) # dtype = object
result = df.groupby("data").agg({"value": sum_list})

print(result)
'''

data  value     
1     [6, 12, 27]
'''
print(result.value.dtype)  # dtype = object

@skvrahul
Contributor

skvrahul commented Jul 6, 2021

I guess it is up to others to weigh in here on whether this is intentional, expected behaviour or needs changes.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Development

No branches or pull requests

6 participants