BUG: Series mode crashes when there is more than one mode #38534

Open
JulioV opened this issue Dec 17, 2020 · 10 comments
Labels
Apply (Apply, Aggregate, Transform, Map), Bug, Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.)

Comments

JulioV commented Dec 17, 2020

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • [ ] (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd
pd.DataFrame({"data": [1, 1], "value": [0, 1]}).groupby("data").agg({"value": pd.Series.mode})

Problem description

This line of code crashes with ValueError: Must produce aggregated value

Traceback (most recent call last):
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 261, in aggregate
    func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/groupby.py", line 1085, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 658, in agg_series
    return self._aggregate_series_pure_python(obj, func)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/ops.py", line 717, in _aggregate_series_pure_python
    raise ValueError("Function does not reduce")
ValueError: Function does not reduce

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 951, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/base.py", line 416, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/base.py", line 383, in _agg
    result[fname] = func(fname, agg_how)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/base.py", line 367, in _agg_1dim
    return colg.aggregate(how)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 267, in aggregate
    result = self._aggregate_named(func, *args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/rapids202012/lib/python3.7/site-packages/pandas/core/groupby/generic.py", line 482, in _aggregate_named
    raise ValueError("Must produce aggregated value")
ValueError: Must produce aggregated value

Expected Output

[0,1]
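As an aside (not part of the original report): one possible workaround, under the analysis later in this thread, is to have the aggregation function return a plain Python list instead of a Series, since the reduction check only rejects ndarray results:

```python
import pandas as pd

df = pd.DataFrame({"data": [1, 1], "value": [0, 1]})

# Hypothetical workaround: wrap the modes in a Python list so the result
# is never converted to an ndarray by the reducer, sidestepping the check.
result = df.groupby("data")["value"].agg(lambda s: s.mode().tolist())
print(result)
```

This produces one list-valued entry per group rather than crashing, at the cost of an object-dtype result column.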

Output of pd.show_versions()

INSTALLED VERSIONS

commit : b5958ee
python : 3.7.9.final.0
python-bits : 64
OS : Darwin
OS-release : 20.1.0
Version : Darwin Kernel Version 20.1.0: Sat Oct 31 00:07:11 PDT 2020; root:xnu-7195.50.7~2/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.5
numpy : 1.19.2
pytz : 2020.4
dateutil : 2.8.1
pip : 20.3.3
setuptools : 51.0.0.post20201207
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@JulioV JulioV added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 17, 2020
@simonjayhawkins simonjayhawkins added Apply Apply, Aggregate, Transform, Map Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 18, 2020
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Dec 18, 2020
@mzeitlin11
Member

Problem seems to arise here:

output = func(group, *args, **kwargs)
output = libreduction.extract_result(output)
if not initialized:
    # We only do this validation on the first iteration
    libreduction.check_result_array(output, 0)
    initialized = True

pd.Series.mode always returns a Series; if its length is > 1, it gets converted to an ndarray, which raises a ValueError in check_result_array. Interestingly, this check only happens on the first group, so

pd.DataFrame({"data": [1, 1, 2, 2], "value": [1, 2, 0, 0]}).groupby("data").agg({"value": pd.Series.mode})

raises, but ensuring the single mode group comes first

pd.DataFrame({"data": [1, 1, 2, 2], "value": [0, 0, 1, 2]}).groupby("data").agg({"value": pd.Series.mode})

gives the output

       value
data        
1          0
2     [1, 2]
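The asymmetry traces back to pd.Series.mode itself, which always returns a Series whose length depends on how many modes there are:

```python
import pandas as pd

# A single mode gives a length-1 Series, which the reducer unwraps to a
# scalar; multiple modes give a longer Series, which becomes an ndarray
# and fails the first-group reduction check.
single = pd.Series([0, 0, 1]).mode()  # one mode: 0
multi = pd.Series([0, 1]).mode()      # two modes: 0 and 1
print(len(single), len(multi))
```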

@skvrahul
Contributor

Is the validation on only the first group a time-saving optimization or intended behaviour for any reason?

@skvrahul
Contributor

@mzeitlin11 I would like to take this up if you haven't already started work on this.

@mzeitlin11
Member

> @mzeitlin11 I would like to take this up if you haven't already started work on this.

I haven't, go ahead!

@skvrahul
Contributor

take

@chahak13

chahak13 commented Jul 5, 2021

Hello,

Are there any updates on this? I recently came across the same issue. I also hit it with the mode function, but on further checking I found that the problem is not limited to mode: it affects essentially any function that can return a multi-valued result. For instance, a contrived example using sort_values() in place of mode also raises the same error for me.

@chahak13

chahak13 commented Jul 5, 2021

I would like to help if I can, but I don't have any experience deep-diving into pandas. @skvrahul @mzeitlin11 Please let me know if there's anything I can do on this issue.

@mzeitlin11
Member

This is a somewhat tricky issue because nested data can be problematic; see other issues like #39436. I'm not sure of the best way to fix this, but investigations are certainly welcome if you're interested, @chahak13!

@skvrahul
Contributor

skvrahul commented Jul 6, 2021

I apologize for not having gotten to this earlier.

After doing some digging, it looks like the code is written in such a way that it expects the dtype of the aggregation result to equal that of the Series being aggregated.

This is where the error is thrown:

res = self.f(cached_series)
res = extract_result(res)
if not initialized:
    # On the first pass, we check the output shape to see
    # if this looks like a reduction.
    initialized = True
    check_result_array(res, cached_series.dtype)

cpdef check_result_array(object obj, object dtype):
    # Our operation is supposed to be an aggregation/reduction. If
    # it returns an ndarray, this likely means an invalid operation has
    # been passed. See test_apply_without_aggregation, test_agg_must_agg
    if is_array(obj):
        if dtype != _dtype_obj:
            # If it is object dtype, the function can be a reduction/aggregation
            # and still return an ndarray e.g. test_agg_over_numpy_arrays
            raise ValueError("Must produce aggregated value")
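A rough pure-Python paraphrase of that check (an illustration only, not the actual Cython implementation):

```python
import numpy as np

_dtype_obj = np.dtype(object)

def check_result_array(obj, dtype):
    # Reductions should return scalars. An ndarray result is rejected
    # unless the input column was object dtype, where array-valued
    # entries can be legitimate values rather than a failed reduction.
    if isinstance(obj, np.ndarray) and dtype != _dtype_obj:
        raise ValueError("Must produce aggregated value")
```

Under this reading, an ndarray result is accepted for object-dtype input and rejected for everything else, which matches the behaviour described above.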

The code snippet below doesn't throw an error, since the aggregation function returns the same dtype as the Series being aggregated.

import pandas as pd

def sum_list(s):
    return [sum(x) for x in s]

df = pd.DataFrame(
    {
        "data": [1, 1, 1],
        "value": [[1, 2, 3], [4, 0, 8], [11, 7, 9]],
    }
)

print(df)

'''
    data       value
0     1   [1, 2, 3]
1     1   [4, 0, 8]
2     1  [11, 7, 9]
'''


print(df.value.dtype) # dtype = object
result = df.groupby("data").agg({"value": sum_list})

print(result)
'''

data  value     
1     [6, 12, 27]
'''
print(result.value.dtype)  # dtype = object

@skvrahul
Contributor

skvrahul commented Jul 6, 2021

I guess it is up to others to weigh in here on whether this is intentional, expected behaviour or needs changes.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Development

No branches or pull requests

6 participants