BUG: pd.Series.mode fails (sometimes) as aggregation method on groupby object #41368

Closed
2 of 3 tasks
Huite opened this issue May 7, 2021 · 7 comments
Labels: Bug, Duplicate Report (duplicate issue or pull request), Groupby

Comments

@Huite

Huite commented May 7, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

This popped up randomly for some of my dataframes, but only a few.
EDIT: Thanks to @mzeitlin11 below for a properly minimal example.

import pandas as pd

df = pd.DataFrame([1, 1, 2, 2])
df.groupby([0, 0, 0, 0]).agg([pd.Series.mode])

This results in a ValueError: no results:

ValueError                                Traceback (most recent call last)
c:\tmp\zonal-mode\debug.py in 
----> 7 selection.groupby("id").agg([pd.Series.mode])

C:\Miniconda3\envs\main\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    943         func = maybe_mangle_lambdas(func)
    944 
--> 945         result, how = aggregate(self, func, *args, **kwargs)
    946         if how is None:
    947             return result

C:\Miniconda3\envs\main\lib\site-packages\pandas\core\aggregation.py in aggregate(obj, arg, *args, **kwargs)
    584         # we require a list, but not an 'str'
    585         arg = cast(List[AggFuncTypeBase], arg)
--> 586         return agg_list_like(obj, arg, _axis=_axis), None
    587     else:
    588         result = None

C:\Miniconda3\envs\main\lib\site-packages\pandas\core\aggregation.py in agg_list_like(obj, arg, _axis)
    672         raise ValueError("no results")
    673 
--> 674     try:
    675         return concat(results, keys=keys, axis=1, sort=False)
    676     except TypeError as err:

ValueError: no results

Other methods seem to work okay:

selection.groupby("id").agg([pd.Series.sum])
selection.groupby("id").agg([pd.Series.mean])
selection.groupby("id").agg([pd.Series.std])
selection.groupby("id").agg(["count"])

Oddly enough, slightly changing the selection makes it run:

selection.iloc[100:].groupby("id").agg([pd.Series.mode])  # works
selection.iloc[:900].groupby("id").agg([pd.Series.mode])  # works
selection.iloc[:917].groupby("id").agg([pd.Series.mode])  # works
selection.iloc[:918].groupby("id").agg([pd.Series.mode])  # errors
selection.iloc[18:918].groupby("id").agg([pd.Series.mode])  # works
selection.iloc[17:918].groupby("id").agg([pd.Series.mode])  # errors

scipy.stats.mode also works fine:

import scipy.stats
selection.groupby("id").agg([scipy.stats.mode])

Problem description

I don't understand why it seems to fail randomly; there are no oddities in the data as far as I can tell.
Given the difference the selections make, I guess it's related to the specific number of occurrences, which would point to something internal to the implementation of pd.Series.mode.
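
One explanation consistent with the observations above is that pd.Series.mode returns every value tied for the highest count, so its result can hold more than one element, and adding or dropping a single row can create or break a tie. A minimal sketch, using made-up values rather than the original dataframe:

```python
import pandas as pd

# Illustrative data, not taken from the report.
values = pd.Series([1, 1, 2, 2])

tied = values.mode()             # 1 and 2 both occur twice
untied = values.iloc[:3].mode()  # dropping one row leaves 1 as the only mode

# The number of modes changes with the selection.
print(len(tied), len(untied))
```

Whether this is what happens inside the groupby aggregation is an assumption here, not something this sketch demonstrates.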

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2cb9652
python : 3.8.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Dutch_Netherlands.1252

pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 3.5.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.23.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fsspec : 2021.04.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : None
xarray : 0.18.0
xlrd : None
xlwt : None
numba : 0.52.0

@Huite added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on May 7, 2021
@MarcoGorelli
Member

Thanks @Huite for your report. To expedite resolution, could you please make your example copy-and-pasteable (i.e. without requiring external data to be downloaded)? See #41320 for an example of such a bug report.

@Huite
Author

Huite commented May 7, 2021

Hi @MarcoGorelli, I've updated the example. I can sympathize with not wanting to download my random files; I wasn't sure what to do given the 1000 lines. I've put it behind another one of those <details> sections, and put an abridged example in the main body of the post.

@MarcoGorelli
Member

Thanks @Huite - what I meant was, does the problem only appear if you have 1000 lines? If not, could you try narrowing it down, so that there's a small reproducible example (such as the one in #41320) which could be used as a test?

@mzeitlin11
Member

Skipping the fallback logic reveals what looks like the actual error:

Traceback (most recent call last):
  File "/Users/matthewzeitlin/miniconda3/envs/pandas-dev/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-af0500c19e92>", line 1, in <module>
    sgrouper.get_result()
  File "pandas/_libs/reduction.pyx", line 279, in pandas._libs.reduction.SeriesGrouper.get_result
    res, initialized = self._apply_to_group(cached_series, cached_index,
  File "pandas/_libs/reduction.pyx", line 95, in pandas._libs.reduction._BaseGrouper._apply_to_group
    check_result_array(res, cached_series.dtype)
  File "pandas/_libs/reduction.pyx", line 38, in pandas._libs.reduction.check_result_array
    raise ValueError("Must produce aggregated value")
ValueError: Must produce aggregated value

I think the root cause here is the same as #38534.

Regardless, this error message isn't very helpful. It would be nice to report in more detail why there are no results.
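
Assuming the failure comes from pd.Series.mode returning more than one value for a group, one possible workaround is to aggregate with a function that always reduces to a single value. The sketch below uses a hypothetical helper, `first_mode`, and made-up data. Picking the smallest mode on ties is an arbitrary choice, not something proposed in this thread.

```python
import pandas as pd


def first_mode(values: pd.Series):
    """Return a single mode per group (hypothetical helper).

    Series.mode returns its values sorted, so on ties this picks the
    smallest one. An empty result (e.g. an all-NaN group) maps to NaN.
    """
    modes = values.mode()
    return modes.iloc[0] if len(modes) else float("nan")


# Made-up data: group "a" has a tie between 1 and 2, group "b" does not.
df = pd.DataFrame(
    {
        "id": ["a", "a", "a", "a", "b", "b", "b"],
        "value": [1, 1, 2, 2, 3, 3, 4],
    }
)

result = df.groupby("id")["value"].agg(first_mode)
print(result)
```

If all tied values are needed, a different approach that does not go through an aggregation would be required, since an aggregation is expected to yield one value per group.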

@mzeitlin11
Member

mzeitlin11 commented May 7, 2021

An MRE would be

import pandas as pd

df = pd.DataFrame([1, 1, 2, 2])
df.groupby([0, 0, 0, 0]).agg([pd.Series.mode])

or

df = pd.DataFrame([1, 1, 2, 2])
df.groupby([0, 0, 0, 0]).agg(pd.Series.mode)

which gives a more informative, but still not great, user-facing error:

Traceback (most recent call last):
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/try.py", line 15, in <module>
    df.groupby([0, 0, 0, 0]).agg(pd.Series.mode)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/groupby/generic.py", line 1008, in aggregate
    return self._python_agg_general(func, *args, **kwargs)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/groupby/groupby.py", line 1280, in _python_agg_general
    return self._python_apply_general(f, self._selected_obj)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/groupby/groupby.py", line 1249, in _python_apply_general
    keys, values, mutated = self.grouper.apply(f, data, self.axis)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/groupby/ops.py", line 777, in apply
    result_values, mutated = splitter.fast_apply(f, sdata, group_keys)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/groupby/ops.py", line 1293, in fast_apply
    return libreduction.apply_frame_axis0(sdata, f, names, starts, ends)
  File "pandas/_libs/reduction.pyx", line 381, in pandas._libs.reduction.apply_frame_axis0
    piece = f(chunk)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/groupby/groupby.py", line 1258, in <lambda>
    f = lambda x: func(x, *args, **kwargs)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/series.py", line 1949, in mode
    return algorithms.mode(self, dropna=dropna)
  File "/Users/matthewzeitlin/Code/contrib/pandas-mzeitlin11/pandas/core/algorithms.py", line 977, in mode
    result = f(values, dropna=dropna)
  File "pandas/_libs/hashtable_func_helper.pxi", line 1764, in pandas._libs.hashtable.mode_int64
    def mode_int64(const int64_t[:] values, bint dropna):
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

@Huite
Author

Huite commented May 7, 2021

Thanks @mzeitlin11, I've included your MRE in the original post instead of my, ehm, rather non-minimal example.

This basically makes this a duplicate of #38534, although I agree it would also be worthwhile to replace ValueError: no results by something more informative -- but that should probably be a separate issue.
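
As a rough illustration of what a more informative message could look like, the sketch below checks each group's result in user code and names the offending group. `agg_with_check` is a hypothetical helper, not part of pandas, and the data is made up.

```python
import pandas as pd
from pandas.api.types import is_scalar


def agg_with_check(grouped, func):
    """Apply func to each group and insist on a scalar result.

    Hypothetical helper for illustration only; it is not how pandas
    implements aggregation.
    """
    name = getattr(func, "__name__", repr(func))
    results = {}
    for key, values in grouped:
        out = func(values)
        if not is_scalar(out):
            raise ValueError(
                f"{name} returned a {type(out).__name__} instead of a scalar "
                f"for group {key!r}; an aggregation must return one value per group"
            )
        results[key] = out
    return pd.Series(results)


# Made-up data: group "a" has two modes, group "b" has one.
df = pd.DataFrame(
    {
        "id": ["a", "a", "a", "a", "b", "b", "b"],
        "value": [1, 1, 2, 2, 3, 3, 4],
    }
)
grouped = df.groupby("id")["value"]

try:
    agg_with_check(grouped, pd.Series.mode)
except ValueError as err:
    print(err)
```

The explicit Python loop trades speed for a clearer message, so it is better suited to debugging than to large dataframes.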

@lithomas1 added the Groupby and Duplicate Report (duplicate issue or pull request) labels and removed the Needs Triage (issue that has not been reviewed by a pandas team member) label on May 7, 2021
@lithomas1
Member

lithomas1 commented May 7, 2021

@Huite OK closing as duplicate then. It's probably better to track all discussion there.
