Inconsistent behavior when using GroupBy and pandas.Series.mode #25581
Comments
Yea I see where this is confusing but generally we don't support agg reducing to anything but a scalar, so I would think the first example returning something is pure coincidence and if anything the second example would be what we should be doing here. Others may disagree though so let's see. FWIW apply is a more general function so you'd be safer to do something like: df2.groupby(0)[1].apply(lambda x: list(x.mode()))
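A minimal sketch of the apply-based suggestion above, assuming a reasonably recent pandas. The helper name modes_as_list and the two sample frames are illustrative only (the frames mirror the df1/df2 reproduction posted later in this thread). Because every group returns a list, the result should not depend on which group is processed first:

```python
import pandas as pd

# Same data, with the group labels swapped between the two frames.
df1 = pd.DataFrame({"grouper": ["a", "a", "a", "b", "b"], "value": [1, 0, 0, 0, 1]})
df2 = pd.DataFrame({"grouper": ["b", "b", "b", "a", "a"], "value": [1, 0, 0, 0, 1]})


def modes_as_list(s):
    # Hypothetical helper: always return a list, even for a single mode,
    # so every group yields the same kind of object.
    return s.mode().tolist()


res1 = df1.groupby("grouper")["value"].apply(modes_as_list)
res2 = df2.groupby("grouper")["value"].apply(modes_as_list)

print(res1)
print(res2)
```

Both calls return an object-dtype Series whose entries are lists, at the cost of losing a numeric dtype for groups with a single mode.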
As explained in #19254, df2.groupby(0)[1].apply(lambda x: list(x.mode())) is really slow, so it would be beneficial to add a separate groupby.mode() function implemented in Cython. Such functionality is very often used for categorical data, so I am surprised that it still has not been implemented in pandas.
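Until something like groupby.mode() exists, one possible way to avoid calling a Python lambda per group is to count values with built-in groupby operations and keep the most frequent ones. This is a sketch of a different technique, not the proposed Cython implementation; the function name groupby_modes is hypothetical, the output is a long-format frame with one row per (group, mode) pair rather than lists, and no speed-up has been measured here:

```python
import pandas as pd


def groupby_modes(df, key, col):
    # Hypothetical helper: count occurrences of each (key, col) pair.
    # Rows with NaN in either column are dropped by groupby's default settings.
    counts = df.groupby([key, col]).size().rename("count").reset_index()
    # Broadcast the highest count within each group back onto every row.
    max_count = counts.groupby(key)["count"].transform("max")
    # Keep only the values whose count ties for the maximum in their group.
    return counts.loc[counts["count"] == max_count, [key, col]].reset_index(drop=True)


test_df = pd.DataFrame({
    "value": [1, 1, 2, 3],
    "group": ["a", "a", "b", "b"],
})

modes = groupby_modes(test_df, "group", "value")
print(modes)
```

Group a contributes a single row (value 1) and group b contributes two rows (values 2 and 3), so ties are represented without mixing scalars and lists in one column.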
This looks to work on master now (albeit agreed that agg should just be a reducer). Could use a test.
So did someone fix this, or is it just another 'accidental' example of something that works but may fail again given a different example?
I am still seeing this behavior in
If the first groupby result returns a single mode, a subsequent groupby result that contains multiple modes will insert a list of the multiple modes. However, if the first groupby result would return multiple modes, it returns a
In [2]: test_df = pd.DataFrame({
'value': [1, 1, 2, 3],
'group': ['a', 'a', 'b', 'b'],
})
In [4]: gb = test_df.groupby('group')
In [5]: mode_series = gb['value'].agg(pd.Series.mode)
In [6]: mode_series
Out[6]:
group
a 1
b [2, 3]
Name: value, dtype: object
In [7]: test2_df = pd.DataFrame({
'value': [1, 2, 3, 3,],
'group': ['a', 'a', 'b', 'b'],
})
In [8]: gb2 = test2_df.groupby('group')
In [9]: mode_series2 = gb2['value'].agg(pd.Series.mode)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [9], in <cell line: 1>()
----> 1 mode_series2 = gb2['value'].agg(pd.Series.mode)
# <snip>
File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\core\groupby\ops.py:1010, in BaseGrouper._aggregate_series_pure_python(self, obj, func)
1006 res = libreduction.extract_result(res)
1008 if not initialized:
1009 # We only do this validation on the first iteration
-> 1010 libreduction.check_result_array(res, group.dtype)
1011 initialized = True
1013 counts[i] = group.shape[0]
File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\_libs\reduction.pyx:13, in pandas._libs.reduction.check_result_array()
File ~\Miniconda3\envs\pandas\lib\site-packages\pandas\_libs\reduction.pyx:21, in pandas._libs.reduction.check_result_array()
ValueError: Must produce aggregated value
pd.show_versions()
INSTALLED VERSIONS
commit : 06d2301
pandas : 1.4.1
This is an open issue; community patches are how pandas is updated.
take
What is the structure supposed to be for the result?
take
* TST: Test aggregate with list values pandas-dev#25581
* TST: Fixed PEP 8 warnings pandas-dev#25581
* TST: Fixed Flake8 Error pandas-dev#25581
* TST: Added more test cases pandas-dev#25581
* TST: Changed broken test to xfail pandas-dev#25581
Co-authored-by: Steven Rotondo <[email protected]>
FWIW, this issue still persists as of pandas 1.5.3. Here's an example that reproduces the issue:
df1 = pd.DataFrame({'grouper': ['a', 'a', 'a', 'b', 'b'], 'value': [1, 0, 0, 0, 1]})
df2 = pd.DataFrame({'grouper': ['b', 'b', 'b', 'a', 'a'], 'value': [1, 0, 0, 0, 1]})
print(df1.assign(grouper=df1['grouper'].map({'a':'b', 'b':'a'})).equals(df2)) # True
print(df1.groupby('grouper')['value'].agg(pd.Series.mode)) # OK
print(df2.groupby('grouper')['value'].agg(pd.Series.mode)) # ValueError: Must produce aggregated value
It works fine with
Code Sample
Problem Description
As it seems, the above code works great for df1, returning the following result:
(where C is a str, and [A, B] is a numpy.ndarray)
However, it doesn't work for df2, throwing the following exception:
It looks like the order in which the agg function's results are processed affects pandas' error checking: if the first result (GroupBy row) is a numpy.ndarray, the above exception is thrown, but if the first result is a str/scalar, processing continues and later non-scalar results are accepted without raising.
In my opinion, this item may be related to #2656 or one of its root causes ("fast apply vs. old pathway", as they put it), and to #24016 (which fails to accept numpy.ndarray as a return value). However, in our case, where pandas.Series.mode sometimes returns a scalar and sometimes a numpy.ndarray, the behavior is more elusive, more confusing, and therefore inconsistent (I spent ~2 hours debugging this, trying to understand why so many of my agg function calls work but only one doesn't).
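To illustrate why the results vary in shape: Series.mode itself always returns a Series, with one entry per tied mode, so the scalar-versus-array difference appears when agg unpacks that result. If a single value per group is acceptable, a strictly reducing function avoids the order dependence. The sketch below assumes that taking the first (smallest) mode is an acceptable tie-break and that every group has at least one non-null value; first_mode is a hypothetical helper name, and the sample data is illustrative:

```python
import pandas as pd

# Series.mode returns a Series whose length equals the number of tied modes.
single = pd.Series(["C", "C", "D"]).mode()   # one mode
multi = pd.Series(["A", "B"]).mode()         # two tied modes, returned sorted
print(len(single), len(multi))


def first_mode(s):
    # Hypothetical reducer: always return exactly one scalar per group.
    # Assumes the group has at least one non-null value; otherwise
    # s.mode() is empty and .iloc[0] raises IndexError.
    return s.mode().iloc[0]


# The first group processed ("a") has two tied modes.
df2 = pd.DataFrame({"grouper": ["b", "b", "b", "a", "a"], "value": [1, 0, 0, 0, 1]})
out = df2.groupby("grouper")["value"].agg(first_mode)
print(out)
```

This trades away information about ties in exchange for a scalar per group, which matches the view expressed in the comments that agg should act as a reducer.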
Expected Output
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.0
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None