I can group with as_index=True and get a Series whose row MultiIndex combines the group keys with the original index. With as_index=False, I get this exception:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[155], line 4
1 import pandas as pd
3 df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})
----> 4 df.groupby('A', as_index=False)['B'].nlargest()
File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1065, in SeriesGroupBy.nlargest(self, n, keep)
1062 data = self._selected_obj
1063 # Don't change behavior if result index happens to be the same, i.e.
1064 # already ordered and n >= all group sizes.
-> 1065 result = self._python_apply_general(f, data, not_indexed_same=True)
1066 return result
File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1406, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
1403 if not_indexed_same is None:
1404 not_indexed_same = mutated
-> 1406 return self._wrap_applied_output(
1407 data,
1408 values,
1409 not_indexed_same,
1410 is_transform,
1411 )
File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/generic.py:390, in SeriesGroupBy._wrap_applied_output(self, data, values, not_indexed_same, is_transform)
388 result.name = self.obj.name
389 if not self.as_index and not_indexed_same:
--> 390 result = self._insert_inaxis_grouper(result)
391 result.index = default_index(len(result))
392 return result
File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1106, in GroupBy._insert_inaxis_grouper(self, result)
1098 for name, lev, in_axis in zip(
1099 reversed(self.grouper.names),
1100 reversed(self.grouper.get_group_levels()),
(...)
1103 # GH #28549
1104 # When using .apply(-), name will be in columns already
1105 if in_axis and name not in columns:
-> 1106 result.insert(0, name, lev)
1108 return result
File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/frame.py:4776, in DataFrame.insert(self, loc, column, value, allow_duplicates)
4773 if not isinstance(loc, int):
4774 raise TypeError("loc must be int")
-> 4776 value = self._sanitize_column(value)
4777 self._mgr.insert(loc, column, value)
File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/frame.py:4870, in DataFrame._sanitize_column(self, value)
4867 return _reindex_for_setitem(Series(value), self.index)
4869 if is_list_like(value):
-> 4870 com.require_length_match(value, self.index)
4871 return sanitize_array(value, self.index, copy=True, allow_2d=True)
File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/common.py:576, in require_length_match(data, index)
572 """
573 Check the length of data matches the length of the index.
574 """
575 if len(data) != len(index):
--> 576 raise ValueError(
577 "Length of values "
578 f"({len(data)}) "
579 "does not match length of index "
580 f"({len(index)})"
581 )
ValueError: Length of values (2) does not match length of index (5)
Expected Behavior
Should not throw an exception
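A minimal sketch of the report, reusing the frame from the traceback. The as_index=False call is wrapped in try/except because the error is only confirmed for pandas 2.0.2; the `outcome` variable is introduced here for illustration, and other versions may behave differently.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [5, 6, 7, 8, 9]})

# Default as_index=True: the result is a Series indexed by
# (group key, original row label).
result_true = df.groupby("A")["B"].nlargest()
print(result_true)

# as_index=False: raises ValueError on pandas 2.0.2 according to the
# traceback; the outcome is only reported, not assumed.
try:
    result_false = df.groupby("A", as_index=False)["B"].nlargest()
    outcome = f"returned {type(result_false).__name__}"
except Exception as exc:
    outcome = f"{type(exc).__name__}: {exc}"
print(outcome)
```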
The two groupby calls apparently return different types, as shown:
type : <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
type : <class 'pandas.core.groupby.generic.SeriesGroupBy'>
It turns out that DataFrameGroupBy objects don't support nlargest().
pandas/core/series.py states that as_index=False is only valid with a DataFrame:
if not as_index:
    raise TypeError("as_index=False only valid with DataFrame")
As a workaround, you can cast the groupby() output to pd.DataFrame(); DataFrame.nlargest() then requires two arguments, n (an int) and columns (a column label or list of labels).
Here is an example to start with:
import pandas as pd
import tabulate  # required by to_markdown()
# same frame as in the report
df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})
# pass as_index=False and cast to pandas.DataFrame;
# this yields one row per group: column 0 is the group key, column 1 the group's Series
series_result = pd.DataFrame(df.groupby('A', as_index=False)['B'])
print("type : ", type(series_result))
# regular call to groupby
ser_result_true = df.groupby('A')['B']
print("type : ", type(ser_result_true))
print("Showing series_result (with as_index=False)")
print(series_result.to_markdown())
print('Showing ser_result_true')
# cast only to print as markdown
print(pd.DataFrame(ser_result_true).to_markdown())
# result from the default call
result_true = ser_result_true.nlargest()
print('result_true markdown')
print(result_true.to_markdown())
# result from the as_index=False call to groupby;
# note this ranks rows by the group key in column 0, not by the values of B
result_false = series_result.nlargest(1, columns=[0])
print('result_false markdown')
print(result_false.to_markdown())
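If the goal is simply a flat frame with the group key as a column, another option is to run the default as_index=True call and move the group level out of the index afterwards. This is a sketch of an alternative, not something proposed in the thread; the names `top` and `flat` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [5, 6, 7, 8, 9]})

# Two largest B values per group, indexed by (A, original row label).
top = df.groupby("A")["B"].nlargest(2)

# Move the group key from the index into a regular column; the
# original row labels remain as the index.
flat = top.reset_index(level=0)
print(flat)

# Optionally drop the original labels to get a default RangeIndex.
print(flat.reset_index(drop=True))
```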
Thanks for the report! This is somewhat complicated by #53707. If we were to treat nsmallest and nlargest as filters, then as_index should make no difference. My recommendation is to hold off on fixing this until these methods behave like filters, and then ensure as_index plays no role in producing the result.
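To illustrate what filter-like behavior could look like, here is a rough sketch that sorts first and then uses groupby head(), which already acts as a filter: it returns a subset of the original rows with their original index, so as_index has no effect. This is only an analogy under that assumption, not how nlargest is implemented or how a fix would necessarily look.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [5, 6, 7, 8, 9]})

# Sort so the largest B values come first, then keep the first two
# rows of each group. head() returns original rows, not an aggregate.
ordered = df.sort_values("B", ascending=False)
top_default = ordered.groupby("A").head(2)
top_no_index = ordered.groupby("A", as_index=False).head(2)

print(top_default)

# Both calls select the same rows, whatever as_index is set to.
print(top_default.equals(top_no_index))
```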
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Installed Versions
INSTALLED VERSIONS
commit : 965ceca
python : 3.8.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.2
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 66.0.0
pip : 23.0.1
Cython : 0.29.34
pytest : 7.3.1
hypothesis : None
sphinx : 7.0.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.1
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : 1.0.9
fastparquet : 2022.12.0
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.19.1
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.4.0
scipy : 1.10.1
snappy : None
sqlalchemy : 1.4.45
tables : 3.8.0
tabulate : None
xarray : 2023.1.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None