BUG: SeriesGroupBy nlargest with as_index=False raises ValueError: Length of values (2) does not match length of index (5) #53706

mvashishtha · 2023-06-16T23:19:54Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})
df.groupby('A', as_index=False)['B'].nlargest()

Issue Description

With as_index=True the call works, and I get a Series whose row MultiIndex combines the group keys with the original index. With as_index=False I get this exception:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[155], line 4
      1 import pandas as pd
      3 df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})
----> 4 df.groupby('A', as_index=False)['B'].nlargest()

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1065, in SeriesGroupBy.nlargest(self, n, keep)
   1062 data = self._selected_obj
   1063 # Don't change behavior if result index happens to be the same, i.e.
   1064 # already ordered and n >= all group sizes.
-> 1065 result = self._python_apply_general(f, data, not_indexed_same=True)
   1066 return result

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1406, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1403 if not_indexed_same is None:
   1404     not_indexed_same = mutated
-> 1406 return self._wrap_applied_output(
   1407     data,
   1408     values,
   1409     not_indexed_same,
   1410     is_transform,
   1411 )

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/generic.py:390, in SeriesGroupBy._wrap_applied_output(self, data, values, not_indexed_same, is_transform)
    388     result.name = self.obj.name
    389 if not self.as_index and not_indexed_same:
--> 390     result = self._insert_inaxis_grouper(result)
    391     result.index = default_index(len(result))
    392 return result

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1106, in GroupBy._insert_inaxis_grouper(self, result)
   1098 for name, lev, in_axis in zip(
   1099     reversed(self.grouper.names),
   1100     reversed(self.grouper.get_group_levels()),
   (...)
   1103     # GH #28549
   1104     # When using .apply(-), name will be in columns already
   1105     if in_axis and name not in columns:
-> 1106         result.insert(0, name, lev)
   1108 return result

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/frame.py:4776, in DataFrame.insert(self, loc, column, value, allow_duplicates)
   4773 if not isinstance(loc, int):
   4774     raise TypeError("loc must be int")
-> 4776 value = self._sanitize_column(value)
   4777 self._mgr.insert(loc, column, value)

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/frame.py:4870, in DataFrame._sanitize_column(self, value)
   4867     return _reindex_for_setitem(Series(value), self.index)
   4869 if is_list_like(value):
-> 4870     com.require_length_match(value, self.index)
   4871 return sanitize_array(value, self.index, copy=True, allow_2d=True)

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/common.py:576, in require_length_match(data, index)
    572 """
    573 Check the length of data matches the length of the index.
    574 """
    575 if len(data) != len(index):
--> 576     raise ValueError(
    577         "Length of values "
    578         f"({len(data)}) "
    579         "does not match length of index "
    580         f"({len(index)})"
    581     )

ValueError: Length of values (2) does not match length of index (5)
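
The mismatch in the message can be reproduced in isolation. The sketch below is only an illustration of the last frames of the traceback, not the pandas internals themselves: `frame` and `group_levels` are hypothetical stand-ins for the 5-row nlargest result and the 2 unique group keys that `_insert_inaxis_grouper` tries to insert.

```python
import pandas as pd

# Hypothetical stand-in for the 5-row nlargest result before the grouper is inserted.
frame = pd.DataFrame({'B': [8, 6, 5, 9, 7]})

# Hypothetical stand-in for the group levels: one entry per unique key in 'A'.
group_levels = [1, 2]

message = None
try:
    # Inserting a length-2 list-like into a 5-row frame fails the length check.
    frame.insert(0, 'A', group_levels)
except ValueError as exc:
    message = str(exc)
    print(message)
```

This suggests the group levels are being inserted as if the result had one row per group, while nlargest returns up to n rows per group.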

Expected Behavior

The call should not raise an exception.
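
Until this is fixed, one possible way to get a flat result is to use the default as_index=True and move the group key out of the index afterwards. This is a sketch of a workaround, not a statement of what as_index=False is meant to return; the names `grouped` and `flat` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})

# With the default as_index=True, nlargest returns a Series indexed by
# (group key, original row label).
grouped = df.groupby('A')['B'].nlargest()

# Move the group key 'A' from the index into a column, then drop the
# original row labels in favour of a default RangeIndex.
flat = grouped.reset_index(level=0).reset_index(drop=True)
print(flat)
```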

Installed Versions

INSTALLED VERSIONS

commit : 965ceca
python : 3.8.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.2
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 66.0.0
pip : 23.0.1
Cython : 0.29.34
pytest : 7.3.1
hypothesis : None
sphinx : 7.0.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.1
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : 1.0.9
fastparquet : 2022.12.0
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.19.1
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.4.0
scipy : 1.10.1
snappy : None
sqlalchemy : 1.4.45
tables : 3.8.0
tabulate : None
xarray : 2023.1.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

rahulsiloniya · 2023-06-18T14:13:21Z

I ran some type checks on what groupby returns for each value of as_index:

import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})

series_result = df.groupby('A', as_index=False)['B']
print("type : ", type(series_result))
ser_result_true = df.groupby('A')['B']
print("type : ", type(ser_result_true))

When I ran this, the two return types differed, as shown:

type :  <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
type :  <class 'pandas.core.groupby.generic.SeriesGroupBy'>

It turns out that DataFrameGroupBy objects don't support nlargest().

pandas/core/series.py states that as_index=False is only valid with a DataFrame:

if not as_index:
    raise TypeError("as_index=False only valid with DataFrame")

As a workaround, you can cast the groupby() output to pd.DataFrame(); its nlargest() method then requires two arguments, n: int and columns: list.

Here is an example to start with:

import pandas as pd
import tabulate  # required by to_markdown()

df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})

# pass as_index=False and cast to pandas.DataFrame
series_result = pd.DataFrame(df.groupby('A', as_index=False)['B'])
print("type : ", type(series_result))

# pass regular call to groupby
ser_result_true = df.groupby('A')['B']
print("type : ", type(ser_result_true))


print("Showing series_result (with as_index = False)")
print(series_result.to_markdown())
print('Showing ser_result_true')

# cast to DataFrame only so it can be printed as markdown

print(pd.DataFrame(ser_result_true).to_markdown())
result_true = ser_result_true.nlargest()

# result from default call
print('result_true markdown')
print(result_true.to_markdown())

# result from as_index=False call to groupby

result_false = series_result.nlargest(1, columns=[0])
print('result_false markdown')
print(result_false.to_markdown())

rhshadrach · 2023-06-19T02:04:17Z

Thanks for the report! This is somewhat complicated by #53707. If we were to treat nsmallest and nlargest as filters, then as_index should make no difference. My recommendation is to hold off on fixing this until these methods behave like filters, and then ensure as_index plays no role in producing the result.
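
To illustrate what filter-like behavior could look like, here is a rough sketch that emulates a per-group "n largest" with sort_values and the existing groupby head method, which returns a subset of the original rows under their original index. This is an assumption about the intended semantics, not the behavior proposed in #53707, and `top2` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})

# Sort so the largest values come first, then keep the first 2 rows of
# each group. The result keeps the original row labels and has no group
# key in its index.
top2 = df.sort_values('B', ascending=False).groupby('A')['B'].head(2)
print(top2)
```

Because the result is just a selection of the original rows, there is no group key to insert, so as_index has nothing to act on.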

mvashishtha added the Bug and Needs Triage labels Jun 16, 2023

rhshadrach added the Groupby label Jun 19, 2023

rhshadrach removed the Needs Triage label Jun 21, 2023
