BUG: SeriesGroupBy nlargest with as_index=False raises ValueError: Length of values (2) does not match length of index (5) #53706

Open
2 of 3 tasks
mvashishtha opened this issue Jun 16, 2023 · 2 comments

Comments

@mvashishtha

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})
df.groupby('A', as_index=False)['B'].nlargest()

Issue Description

With as_index=True, the call works and returns a Series whose row MultiIndex combines the group keys with the original index labels. With as_index=False, I get this exception:

Details
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[155], line 4
      1 import pandas as pd
      3 df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})
----> 4 df.groupby('A', as_index=False)['B'].nlargest()

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/generic.py:1065, in SeriesGroupBy.nlargest(self, n, keep)
   1062 data = self._selected_obj
   1063 # Don't change behavior if result index happens to be the same, i.e.
   1064 # already ordered and n >= all group sizes.
-> 1065 result = self._python_apply_general(f, data, not_indexed_same=True)
   1066 return result

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1406, in GroupBy._python_apply_general(self, f, data, not_indexed_same, is_transform, is_agg)
   1403 if not_indexed_same is None:
   1404     not_indexed_same = mutated
-> 1406 return self._wrap_applied_output(
   1407     data,
   1408     values,
   1409     not_indexed_same,
   1410     is_transform,
   1411 )

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/generic.py:390, in SeriesGroupBy._wrap_applied_output(self, data, values, not_indexed_same, is_transform)
    388     result.name = self.obj.name
    389 if not self.as_index and not_indexed_same:
--> 390     result = self._insert_inaxis_grouper(result)
    391     result.index = default_index(len(result))
    392 return result

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/groupby/groupby.py:1106, in GroupBy._insert_inaxis_grouper(self, result)
   1098 for name, lev, in_axis in zip(
   1099     reversed(self.grouper.names),
   1100     reversed(self.grouper.get_group_levels()),
   (...)
   1103     # GH #28549
   1104     # When using .apply(-), name will be in columns already
   1105     if in_axis and name not in columns:
-> 1106         result.insert(0, name, lev)
   1108 return result

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/frame.py:4776, in DataFrame.insert(self, loc, column, value, allow_duplicates)
   4773 if not isinstance(loc, int):
   4774     raise TypeError("loc must be int")
-> 4776 value = self._sanitize_column(value)
   4777 self._mgr.insert(loc, column, value)

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/frame.py:4870, in DataFrame._sanitize_column(self, value)
   4867     return _reindex_for_setitem(Series(value), self.index)
   4869 if is_list_like(value):
-> 4870     com.require_length_match(value, self.index)
   4871 return sanitize_array(value, self.index, copy=True, allow_2d=True)

File ~/anaconda3/envs/modin-dev/lib/python3.8/site-packages/pandas/core/common.py:576, in require_length_match(data, index)
    572 """
    573 Check the length of data matches the length of the index.
    574 """
    575 if len(data) != len(index):
--> 576     raise ValueError(
    577         "Length of values "
    578         f"({len(data)}) "
    579         "does not match length of index "
    580         f"({len(index)})"
    581     )

ValueError: Length of values (2) does not match length of index (5)
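
Reading the traceback, the failure appears to come from `_insert_inaxis_grouper` calling `result.insert(0, name, lev)` with a value that has one entry per group (2) on a result that has one row per selected value (5). The sketch below reproduces only that length mismatch using the public `DataFrame.insert`; the names `result` and `group_level` are illustrative stand-ins, not the objects pandas builds internally.

```python
import pandas as pd

# Stand-in for the 5-row nlargest result (values are illustrative).
result = pd.DataFrame({"B": [8, 6, 5, 9, 7]})

# Stand-in for the group level: one entry per distinct key in 'A'.
group_level = [1, 2]

try:
    # Inserting a 2-element column into a 5-row frame cannot be aligned.
    result.insert(0, "A", group_level)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```
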

Expected Behavior

The call should not raise an exception.
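
The report does not say what shape the as_index=False result should have. As a possible workaround until the behavior is settled, and assuming the goal is a flat frame with the group key as a regular column, one can run the working as_index=True path and move the key out of the index afterwards. The names `per_group` and `flat` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [5, 6, 7, 8, 9]})

# Default as_index=True: a Series indexed by (A, original row label).
per_group = df.groupby("A")["B"].nlargest()

# Move the group key from the index into a column; the original row
# labels remain as the index.
flat = per_group.reset_index(level=0)
print(flat)
```

Calling `flat.reset_index(drop=True)` afterwards would give a default integer index, if that is closer to what as_index=False is expected to produce.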

Installed Versions

INSTALLED VERSIONS

commit : 965ceca
python : 3.8.16.final.0
python-bits : 64
OS : Darwin
OS-release : 22.5.0
Version : Darwin Kernel Version 22.5.0: Mon Apr 24 20:51:50 PDT 2023; root:xnu-8796.121.2~5/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.0.2
numpy : 1.24.3
pytz : 2023.3
dateutil : 2.8.2
setuptools : 66.0.0
pip : 23.0.1
Cython : 0.29.34
pytest : 7.3.1
hypothesis : None
sphinx : 7.0.0
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : 2.9.6
jinja2 : 3.1.2
IPython : 8.12.1
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : 1.0.9
fastparquet : 2022.12.0
fsspec : 2023.4.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : 2.8.4
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.19.1
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2023.4.0
scipy : 1.10.1
snappy : None
sqlalchemy : 1.4.45
tables : 3.8.0
tabulate : None
xarray : 2023.1.0
xlrd : 2.0.1
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@mvashishtha added the Bug and Needs Triage labels Jun 16, 2023
@rahulsiloniya
Contributor

I ran some type checks on what groupby returns for both values of as_index:

import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})

series_result = df.groupby('A', as_index=False)['B']
print("type : ", type(series_result))
ser_result_true = df.groupby('A')['B']
print("type : ", type(ser_result_true))

The two return types differ, as shown:

type :  <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
type :  <class 'pandas.core.groupby.generic.SeriesGroupBy'>

It turns out that DataFrameGroupBy objects don't support nlargest().
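
The class returned by the column selection may depend on the pandas version: the traceback in the report goes through `SeriesGroupBy.nlargest` on pandas 2.0.2, while the output above shows `DataFrameGroupBy`. A quick way to check a given installation, offered as a sketch rather than a statement about any particular release:

```python
import pandas as pd
from pandas.core.groupby.generic import DataFrameGroupBy, SeriesGroupBy

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [5, 6, 7, 8, 9]})

# Select a single column from a groupby created with as_index=False.
selected = df.groupby("A", as_index=False)["B"]

# Report the installed version next to the class actually returned.
print(pd.__version__, type(selected).__name__)

# Report which groupby classes expose nlargest in this installation.
print("SeriesGroupBy has nlargest:", hasattr(SeriesGroupBy, "nlargest"))
print("DataFrameGroupBy has nlargest:", hasattr(DataFrameGroupBy, "nlargest"))
```
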

pandas/core/series.py states that as_index=False is only valid with a DataFrame:

if not as_index:
    raise TypeError("as_index=False only valid with DataFrame")

As a workaround, you can cast the groupby() output to pd.DataFrame(); the DataFrame nlargest() method then requires two arguments, n: int and columns: list.

Here is an example to start with:

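import pandas as pd
import tabulate  # to_markdown() depends on the tabulate package

# same frame as in the report
df = pd.DataFrame({'A': [1, 1, 2, 1, 2], 'B': [5, 6, 7, 8, 9]})

# pass as_index as False and cast to pandas.DataFrame

series_result = pd.DataFrame(df.groupby('A', as_index=False)['B'])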
print("type : ", type(series_result))

# pass regular call to groupby
ser_result_true = df.groupby('A')['B']
print("type : ", type(ser_result_true))


print("Showing series_result (with as_index = False)")
print(series_result.to_markdown())
print('Showing ser_result_true')

# cast only to print as markdown

print(pd.DataFrame(ser_result_true).to_markdown())
result_true = ser_result_true.nlargest()

# result from default call
print('result_true markdown')
print(result_true.to_markdown())

# result from as_index=False call to groupby

result_false = series_result.nlargest(1, columns=[0])
print('result_false markdown')
print(result_false.to_markdown())

@rhshadrach
Member

Thanks for the report! This is somewhat complicated by #53707. If we were to treat nsmallest and nlargest as filters, then as_index should make no difference. My recommendation is to hold off on fixing this until these methods do behave like filters, and then ensure as_index plays no role in producing the result.
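
As a rough illustration of what "behaving like a filter" could mean here, the sketch below keeps the top `n` values per group while leaving the original index untouched, so the group key never enters the result index. This is an assumption about the intended semantics built from a per-group rank, not how pandas implements nlargest, and `n`, `rank_in_group` and `top_n` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 1, 2], "B": [5, 6, 7, 8, 9]})
n = 2

# Rank values within each group, largest first; ties broken by position.
rank_in_group = df.groupby("A")["B"].rank(method="first", ascending=False)

# Keep rows ranked in the top n of their group, in original row order.
top_n = df.loc[rank_in_group <= n, "B"]
print(top_n)
```

Because the result is a subset of the original rows, there is no group key to place in an index or a column, which is why as_index would have nothing to act on under this reading.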

@rhshadrach removed the Needs Triage label Jun 21, 2023