Inconsistent handling of index after groupby operation #15272

pirsquared · 2017-01-31T07:01:52Z

snippet 1


import pandas as pd

df = pd.DataFrame(dict(A=[0, 1, 2, 3]))

# returns results identical to df.A
print(df.groupby(df.A // 2).A.nsmallest(2))

# returns results out of order
print(df.groupby(df.A // 2).A.nlargest(2))

0    0
1    1
2    2
3    3
Name: A, dtype: int64
A   
0  1    1
   0    0
1  3    3
   2    2
Name: A, dtype: int64

snippet 2


df = pd.DataFrame(dict(A=[0, 1, 2, 3]))

print(df.groupby(df.A // 2).A.apply(pd.Series.sample, n=2))

Problem description

When a groupby operation returns the same values that the group contained in the first place, the index is left identical to that of the object being grouped. This doesn't sound so horrible until you realize that it is inconsistent with very comparable operations, as snippet 1 shows. Snippet 2 puts a finer point on it: the same code sample produces randomly different results from one run to the next.
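One way to make the inconsistency visible without reading the printed output is to compare the number of index levels of the two results. The following is a minimal sketch using only the calls from snippet 1; the variable names are illustrative. On the pandas version reported here the two counts are expected to differ, while versions that include the fix referenced later in this thread (#42596) should report two levels for both.

```python
import pandas as pd

df = pd.DataFrame(dict(A=[0, 1, 2, 3]))
grouped = df.groupby(df.A // 2).A

smallest = grouped.nsmallest(2)
largest = grouped.nlargest(2)

# Comparable operations should yield the same number of index levels.
print("nsmallest index levels:", smallest.index.nlevels)
print("nlargest index levels:", largest.index.nlevels)
print("consistent:", smallest.index.nlevels == largest.index.nlevels)
```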

Expected Output

A   
0  0    0
   1    1
1  2    2
   3    3
Name: A, dtype: int64
A   
0  1    1
   0    0
1  3    3
   2    2
Name: A, dtype: int64

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 16.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: 1.3.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: 0.999999999
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: 0.2.1


jorisvandenbossche · 2017-01-31T09:15:14Z

@pirsquared Thanks for the report!

The nlargest and nsmallest should indeed be consistent with each other, and should always produce the hierarchical index, I think.

The apply case is a more difficult issue. It tries to infer what to do based on the return value. It is true that it is not good that the output shape is inconsistent in your example, but of course sample also has a random aspect, so I'm not sure how this could be solved.

jreback · 2017-01-31T14:52:02Z

Agree with @jorisvandenbossche here. These are implemented using .apply, which makes a best effort to coerce the final shape. But they should actually use a slightly lower-level API that directs the reshaping in a consistent manner.

This is very similar to #15260 (comment)

where we need to apply a function but control the final shape, as it cannot be inferred properly.
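To illustrate what controlling the final shape could look like from user code, here is a minimal sketch that applies a function to each group and assembles the pieces with `pd.concat`, so the group key is always prepended as an outer index level. The helper `apply_with_group_keys` is hypothetical and is not the lower-level pandas API mentioned above; it assumes `by` is a named Series and that `func` returns a Series for each group.

```python
import pandas as pd


def apply_with_group_keys(series, by, func, **kwargs):
    """Hypothetical helper: apply func per group, always prepending the group key."""
    # Collect one result per group, keyed by the group label.
    pieces = {key: func(group, **kwargs) for key, group in series.groupby(by)}
    # Concatenating a dict puts its keys in a new outer index level,
    # regardless of whether func reordered the rows of the group.
    return pd.concat(pieces, names=[by.name])


df = pd.DataFrame(dict(A=[0, 1, 2, 3]))

for seed in range(5):
    result = apply_with_group_keys(
        df.A, df.A // 2, pd.Series.sample, n=2, random_state=seed
    )
    # The shape no longer depends on the random order drawn by sample.
    print(seed, result.index.nlevels)
```

Because the shape is decided by the caller rather than inferred from the return value, the result for snippet 2 has the same two-level index on every run, whatever order `sample` happens to draw.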

jorisvandenbossche added Bug Groupby labels Jan 31, 2017

jreback added Difficulty Intermediate labels Jan 31, 2017

jreback added this to the 0.20.0 milestone Jan 31, 2017

jreback mentioned this issue Feb 1, 2017

DataFrameGroupBy.idxmin() returns DataFrame, documentation says Series #15275

Closed

jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

rhshadrach mentioned this issue Jul 18, 2021

BUG: SeriesGroupBy.nlargest/smallest inconsistent shape #42596

Merged

6 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Aug 5, 2021

jreback closed this as completed in #42596 Aug 5, 2021
