Result of groupby -> select -> nlargest duplicates index levels #17477

JamesOwers · 2017-09-08T16:31:57Z

To reproduce:

import pandas as pd
import numpy as np
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index2 = pd.MultiIndex.from_tuples(tuples, names=['1st', '2nd'])
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index2[:6])
df.stack('1st').\
    groupby(level=['1st', 'first'])['one'].\
    nlargest(3)

output:

1st  first  first  second  1st
bar  bar    bar    two     bar    1.985984
                   one     bar   -0.175834
     baz    baz    two     bar    1.089335
                   one     bar   -1.447996
     foo    foo    one     bar    1.154403
                   two     bar   -0.112992
baz  bar    bar    one     baz    0.528254
                   two     baz   -2.052733
     baz    baz    one     baz   -0.447332
                   two     baz   -0.818858
     foo    foo    one     baz   -0.939259
                   two     baz   -1.282821
foo  bar    bar    two     foo    2.075440
                   one     foo   -0.423804
     baz    baz    one     foo    0.250846
                   two     foo    0.177904
     foo    foo    two     foo   -0.468794
                   one     foo   -0.627487
Name: one, dtype: float64

The expected result can be obtained by resetting the index, which drops the two prepended levels:

df.stack('1st').\
    groupby(level=['1st', 'first'])['one'].\
    nlargest(3).\
    reset_index(level=[0, 1], drop=True)
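
The workaround above hard-codes level=[0, 1], which matches the two grouping keys used here. A minimal sketch of a more general version follows. It assumes any extra levels are always prepended at the front of the index, and `drop_prepended_levels` is a hypothetical helper name, not a pandas function. A seeded generator stands in for np.random.randn only so that the run is repeatable.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index2 = pd.MultiIndex.from_tuples(tuples, names=['1st', '2nd'])

# Seeded data (an assumption for repeatability; the report used np.random.randn).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 6)),
                  index=index[:6], columns=index2[:6])
stacked = df.stack('1st')


def drop_prepended_levels(result, original_index):
    """Hypothetical helper: drop leading index levels so that `result`
    has as many levels as `original_index`. No-op if nothing was added."""
    extra = result.index.nlevels - original_index.nlevels
    if extra <= 0:
        return result
    # Positional levels avoid ambiguity between duplicated level names.
    return result.reset_index(level=list(range(extra)), drop=True)


raw = stacked.groupby(level=['1st', 'first'])['one'].nlargest(3)
fixed = drop_prepended_levels(raw, stacked.index)
print(list(fixed.index.names))
```

Because the helper compares level counts instead of assuming two extra levels, it should leave the result untouched on a pandas version that does not prepend the group keys.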

This is a problem because the resulting index contains unnecessary duplicate levels ('1st' and 'first' each appear twice). The behaviour is also inconsistent with the case where all index levels are used for grouping:

df.stack('1st').\
    groupby(level=['1st', 'first', 'second'])['one'].\
    nlargest(3)

Out:

first  second  1st
bar    one     bar   -0.175834
               baz    0.528254
               foo   -0.423804
       two     bar    1.985984
               baz   -2.052733
               foo    2.075440
baz    one     bar   -1.447996
               baz   -0.447332
               foo    0.250846
       two     bar    1.089335
               baz   -0.818858
               foo    0.177904
foo    one     bar    1.154403
               baz   -0.939259
               foo   -0.627487
       two     bar   -0.112992
               baz   -1.282821
               foo   -0.468794
Name: one, dtype: float64
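
To compare the two results without reading the printed index by eye, one option is to list the level names that occur more than once. The sketch below is only illustrative: `duplicated_level_names` is a hypothetical helper, not part of pandas, and the data is seeded for repeatability. What it prints depends on the pandas version in use, so no expected output is shown.

```python
from collections import Counter

import numpy as np
import pandas as pd


def duplicated_level_names(index):
    """Hypothetical helper: level names that occur more than once,
    in the order they first appear."""
    counts = Counter(index.names)
    return [name for name, count in counts.items() if count > 1]


arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index2 = pd.MultiIndex.from_tuples(tuples, names=['1st', '2nd'])

# Seeded data (an assumption; the report used unseeded np.random.randn).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 6)),
                  index=index[:6], columns=index2[:6])
stacked = df.stack('1st')

# Grouping by a subset of the index levels vs. by all of them.
partial = stacked.groupby(level=['1st', 'first'])['one'].nlargest(3)
full = stacked.groupby(level=['1st', 'first', 'second'])['one'].nlargest(3)

print(duplicated_level_names(partial.index))
print(duplicated_level_names(full.index))
```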

Pandas version 0.20.3

jreback · 2017-09-08T17:00:37Z

Please try on master; this is fixed there.

jreback · 2017-09-09T15:56:45Z

This was last changed in #15299. I suppose it's not correct for an already existing MultiIndex. If you want to look deeper, that would be helpful.

jreback · 2017-09-09T15:57:26Z

cc @RogerThomas

gfyoung added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label Sep 9, 2017

jreback added the Bug, Difficulty Intermediate, and Groupby labels Sep 9, 2017

jreback modified the milestones: No action, Next Major Release Sep 9, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach mentioned this issue Jul 15, 2023

BUG: make SeriesGroupBy nlargest and nsmallest behave like other filtrations #53707

