Result of groupby -> select -> nlargest duplicates index levels #17477

Open
JamesOwers opened this issue Sep 8, 2017 · 3 comments
Labels
Bug Groupby Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@JamesOwers

To reproduce:

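import numpy as np
import pandas as pd

# Two-level labels, reused for both the row index and the columns.
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index2 = pd.MultiIndex.from_tuples(tuples, names=['1st', '2nd'])
# 6x6 frame of random values; rows are (first, second), columns are (1st, 2nd).
df = pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index2[:6])
# Move the '1st' column level into the index, group on two of the three
# index levels, and take the largest values of column 'one' per group.
df.stack('1st').\
    groupby(level=['1st', 'first'])['one'].\
    nlargest(3)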

Output:

1st  first  first  second  1st
bar  bar    bar    two     bar    1.985984
                   one     bar   -0.175834
     baz    baz    two     bar    1.089335
                   one     bar   -1.447996
     foo    foo    one     bar    1.154403
                   two     bar   -0.112992
baz  bar    bar    one     baz    0.528254
                   two     baz   -2.052733
     baz    baz    one     baz   -0.447332
                   two     baz   -0.818858
     foo    foo    one     baz   -0.939259
                   two     baz   -1.282821
foo  bar    bar    two     foo    2.075440
                   one     foo   -0.423804
     baz    baz    one     foo    0.250846
                   two     foo    0.177904
     foo    foo    two     foo   -0.468794
                   one     foo   -0.627487
Name: one, dtype: float64

The expected result can be obtained by dropping the first two index levels with reset_index:

df.stack('1st').\
    groupby(level=['1st', 'first'])['one'].\
    nlargest(3).\
    reset_index(level=[0, 1], drop=True)
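
The workaround above hard-codes positions 0 and 1. As a rough sketch, the same idea can be written so that it finds the repeated levels by name instead. The helper `drop_repeated_levels` is hypothetical (it is not part of pandas), and the Series below is hand-built to mimic the index layout in the reported output rather than produced by groupby:

```python
import pandas as pd


def drop_repeated_levels(obj):
    """Drop every named index level whose name appears again further right.

    Hypothetical helper, not part of pandas. Levels are addressed by
    position because a repeated name cannot be used to select a level.
    """
    names = list(obj.index.names)
    repeated = [
        i for i, name in enumerate(names)
        if name is not None and name in names[i + 1:]
    ]
    if not repeated:
        return obj
    return obj.reset_index(level=repeated, drop=True)


# Hand-built Series that mimics the five-level index reported above.
idx = pd.MultiIndex.from_arrays(
    [
        ["bar", "bar", "baz"],  # '1st'   (prepended group key)
        ["bar", "bar", "foo"],  # 'first' (prepended group key)
        ["bar", "bar", "foo"],  # 'first'
        ["one", "two", "one"],  # 'second'
        ["bar", "bar", "baz"],  # '1st'
    ],
    names=["1st", "first", "first", "second", "1st"],
)
s = pd.Series([1.0, 2.0, 3.0], index=idx, name="one")

cleaned = drop_repeated_levels(s)
print(list(cleaned.index.names))
```

Keeping the right-most occurrence of each name matches the workaround, since the prepended group keys are the left-most levels.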

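This is a problem because the resulting index contains redundant levels: the group keys '1st' and 'first' are prepended even though both are already levels of the index. The behaviour is also inconsistent with grouping by all of the index levels, where no levels are duplicated: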

df.stack('1st').\
    groupby(level=['1st', 'first', 'second'])['one'].\
    nlargest(3)

Out:

first  second  1st
bar    one     bar   -0.175834
               baz    0.528254
               foo   -0.423804
       two     bar    1.985984
               baz   -2.052733
               foo    2.075440
baz    one     bar   -1.447996
               baz   -0.447332
               foo    0.250846
       two     bar    1.089335
               baz   -0.818858
               foo    0.177904
foo    one     bar    1.154403
               baz   -0.939259
               foo   -0.627487
       two     bar   -0.112992
               baz   -1.282821
               foo   -0.468794
Name: one, dtype: float64
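
To compare the two groupings without random data, here is a small deterministic sketch. It also tries `group_keys=False`, an existing `groupby` parameter; that it keeps the original three levels for `nlargest` is an assumption about the installed pandas version, not something confirmed in this thread. The Series `s` and its values are made up for illustration:

```python
import pandas as pd

# Deterministic stand-in for df.stack('1st')['one'] (illustrative values).
idx = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"], ["bar", "baz"]],
    names=["first", "second", "1st"],
)
s = pd.Series(range(8), index=idx, name="one", dtype="float64")

# Default behaviour: inspect which level names the result carries.
default = s.groupby(level=["1st", "first"]).nlargest(3)
print(list(default.index.names))

# group_keys=False asks pandas not to prepend the group keys
# (assumed to be honoured by nlargest in the installed version).
by_two = s.groupby(level=["1st", "first"], group_keys=False).nlargest(3)
by_three = s.groupby(
    level=["1st", "first", "second"], group_keys=False
).nlargest(3)
print(list(by_two.index.names))
print(list(by_three.index.names))
```

If the assumption holds, both grouped results carry the same level names as `s`, regardless of how many levels are grouped.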

pandas version: 0.20.3

@jreback
Contributor

jreback commented Sep 8, 2017

Please try on master; this is fixed.

@gfyoung gfyoung added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Sep 9, 2017
@jreback
Contributor

jreback commented Sep 9, 2017

This was last changed in #15299. I suppose it's not correct for an already existing MultiIndex. If you want to look deeper, that would be helpful.

@jreback jreback modified the milestones: No action, Next Major Release Sep 9, 2017
@jreback
Contributor

jreback commented Sep 9, 2017

cc @RogerThomas
