BUG: column name conflict & as_index=False breaks groupby ops #8585

Merged (1 commit) on Oct 28, 2014

Conversation

behzadnouri
Contributor

closes #7115
closes #8112
closes #8582

@jreback
Contributor

jreback commented Oct 19, 2014

what is the point of the added argument in_axis?

nvm, I realized that's internal, so ok

@jreback
Contributor

jreback commented Oct 19, 2014

fyi, need to put closes separately for each issue

i'll have a look soon

@jreback jreback added this to the 0.15.1 milestone Oct 19, 2014
@behzadnouri
Contributor Author

this line relies on the grouper name to infer which columns to exclude from _selected_obj, so it breaks whenever a grouper's name conflicts with one of the frame's columns: in #8112 it drops a column because of the conflict, and in #7115 df['a'] < 2 has the same name as df['a'].

with this patch, the groupers that should be excluded are identified explicitly when the groupers are initialized, by setting the value of in_axis.

>>> df = pd.DataFrame({'a':range(5), 'b': range(5, 10)})
>>> gr = df.groupby(df.a<2)  # name conflict
>>> gr.nth(0)  # invokes set_selection_from_grouper internally
   b
a   
0  5
2  7
>>> gr.apply(max)  # column `a` is dropped
       b
a       
False  9
True   6
>>> df.groupby(df.a<2).apply(max)  # if gr.nth(0) was not called first
       a  b
a          
False  4  9
True   1  6
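
To make the difference concrete, here is a toy sketch of the two exclusion strategies. It is not the pandas implementation: `columns_kept_by_name` and `columns_kept_by_flag` are hypothetical helpers invented for illustration, and the `(name, in_axis)` pairs only mimic the idea of the flag described above.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})


def columns_kept_by_name(frame, grouper_names):
    """Hypothetical name-based exclusion: drop every column whose label
    matches a grouper name, whether or not that column is the grouper."""
    return [c for c in frame.columns if c not in grouper_names]


def columns_kept_by_flag(frame, groupers):
    """Hypothetical explicit exclusion: each grouper is a (name, in_axis)
    pair, and only groupers flagged in_axis=True are dropped."""
    dropped = {name for name, in_axis in groupers if in_axis}
    return [c for c in frame.columns if c not in dropped]


key = df["a"] < 2  # derived Series that still carries the name "a"

# Name-based inference wrongly drops column "a".
by_name = columns_kept_by_name(df, [key.name])

# With an explicit flag, the derived key is marked as not in the axis.
by_flag = columns_kept_by_flag(df, [(key.name, False)])

# A real column key, as in df.groupby("a"), would be flagged in_axis=True.
by_flag_column = columns_kept_by_flag(df, [("a", True)])

print(by_name, by_flag, by_flag_column)
```

The point of the sketch is only that the name alone cannot tell a derived key apart from the column it was derived from, whereas a flag set at construction time can.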

@jreback
Contributor

jreback commented Oct 25, 2014

@behzadnouri this looks great, thanks!

need a little example in v0.15.1.txt, in the API section. pls show the former behavior (use your example above, maybe a little shorter) as a code-block then add the new.

It's really a bug-fix, but it's a change, so it's important to show (and to reference the 3 issues).

lmk when ready and green.

@behzadnouri
Contributor Author

added the example to v0.15.1.txt

0 0 81 97
1 5 62 75
2 10 93 41

Contributor

ok, then add the current results after (using the same constructed data). Also pls show the df's here right after you construct them (you can do it in the current results). thxs. This is a nice fix but needs to be explained to users in detail.

@behzadnouri
Contributor Author

done!

@jreback
Contributor

jreback commented Oct 27, 2014

This doesn't look right (#8112); the index should be False/True

In [13]: gr.nth(0)
Out[13]: 
     jim  joe
jim          
0      0    5
2      2    7

Here is with first

In [12]: gr.first()
Out[12]: 
       jim  joe
jim            
False    2    7
True     0    5

Also pls show this example as it changes in this release as well

@behzadnouri
Contributor Author

I think .nth by design shows the original frame's index, so it should not look like .first in any case; but it does seem buggy that when the grouper's name conflicts with a column, it takes that column's values as the index:

on master:

In [1]: df = pd.DataFrame({'jim': range(5), 'joe': range(5, 10)},
   ...:                   index=list('abcde'))

In [2]: df
Out[2]: 
   jim  joe
a    0    5
b    1    6
c    2    7
d    3    8
e    4    9

In [3]: df.groupby(df['jim'] < 2).nth(0)   # buggy, drops the column
Out[3]: 
     joe
jim     
0      5
2      7

In [4]: df.groupby(df['jim'] < 2, as_index=False).nth(0) # ok!
Out[4]: 
   jim  joe
a    0    5
c    2    7

In [5]: df.groupby((df['jim'] < 2).values).nth(0)  # ok!
Out[5]: 
   jim  joe
a    0    5
c    2    7

In [6]: df.groupby(pd.Series(df['jim'] < 2, name='xyz')).nth(0)  # ok!
Out[6]: 
   jim  joe
a    0    5
c    2    7

on branch

In [3]: df.groupby(df['jim'] < 2).nth(0)  # still buggy
Out[3]: 
     jim  joe
jim          
0      0    5
2      2    7
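
The two workarounds used above (passing the raw values, or giving the key a non-conflicting name) can be sketched outside the session as follows. This is a minimal illustration, not part of the patch; it uses `.max()` rather than `.nth(0)` because the behaviour of `nth` is exactly what is under discussion here and may differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"jim": range(5), "joe": range(5, 10)},
                  index=list("abcde"))
mask = df["jim"] < 2  # Series named "jim", same name as a column

# Workaround 1: pass a plain array, so the key has no name at all.
by_array = df.groupby(mask.to_numpy()).max()

# Workaround 2: rename the key so it cannot collide with a column.
by_renamed = df.groupby(mask.rename("xyz")).max()

print(by_array)
print(by_renamed)
```

Both forms keep the `jim` column in the result; they differ only in the name of the resulting index.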

@jreback
Contributor

jreback commented Oct 27, 2014

This is on master (the behavior of as_index=True/False)

In [7]: df
Out[7]: 
     A      B         C         D
0  foo    one  1.717396 -0.534594
1  bar    one -0.884431 -0.652615
2  foo    two -0.597500  1.097545
3  bar  three  0.737890  1.012258
4  foo    two  1.430802 -1.410305
5  bar    two -0.233767  0.739961
6  foo    one -0.726651  0.575353
7  foo  three  1.392807 -0.891813

In [8]: df.groupby('A').nth(0)
Out[8]: 
       B         C         D
A                           
bar  one -0.884431 -0.652615
foo  one  1.717396 -0.534594

In [9]: df.groupby('A',as_index=False).nth(0)
Out[9]: 
     A    B         C         D
0  foo  one  1.717396 -0.534594
1  bar  one -0.884431 -0.652615
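
For comparison, the same as_index switch can be sketched with a plain aggregation on a small, deterministic frame. The data below is made up for illustration, and `sum` is used instead of `nth`, so row order and index handling are not meant to match the session above.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                   "C": [1, 2, 3, 4]})

# as_index=True (the default): the group key becomes the result index.
indexed = df.groupby("A").sum()

# as_index=False: the group key stays a regular column.
flat = df.groupby("A", as_index=False).sum()

print(indexed)
print(flat)
```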

@behzadnouri
Contributor Author

what should the index be with:

  1. df.groupby(df['A'] == 'foo')?
  2. df.groupby((df['A'] == 'foo').values)?
  3. df.groupby(Series(df['A'] == 'foo', name='xyz'))?

either way, i think nth has inconsistent behavior, but that issue is not raised in any of #8582, #8112, #7115.

I mean the above issues are about the result columns, not the result index.
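
One way to probe the three cases listed above is to look only at the name of the resulting index. This is a small sketch on made-up data, using `.first()` as a stand-in aggregation; it says nothing about what `nth` should return, and the output may vary with the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [1, 2, 3]})
mask = df["A"] == "foo"  # boolean Series that inherits the name "A"

index_names = {
    # 1. key is a Series whose name collides with column "A"
    "series": df.groupby(mask).first().index.name,
    # 2. key is a bare array with no name
    "array": df.groupby(mask.to_numpy()).first().index.name,
    # 3. key is the same Series under a non-conflicting name
    "renamed": df.groupby(mask.rename("xyz")).first().index.name,
}

print(index_names)
```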

jreback added a commit that referenced this pull request Oct 28, 2014
BUG: column name conflict & as_index=False breaks groupby ops
@jreback jreback merged commit f2c9390 into pandas-dev:master Oct 28, 2014
@jreback
Contributor

jreback commented Oct 28, 2014

@behzadnouri thanks, FYI the master issue for as_index consistency is #5755
