BUG: column name conflict & as_index=False breaks groupby ops #8585

behzadnouri · 2014-10-19T21:00:29Z

closes #7115
closes #8112
closes #8582

jreback · 2014-10-19T21:13:22Z

what is the point of the added argument in_axis?

nvm, I realized that's internal, so ok

jreback · 2014-10-19T21:27:14Z

fyi, need to put closes separately for each issue

i'll have a look soon -

behzadnouri · 2014-10-19T21:44:44Z

this line relies on the grouper name to infer which columns to exclude from _selected_obj, and it breaks if a grouper has a name conflict with one of the frame columns: in #8112 it drops a column because of the name conflict, and in #7115 df['a'] < 2 has the same name as df['a'].

with this patch, the groupers which should be excluded are explicitly identified at initialization of groupers by setting the value of in_axis.

>>> df = pd.DataFrame({'a':range(5), 'b': range(5, 10)})
>>> gr = df.groupby(df.a<2)  # name conflict
>>> gr.nth(0)  # invokes set_selection_from_grouper internally
   b
a   
0  5
2  7
>>> gr.apply(max)  # column `a` is dropped
       b
a       
False  9
True   6
>>> df.groupby(df.a<2).apply(max)  # if gr.nth(0) was not called first
       a  b
a          
False  4  9
True   1  6
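
A minimal sketch of the idea behind in_axis, in plain Python rather than pandas internals. The `Grouping` class and both helper functions are hypothetical names made up for illustration; they only contrast "exclude by name" with "exclude by an explicit flag set when the grouper is created".

```python
from dataclasses import dataclass


@dataclass
class Grouping:
    """Hypothetical stand-in for a grouper; not the pandas class."""
    name: str
    in_axis: bool  # True only if the key was given as a column label of the frame


def exclude_by_name(columns, groupings):
    # Old approach: drop every column whose name matches a grouper name.
    names = {g.name for g in groupings}
    return [c for c in columns if c not in names]


def exclude_by_flag(columns, groupings):
    # Patched approach: drop only columns whose grouper was flagged in_axis.
    in_axis_names = {g.name for g in groupings if g.in_axis}
    return [c for c in columns if c not in in_axis_names]


columns = ['a', 'b']
by_label = [Grouping('a', in_axis=True)]   # like df.groupby('a')
by_mask = [Grouping('a', in_axis=False)]   # like df.groupby(df.a < 2): same name, external values

print(exclude_by_name(columns, by_mask))   # ['b']       column 'a' wrongly dropped
print(exclude_by_flag(columns, by_mask))   # ['a', 'b']  column 'a' kept
print(exclude_by_flag(columns, by_label))  # ['b']       real key column still excluded
```

With the flag, a boolean mask that merely shares the name `a` no longer causes column `a` to be excluded, while grouping by the label `'a'` behaves as before.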

jreback · 2014-10-25T00:02:12Z

@behzadnouri this looks gr8!. thanks!

need a little example in v0.15.1.txt, in the API section. pls show the former behavior (use your example above, maybe a little shorter) as a code-block then add the new.

It's really a bug fix, but it's a change, so it's important to show (and reference the 3 issues).

lmk when ready and green.

behzadnouri · 2014-10-26T16:59:11Z

added the example to v0.15.1.txt

jreback · 2014-10-27T00:05:24Z

doc/source/whatsnew/v0.15.1.txt

+     0    0   81   97
+     1    5   62   75
+     2   10   93   41
+


ok, then add the current results after (using the same constructed data). Also pls show the df's here right after you construct them (you can do it in the current results). thxs. This is a nice fix but needs to be explained to users in detail.

behzadnouri · 2014-10-27T02:30:22Z

done!

jreback · 2014-10-27T10:51:16Z

This doesn't look right (#8112), should be False/True for the index

In [13]: gr.nth(0)
Out[13]: 
     jim  joe
jim          
0      0    5
2      2    7

Here is with first

In [12]: gr.first()
Out[12]: 
       jim  joe
jim            
False    2    7
True     0    5

Also pls show this example as it changes in this release as well

behzadnouri · 2014-10-27T11:24:59Z

I think .nth by design shows the original frame's index, so it should not look like .first in any case; but it does seem buggy that when the grouper's name conflicts with a column, it takes that column's values as the index:

on master:

In [1]: df = pd.DataFrame({'jim': range(5), 'joe': range(5, 10)},
   ...:                   index=list('abcde'))

In [2]: df
Out[2]: 
   jim  joe
a    0    5
b    1    6
c    2    7
d    3    8
e    4    9

In [3]: df.groupby(df['jim'] < 2).nth(0)   # buggy, drops the column
Out[3]: 
     joe
jim     
0      5
2      7

In [4]: df.groupby(df['jim'] < 2, as_index=False).nth(0) # ok!
Out[4]: 
   jim  joe
a    0    5
c    2    7

In [5]: df.groupby((df['jim'] < 2).values).nth(0)  # ok!
Out[5]: 
   jim  joe
a    0    5
c    2    7

In [6]: df.groupby(pd.Series(df['jim'] < 2, name='xyz')).nth(0)  # ok!
Out[6]: 
   jim  joe
a    0    5
c    2    7

on branch

In [3]: df.groupby(df['jim'] < 2).nth(0)  # still buggy
Out[3]: 
     jim  joe
jim          
0      0    5
2      2    7
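
The renamed-key case (In [6] above) can be written as a small standalone script. This is only a sketch: it assumes a pandas version where `Series.rename` accepts a scalar name, it uses `.first()` rather than `.nth(0)` because the two return differently indexed results, and `'low'` is an arbitrary illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'jim': range(5), 'joe': range(5, 10)},
                  index=list('abcde'))

# Give the boolean key its own name so it cannot be confused with column 'jim'.
# 'low' is an arbitrary name chosen for this example.
key = (df['jim'] < 2).rename('low')

# Group by the renamed key; both original columns should remain in the result.
result = df.groupby(key).first()
print(result)
```

Renaming the key sidesteps the name conflict entirely, so it can serve as a workaround on versions that still infer excluded columns from the grouper name.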

jreback · 2014-10-27T11:45:34Z

This is on master (the behavior of as_index=True/False)

In [7]: df
Out[7]: 
     A      B         C         D
0  foo    one  1.717396 -0.534594
1  bar    one -0.884431 -0.652615
2  foo    two -0.597500  1.097545
3  bar  three  0.737890  1.012258
4  foo    two  1.430802 -1.410305
5  bar    two -0.233767  0.739961
6  foo    one -0.726651  0.575353
7  foo  three  1.392807 -0.891813

In [8]: df.groupby('A').nth(0)
Out[8]: 
       B         C         D
A                           
bar  one -0.884431 -0.652615
foo  one  1.717396 -0.534594

In [9]: df.groupby('A',as_index=False).nth(0)
Out[9]: 
     A    B         C         D
0  foo  one  1.717396 -0.534594
1  bar  one -0.884431 -0.652615
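
For comparison, a small sketch of the as_index distinction using `.first()` on made-up data. `.first()` is used here instead of `.nth(0)` because the result shape of `nth` has not been the same across pandas releases; the column values below are illustrative, not taken from the frame above.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# as_index=True (the default): the group key becomes the index of the result.
indexed = df.groupby('A').first()

# as_index=False: the group key stays an ordinary column.
flat = df.groupby('A', as_index=False).first()

print(indexed)
print(flat)
```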

behzadnouri · 2014-10-27T13:36:49Z

what should the index be with:

df.groupby(df['A'] == 'foo')?
df.groupby((df['A'] == 'foo').values)?
df.groupby(pd.Series(df['A'] == 'foo', name='xyz'))?

either way, i think nth has inconsistent behavior, but this issue is not raised in any of #8582, #8112, #7115.

I mean the above issues are about the result columns, not the result index.

jreback · 2014-10-28T00:05:18Z

@behzadnouri thanks, FYI the master issue for as_index consistency is #5755

jreback added Bug Groupby labels Oct 19, 2014

jreback added this to the 0.15.1 milestone Oct 19, 2014

behzadnouri force-pushed the in-axis branch from 76c1e3a to 5b00578 Compare October 26, 2014 16:16

jreback reviewed Oct 27, 2014

BUG: column name conflict & as_index=False breaks groupby ops

fef0f4a

behzadnouri force-pushed the in-axis branch from 5b00578 to fef0f4a Compare October 27, 2014 01:02

jreback added a commit that referenced this pull request Oct 28, 2014

Merge pull request #8585 from behzadnouri/in-axis

f2c9390

BUG: column name conflict & as_index=False breaks groupby ops

jreback merged commit f2c9390 into pandas-dev:master Oct 28, 2014

jreback modified the milestones: 0.15.2, 0.15.1 Oct 30, 2014

aimboden mentioned this pull request Nov 10, 2014

BUG: categorical column dropped from groupby agg result when as_index=False #8770

Closed

behzadnouri deleted the in-axis branch December 7, 2014 13:23
