Skip to content

TST in .drop and .groupby for dataframes with multi-indexed columns #11717

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

TST in .drop and .groupby for dataframes with multi-indexed columns #11717

wants to merge 1 commit into from

Conversation

nbonnotte
Copy link
Contributor

Bug 1 [edit: issues #11640 and #12078, fixed by PR #12158 but not entirely tested]

There was a bug deeply hidden in pandas/core/index.py, where it was assumed that the .get_loc method of an index would only return an integer or a slice, while it can also return a mask.

Simply correcting that bug has two consequences:

  • now for a dataframe df with multi-indexed columns, df.drop('a', axis=1) works if df['a'] returns something, e.g. if the column ('a', '') exists
  • likewise, df.groupby('a') also works in that case

Example:

In [2]: df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [3]: dg = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [4]: dg
Out[4]: 
b  a b1 b2
c    c1 c2
0  1  3  4

In [5]: dg['a']
Out[5]: 
0    1
Name: a, dtype: int64

In [8]: dg.drop('a', axis=1)
Out[8]: 
b b1 b2
c c1 c2
0  3  4

In [9]: dg.groupby('a').mean()
Out[9]: 
b b1 b2
c c1 c2
a      
1  3  4

Bug 2 [edit: #11741, solved by PR #12063]

I also correct a little bug I mentioned in the talk on issue #11640: .groupby failed to raise a KeyError for a non-existing column when there was only one row:

In [6]: dg.groupby('z').mean()
Out[6]: 
b  a b1 b2
c    c1 c2
z  1  3  4

Now we do have:

In [4]: dg.groupby('z').mean()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 'z'

@jreback jreback added Groupby Error Reporting Incorrect or improved errors from pandas labels Nov 29, 2015
@jreback jreback added this to the 0.18.0 milestone Nov 29, 2015
inds.extend(lrange(loc.start, loc.stop))
else: # loc is a mask
loc = [i for i, val in enumerate(loc) if val]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

loc = loc.nonzero()[0]

and do elsif is_bool_indexer(loc)
then make the else raise an AssertionError (meaning we don't expect to hit this and its a bug)

@nbonnotte
Copy link
Contributor Author

@jreback I go back here because I've spotted something really, really wrong, and I don't want my comment to disappear when the commit gets outdated.

For the completeness of the discussion, I'll recall you just defined two dataframes

In [39]: df1 = DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))
In [41]: df2 = pd.DataFrame([[1,2,3,4]],columns=pd.MultiIndex.from_tuples([('a','a1'),('a','a2'),('b1','c1'),('b2','c2')],names=['b','c'])) 

This df1 is meant to be identical to the dataframe dg I introduced in my first message.

Except that it is not.

Issue 1: df1 is not the same as dg!

On master, both df1.drop('a', axis=1) and df2.drop('a', axis=1) work. So does df1.groupby('a'), but not df2.groupby('a') since it is not 1-dimensional.

On the other hand, neither work for my weird dataframe dg obtained with .pivot_table and .reset.

Independently from what the API should or should not allow, there seems to be another bug: how is it possible that my dg and your df1 look the same, but do not behave the same?

Issue 2: what the API should or should not allow

As for the API, I'm not sure I understand what it should be.

What should, or should not, work, in the two following cases?

  • df1.groupby('a') and dg.groupby('a') (assuming df1 and dg are the same)
  • df1.drop('a', axis=1) and dg.drop('a', axis=1)

I don't think either would make the API more confusing:

  • it would seem natural to me that df1.drop('a', axis=1) work the same as df1.drop('a', axis=1, level=0)
  • it would also seem natural to me that df1.groupby('a') work if and only if df1.groupby(df1['a']) work

In short, I'm in favor of just solving the first issue, that is of making dg work exactly as df1 work now.

@jreback
Copy link
Contributor

jreback commented Nov 30, 2015

@nbonnotte you are making this discussion about too many things.
df1.drop('a',axis=1) the same as df1.drop('a',axis=1,level=0) is a much larger discussion and will not be handled here.

@jreback jreback modified the milestones: Next Major Release, 0.18.0 Nov 30, 2015
@jreback
Copy link
Contributor

jreback commented Nov 30, 2015

show a very specific copy-pastable example of where the constructed dg and df1 are not the same (not they are NOT exactly the same as the levels of the columns are different; this is expected).

@nbonnotte
Copy link
Contributor Author

dg = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index()
df1 = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

If this discussion goes in too many directions, it's because I'm a bit confused about the direction to take.

What exactly should be the result of this pull request?

  1. It can solve a little bug in pandas/core/index.py, with the consequence that dg will behave the same regarding to .drop and .groupby as how df1 already behaves today. This is what the PR in its present state does. The resulting API may be confusing, but it will not be more confusing that what it is today.

  2. An exception with a clear message can be raised for dg.drop('a', axis=1) and dg.groupby('a'). It will allow nothing new compared with today, but the error message will be more explicit. But then, dg and df1, which are not the same but still for all intents and purpose are the same, will keep not behaving the same.

  3. An exception can be raised both for df1 and dg. Then, some code that currently work for dg will not work any more. This is problematic, and I just mention this possibility to list all alternatives.

Should I go for 2) then?

@jreback
Copy link
Contributor

jreback commented Dec 1, 2015

So this is a bit more complicated. I believe the difference is because of the lex sorting depth

In [1]: df1 = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

In [4]: result = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [5]: result.columns.lexsort_depth
Out[5]: 0

In [7]: df1.columns.lexsort_depth
Out[7]: 2
In [16]: pd.MultiIndex.from_tuples(result.columns.values)
Out[16]: 
MultiIndex(levels=[[u'a', u'b1', u'b2'], [u'', u'c1', u'c2']],
           labels=[[0, 1, 2], [0, 1, 2]])

If the index is rebuilt it is the same.

In [17]: pd.MultiIndex.from_tuples(result.columns.values).lexsort_depth
Out[17]: 2

Of this said, I don't actually think what I am saying here is a bug. These are both equally valid indexes.

So I have to think about this some more.

@nbonnotte
Copy link
Contributor Author

@jreback have you had the time to think about this?

If there are doubts, I'd suggest going for my option 2, which is conservative:

  1. An exception with a clear message can be raised for dg.drop('a', axis=1) and dg.groupby('a'). It will allow nothing new compared with today, but the error message will be more explicit. But then, dg and df1, which are not the same but still for all intents and purpose are the same, will keep not behaving the same.

so that we have a nice error message. Then we can create another issue about the (already existing) lack of consistency in the API.

inds.extend(lrange(loc.start, loc.stop))
elif is_bool_indexer(loc):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's restrict this issue only to the match_axis_length issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right.

I'll start an issue and a PR for the KeyError (bug 2 mentioned in the first message in this PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This last bug has since been reported as #11741

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solved with PR #12063

jreback pushed a commit that referenced this pull request Jan 17, 2016
No KeyError was raised when grouping by a non-existant column

Fixes #11741

Xref issue #11640, PR #11717
@nbonnotte
Copy link
Contributor Author

@jreback I've rewritten the tests in such a way as to make the changes somewhat more compelling

I hope you'll be convinced this does not make the API more confusing.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

@nbonnotte my point is this is not a bug

In [30]: df = pd.DataFrame(columns=['a', 'b', 'c', 'd'],data=[[1, 'b1', 'c1', 3],[1, 'b2', 'c2', 4]])

In [31]: df
Out[31]: 
   a   b   c  d
0  1  b1  c1  3
1  1  b2  c2  4

In [32]: df2 = df.pivot_table(index='a', columns=['b', 'c'], values='d')

In [33]: df2
Out[33]: 
b b1 b2
c c1 c2
a      
1  3  4

# this is exactly what it should do, 'a' is NOT in the index
In [34]: df2.groupby('a').mean()
KeyError: 'a'

In [35]: df2.groupby('a',level=0).mean()
Out[35]: 
b b1 b2
c c1 c2
a      
1  3  4

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

In [36]: df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [37]: 

In [37]: In [3]: df = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [39]: df.groupby('a').mean()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-a830c6135818> in <module>()
----> 1 df.groupby('a').mean()

/Users/jreback/pandas/pandas/core/groupby.pyc in mean(self)
    908             self._set_selection_from_grouper()
    909             f = lambda x: x.mean(axis=self.axis)
--> 910             return self._python_agg_general(f)
    911 
    912     @Substitution(name='groupby')

/Users/jreback/pandas/pandas/core/groupby.pyc in _python_agg_general(self, func, *args, **kwargs)
    751                 output[name] = self._try_cast(values[mask], result)
    752 
--> 753         return self._wrap_aggregated_output(output)
    754 
    755     def _wrap_applied_output(self, *args, **kwargs):

/Users/jreback/pandas/pandas/core/groupby.pyc in _wrap_aggregated_output(self, output, names)
   3511     def _wrap_aggregated_output(self, output, names=None):
   3512         agg_axis = 0 if self.axis == 1 else 1
-> 3513         agg_labels = self._obj_with_exclusions._get_axis(agg_axis)
   3514 
   3515         output_keys = self._decide_output_index(output, agg_labels)

/Users/jreback/pandas/pandas/core/groupby.pyc in __getattr__(self, attr)
    457 
    458         raise AttributeError("%r object has no attribute %r" %
--> 459                              (type(self).__name__, attr))
    460 
    461     plot = property(GroupByPlot)

AttributeError: 'DataFrameGroupBy' object has no attribute '_obj_with_exclusions'

In [40]: df.groupby('a',level=0).mean()
Out[40]: 
b  a b1 b2
c    c1 c2
0  1  3  4

from your original issue [39] should simply give a more informative errors message (and raise a KeyError), not actually try to fix the issue when the level is not specified, which would be a major API change.

@nbonnotte
Copy link
Contributor Author

@jreback my point is that, in the tests I wrote, df and reference are the same, in the sense of assert_frame_equal(df, reference), and yet today the same procedure raises an error for df and not for reference.

I'd be very happy to replace that error with a more explicit one, and start a new issue about the lack of consistency of the API. But first I just wanted to be sure things were clear enough ^^

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

you cannot change index code like that

just address the incorrect error message in the original issue

one issue per pr

@nbonnotte
Copy link
Contributor Author

Here we go.

I don't know if I've chosen the best place to raise a meaningful error, but like this it should at least be conservative (that is, not break nor enable new things).

Likewise, I'll be glad to have any feedback on the comments. I hope they're clear enough.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

no, don't try to fix anything in indexing. This is purely a groupby issue. You are still trying to make something that is an API change work.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

remove all of the indexing code that you added.

@nbonnotte
Copy link
Contributor Author

I'm not changing anything about the API. The result is just a better error message.

If I remove my changes in the indexing code, with some .groupby and .drop operations the method .get_loc() returns a boolean mask, which is not handled. So we get an AttributeError, which (when it is a .groupby) is caught and then stuff happen that should not, resulting in another obscur error.

So do you want me to instead catch that last obscur error at the top level, and replace it with a nice message?

Then I can create two other issues: one for .drop, and one for the inconsistency of the API.

@jreback
Copy link
Contributor

jreback commented Jan 17, 2016

this is a much simpler change

just step thru and catch it where it happens

@nbonnotte
Copy link
Contributor Author

Ok.

But I would feel more comfortable working on this issue once the .drop problem has been fixed, so I'll start directly a new issue for that.

@jreback
Copy link
Contributor

jreback commented Jan 18, 2016

the 'drop' issue is an API change that would need much more testing and review

@nbonnotte
Copy link
Contributor Author

I understand. I'm just talking about the obscur error message that happens in some cases when dropping, see #12078

@nbonnotte nbonnotte changed the title BUG in .drop and .groupby for dataframes with multi-indexed columns TST in .drop and .groupby for dataframes with multi-indexed columns Jan 29, 2016
@nbonnotte
Copy link
Contributor Author

Following PR #12158, the bug has disappeared.

This pull request now only contains tests.

If there are problems with the API, I'd be happy to throw errors where need be. But maybe that should be for another issue?

@jreback jreback added this to the 0.18.0 milestone Jan 29, 2016
jreback pushed a commit to jreback/pandas that referenced this pull request Jan 29, 2016
@jreback jreback closed this in b291dd6 Jan 29, 2016
@jreback
Copy link
Contributor

jreback commented Jan 29, 2016

@nbonnotte thanks!

@nbonnotte nbonnotte deleted the attribute-error-11640 branch January 29, 2016 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants