TST in .drop and .groupby for dataframes with multi-indexed columns #11717

nbonnotte · 2015-11-28T17:46:30Z

Bug 1 [edit: issues #11640 and #12078, fixed by PR #12158 but not entirely tested]

There was a bug deeply hidden in pandas/core/index.py, where it was assumed that the .get_loc method of an index would only return an integer or a slice, while it can also return a mask.

Simply correcting that bug has two consequences:

now for a dataframe df with multi-indexed columns, df.drop('a', axis=1) works if df['a'] returns something, e.g. if the column ('a', '') exists
likewise, df.groupby('a') also works in that case

Example:

In [2]: df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [3]: dg = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [4]: dg
Out[4]: 
b  a b1 b2
c    c1 c2
0  1  3  4

In [5]: dg['a']
Out[5]: 
0    1
Name: a, dtype: int64

In [8]: dg.drop('a', axis=1)
Out[8]: 
b b1 b2
c c1 c2
0  3  4

In [9]: dg.groupby('a').mean()
Out[9]: 
b b1 b2
c c1 c2
a      
1  3  4

Bug 2 [edit: #11741, solved by PR #12063]

I also correct a little bug I mentioned in the talk on issue #11640: .groupby failed to raise a KeyError for a non-existing column when there was only one row:

In [6]: dg.groupby('z').mean()
Out[6]: 
b  a b1 b2
c    c1 c2
z  1  3  4

Now we do have:

In [4]: dg.groupby('z').mean()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 'z'

jreback · 2015-11-29T17:40:54Z

pandas/core/index.py

                    inds.extend(lrange(loc.start, loc.stop))
+                else:  # loc is a mask
+                    loc = [i for i, val in enumerate(loc) if val]


loc = loc.nonzero()[0]

and do elsif is_bool_indexer(loc)
then make the else raise an AssertionError (meaning we don't expect to hit this and its a bug)

nbonnotte · 2015-11-30T09:51:19Z

@jreback I go back here because I've spotted something really, really wrong, and I don't want my comment to disappear when the commit gets outdated.

For the completeness of the discussion, I'll recall you just defined two dataframes

In [39]: df1 = DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))
In [41]: df2 = pd.DataFrame([[1,2,3,4]],columns=pd.MultiIndex.from_tuples([('a','a1'),('a','a2'),('b1','c1'),('b2','c2')],names=['b','c']))

This df1 is meant to be identical to the dataframe dg I introduced in my first message.

Except that it is not.

Issue 1: `df1` is not the same as `dg`!

On master, both df1.drop('a', axis=1) and df2.drop('a', axis=1) work. So does df1.groupby('a'), but not df2.groupby('a') since it is not 1-dimensional.

On the other hand, neither work for my weird dataframe dg obtained with .pivot_table and .reset.

Independently from what the API should or should not allow, there seems to be another bug: how is it possible that my dg and your df1 look the same, but do not behave the same?

Issue 2: what the API should or should not allow

As for the API, I'm not sure I understand what it should be.

What should, or should not, work, in the two following cases?

df1.groupby('a') and dg.groupby('a') (assuming df1 and dg are the same)
df1.drop('a', axis=1) and dg.drop('a', axis=1)

I don't think either would make the API more confusing:

it would seem natural to me that df1.drop('a', axis=1) work the same as df1.drop('a', axis=1, level=0)
it would also seem natural to me that df1.groupby('a') work if and only if df1.groupby(df1['a']) work

In short, I'm in favor of just solving the first issue, that is of making dg work exactly as df1 work now.

jreback · 2015-11-30T11:46:55Z

@nbonnotte you are making this discussion about too many things.
df1.drop('a',axis=1) the same as df1.drop('a',axis=1,level=0) is a much larger discussion and will not be handled here.

jreback · 2015-11-30T11:48:48Z

show a very specific copy-pastable example of where the constructed dg and df1 are not the same (not they are NOT exactly the same as the levels of the columns are different; this is expected).

nbonnotte · 2015-11-30T14:43:15Z

dg = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index()
df1 = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

If this discussion goes in too many directions, it's because I'm a bit confused about the direction to take.

What exactly should be the result of this pull request?

It can solve a little bug in pandas/core/index.py, with the consequence that dg will behave the same regarding to .drop and .groupby as how df1 already behaves today. This is what the PR in its present state does. The resulting API may be confusing, but it will not be more confusing that what it is today.
An exception with a clear message can be raised for dg.drop('a', axis=1) and dg.groupby('a'). It will allow nothing new compared with today, but the error message will be more explicit. But then, dg and df1, which are not the same but still for all intents and purpose are the same, will keep not behaving the same.
An exception can be raised both for df1 and dg. Then, some code that currently work for dg will not work any more. This is problematic, and I just mention this possibility to list all alternatives.

Should I go for 2) then?

jreback · 2015-12-01T02:45:01Z

So this is a bit more complicated. I believe the difference is because of the lex sorting depth

In [1]: df1 = pd.DataFrame([[1,3,4]],columns=pd.MultiIndex.from_tuples([('a',''),('b1','c1'),('b2','c2')],names=['b','c']))

In [4]: result = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]]).pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [5]: result.columns.lexsort_depth
Out[5]: 0

In [7]: df1.columns.lexsort_depth
Out[7]: 2
In [16]: pd.MultiIndex.from_tuples(result.columns.values)
Out[16]: 
MultiIndex(levels=[[u'a', u'b1', u'b2'], [u'', u'c1', u'c2']],
           labels=[[0, 1, 2], [0, 1, 2]])

If the index is rebuilt it is the same.

In [17]: pd.MultiIndex.from_tuples(result.columns.values).lexsort_depth
Out[17]: 2

Of this said, I don't actually think what I am saying here is a bug. These are both equally valid indexes.

So I have to think about this some more.

nbonnotte · 2016-01-16T08:18:27Z

@jreback have you had the time to think about this?

If there are doubts, I'd suggest going for my option 2, which is conservative:

An exception with a clear message can be raised for dg.drop('a', axis=1) and dg.groupby('a'). It will allow nothing new compared with today, but the error message will be more explicit. But then, dg and df1, which are not the same but still for all intents and purpose are the same, will keep not behaving the same.

so that we have a nice error message. Then we can create another issue about the (already existing) lack of consistency in the API.

jreback · 2016-01-16T17:32:30Z

pandas/core/index.py

                    inds.extend(lrange(loc.start, loc.stop))
+                elif is_bool_indexer(loc):


let's restrict this issue only to the match_axis_length issue

You're right.

I'll start an issue and a PR for the KeyError (bug 2 mentioned in the first message in this PR)

This last bug has since been reported as #11741

Solved with PR #12063

No KeyError was raised when grouping by a non-existant column Fixes #11741 Xref issue #11640, PR #11717

nbonnotte · 2016-01-17T10:05:56Z

@jreback I've rewritten the tests in such a way as to make the changes somewhat more compelling

I hope you'll be convinced this does not make the API more confusing.

jreback · 2016-01-17T19:06:48Z

@nbonnotte my point is this is not a bug

In [30]: df = pd.DataFrame(columns=['a', 'b', 'c', 'd'],data=[[1, 'b1', 'c1', 3],[1, 'b2', 'c2', 4]])

In [31]: df
Out[31]: 
   a   b   c  d
0  1  b1  c1  3
1  1  b2  c2  4

In [32]: df2 = df.pivot_table(index='a', columns=['b', 'c'], values='d')

In [33]: df2
Out[33]: 
b b1 b2
c c1 c2
a      
1  3  4

# this is exactly what it should do, 'a' is NOT in the index
In [34]: df2.groupby('a').mean()
KeyError: 'a'

In [35]: df2.groupby('a',level=0).mean()
Out[35]: 
b b1 b2
c c1 c2
a      
1  3  4

jreback · 2016-01-17T19:09:11Z

In [36]: df = pd.DataFrame(columns=['a','b','c','d'], data=[[1,'b1','c1',3], [1,'b2','c2',4]])

In [37]: 

In [37]: In [3]: df = df.pivot_table(index='a', columns=['b','c'], values='d').reset_index()

In [39]: df.groupby('a').mean()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-a830c6135818> in <module>()
----> 1 df.groupby('a').mean()

/Users/jreback/pandas/pandas/core/groupby.pyc in mean(self)
    908             self._set_selection_from_grouper()
    909             f = lambda x: x.mean(axis=self.axis)
--> 910             return self._python_agg_general(f)
    911 
    912     @Substitution(name='groupby')

/Users/jreback/pandas/pandas/core/groupby.pyc in _python_agg_general(self, func, *args, **kwargs)
    751                 output[name] = self._try_cast(values[mask], result)
    752 
--> 753         return self._wrap_aggregated_output(output)
    754 
    755     def _wrap_applied_output(self, *args, **kwargs):

/Users/jreback/pandas/pandas/core/groupby.pyc in _wrap_aggregated_output(self, output, names)
   3511     def _wrap_aggregated_output(self, output, names=None):
   3512         agg_axis = 0 if self.axis == 1 else 1
-> 3513         agg_labels = self._obj_with_exclusions._get_axis(agg_axis)
   3514 
   3515         output_keys = self._decide_output_index(output, agg_labels)

/Users/jreback/pandas/pandas/core/groupby.pyc in __getattr__(self, attr)
    457 
    458         raise AttributeError("%r object has no attribute %r" %
--> 459                              (type(self).__name__, attr))
    460 
    461     plot = property(GroupByPlot)

AttributeError: 'DataFrameGroupBy' object has no attribute '_obj_with_exclusions'

In [40]: df.groupby('a',level=0).mean()
Out[40]: 
b  a b1 b2
c    c1 c2
0  1  3  4

from your original issue [39] should simply give a more informative errors message (and raise a KeyError), not actually try to fix the issue when the level is not specified, which would be a major API change.

nbonnotte · 2016-01-17T19:21:17Z

@jreback my point is that, in the tests I wrote, df and reference are the same, in the sense of assert_frame_equal(df, reference), and yet today the same procedure raises an error for df and not for reference.

I'd be very happy to replace that error with a more explicit one, and start a new issue about the lack of consistency of the API. But first I just wanted to be sure things were clear enough ^^

jreback · 2016-01-17T19:36:17Z

you cannot change index code like that

just address the incorrect error message in the original issue

one issue per pr

nbonnotte · 2016-01-17T21:23:33Z

Here we go.

I don't know if I've chosen the best place to raise a meaningful error, but like this it should at least be conservative (that is, not break nor enable new things).

Likewise, I'll be glad to have any feedback on the comments. I hope they're clear enough.

jreback · 2016-01-17T21:29:04Z

no, don't try to fix anything in indexing. This is purely a groupby issue. You are still trying to make something that is an API change work.

jreback · 2016-01-17T21:29:15Z

remove all of the indexing code that you added.

nbonnotte · 2016-01-17T22:14:27Z

I'm not changing anything about the API. The result is just a better error message.

If I remove my changes in the indexing code, with some .groupby and .drop operations the method .get_loc() returns a boolean mask, which is not handled. So we get an AttributeError, which (when it is a .groupby) is caught and then stuff happen that should not, resulting in another obscur error.

So do you want me to instead catch that last obscur error at the top level, and replace it with a nice message?

Then I can create two other issues: one for .drop, and one for the inconsistency of the API.

jreback · 2016-01-17T22:25:23Z

this is a much simpler change

just step thru and catch it where it happens

nbonnotte · 2016-01-18T12:03:52Z

Ok.

But I would feel more comfortable working on this issue once the .drop problem has been fixed, so I'll start directly a new issue for that.

jreback · 2016-01-18T12:13:31Z

the 'drop' issue is an API change that would need much more testing and review

nbonnotte · 2016-01-18T12:18:58Z

I understand. I'm just talking about the obscur error message that happens in some cases when dropping, see #12078

closes #11640

nbonnotte · 2016-01-29T12:17:49Z

Following PR #12158, the bug has disappeared.

This pull request now only contains tests.

If there are problems with the API, I'd be happy to throw errors where need be. But maybe that should be for another issue?

closes pandas-dev#11640 closes pandas-dev#11717

jreback · 2016-01-29T14:41:09Z

@nbonnotte thanks!

jreback added Groupby Error Reporting Incorrect or improved errors from pandas labels Nov 29, 2015

jreback added this to the 0.18.0 milestone Nov 29, 2015

jreback reviewed Nov 29, 2015
View reviewed changes

jreback modified the milestones: Next Major Release, 0.18.0 Nov 30, 2015

jreback reviewed Jan 16, 2016
View reviewed changes

nbonnotte mentioned this pull request Jan 16, 2016

BUG in .groupby for single-row DF #12063

Closed

jreback pushed a commit that referenced this pull request Jan 17, 2016

BUG in .groupby for single-row DF, #11741

e9e8598

No KeyError was raised when grouping by a non-existant column Fixes #11741 Xref issue #11640, PR #11717

nbonnotte mentioned this pull request Jan 18, 2016

Obscur AttributeError when dropping on a multi-index dataframe #12078

Closed

TST drop and groupby on dataframes with non-lexsorted multi-index

3216b6d

closes #11640

nbonnotte changed the title ~~BUG in .drop and .groupby for dataframes with multi-indexed columns~~ TST in .drop and .groupby for dataframes with multi-indexed columns Jan 29, 2016

jreback added this to the 0.18.0 milestone Jan 29, 2016

jreback pushed a commit to jreback/pandas that referenced this pull request Jan 29, 2016

TST drop and groupby on dataframes with non-lexsorted multi-index

001dab6

closes pandas-dev#11640 closes pandas-dev#11717

jreback closed this in b291dd6 Jan 29, 2016

nbonnotte deleted the attribute-error-11640 branch January 29, 2016 14:49

austinauyeung mentioned this pull request Aug 1, 2023

BUG: pivot_table mean of integer input casted back to int #54263

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TST in .drop and .groupby for dataframes with multi-indexed columns #11717

TST in .drop and .groupby for dataframes with multi-indexed columns #11717

nbonnotte commented Nov 28, 2015

jreback Nov 29, 2015

nbonnotte commented Nov 30, 2015

jreback commented Nov 30, 2015

jreback commented Nov 30, 2015

nbonnotte commented Nov 30, 2015

jreback commented Dec 1, 2015

nbonnotte commented Jan 16, 2016

jreback Jan 16, 2016

nbonnotte Jan 16, 2016

nbonnotte Jan 16, 2016

nbonnotte Jan 16, 2016

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 18, 2016

jreback commented Jan 18, 2016

nbonnotte commented Jan 18, 2016

nbonnotte commented Jan 29, 2016

jreback commented Jan 29, 2016

		inds.extend(lrange(loc.start, loc.stop))
		elif is_bool_indexer(loc):

TST in .drop and .groupby for dataframes with multi-indexed columns #11717

TST in .drop and .groupby for dataframes with multi-indexed columns #11717

Conversation

nbonnotte commented Nov 28, 2015

Bug 1 [edit: issues #11640 and #12078, fixed by PR #12158 but not entirely tested]

Bug 2 [edit: #11741, solved by PR #12063]

jreback Nov 29, 2015

Choose a reason for hiding this comment

nbonnotte commented Nov 30, 2015

Issue 1: df1 is not the same as dg!

Issue 2: what the API should or should not allow

jreback commented Nov 30, 2015

jreback commented Nov 30, 2015

nbonnotte commented Nov 30, 2015

jreback commented Dec 1, 2015

nbonnotte commented Jan 16, 2016

jreback Jan 16, 2016

Choose a reason for hiding this comment

nbonnotte Jan 16, 2016

Choose a reason for hiding this comment

nbonnotte Jan 16, 2016

Choose a reason for hiding this comment

nbonnotte Jan 16, 2016

Choose a reason for hiding this comment

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 17, 2016

jreback commented Jan 17, 2016

nbonnotte commented Jan 18, 2016

jreback commented Jan 18, 2016

nbonnotte commented Jan 18, 2016

nbonnotte commented Jan 29, 2016

jreback commented Jan 29, 2016

Issue 1: `df1` is not the same as `dg`!