-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
TST in .drop and .groupby for dataframes with multi-indexed columns #11717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
inds.extend(lrange(loc.start, loc.stop)) | ||
else: # loc is a mask | ||
loc = [i for i, val in enumerate(loc) if val] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loc = loc.nonzero()[0]
and do elsif is_bool_indexer(loc)
then make the else
raise an AssertionError
(meaning we don't expect to hit this and its a bug)
@jreback I go back here because I've spotted something really, really wrong, and I don't want my comment to disappear when the commit gets outdated. For the completeness of the discussion, I'll recall you just defined two dataframes
This Except that it is not. Issue 1:
|
@nbonnotte you are making this discussion about too many things. |
show a very specific copy-pastable example of where the constructed |
If this discussion goes in too many directions, it's because I'm a bit confused about the direction to take. What exactly should be the result of this pull request?
Should I go for 2) then? |
So this is a bit more complicated. I believe the difference is because of the lex sorting depth
If the index is rebuilt it is the same.
Of this said, I don't actually think what I am saying here is a bug. These are both equally valid indexes. So I have to think about this some more. |
@jreback have you had the time to think about this? If there are doubts, I'd suggest going for my option 2, which is conservative:
so that we have a nice error message. Then we can create another issue about the (already existing) lack of consistency in the API. |
inds.extend(lrange(loc.start, loc.stop)) | ||
elif is_bool_indexer(loc): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's restrict this issue only to the match_axis_length
issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right.
I'll start an issue and a PR for the KeyError (bug 2 mentioned in the first message in this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This last bug has since been reported as #11741
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solved with PR #12063
@jreback I've rewritten the tests in such a way as to make the changes somewhat more compelling I hope you'll be convinced this does not make the API more confusing. |
@nbonnotte my point is this is not a bug
|
from your original issue |
@jreback my point is that, in the tests I wrote, I'd be very happy to replace that error with a more explicit one, and start a new issue about the lack of consistency of the API. But first I just wanted to be sure things were clear enough ^^ |
you cannot change index code like that just address the incorrect error message in the original issue one issue per pr |
Here we go. I don't know if I've chosen the best place to raise a meaningful error, but like this it should at least be conservative (that is, not break nor enable new things). Likewise, I'll be glad to have any feedback on the comments. I hope they're clear enough. |
no, don't try to fix anything in indexing. This is purely a groupby issue. You are still trying to make something that is an API change work. |
remove all of the indexing code that you added. |
I'm not changing anything about the API. The result is just a better error message. If I remove my changes in the indexing code, with some So do you want me to instead catch that last obscur error at the top level, and replace it with a nice message? Then I can create two other issues: one for |
this is a much simpler change just step thru and catch it where it happens |
Ok. But I would feel more comfortable working on this issue once the |
the 'drop' issue is an API change that would need much more testing and review |
I understand. I'm just talking about the obscur error message that happens in some cases when dropping, see #12078 |
Following PR #12158, the bug has disappeared. This pull request now only contains tests. If there are problems with the API, I'd be happy to throw errors where need be. But maybe that should be for another issue? |
@nbonnotte thanks! |
Bug 1 [edit: issues #11640 and #12078, fixed by PR #12158 but not entirely tested]
There was a bug deeply hidden in
pandas/core/index.py
, where it was assumed that the.get_loc
method of an index would only return an integer or a slice, while it can also return a mask.Simply correcting that bug has two consequences:
df
with multi-indexed columns,df.drop('a', axis=1)
works ifdf['a']
returns something, e.g. if the column('a', '')
existsdf.groupby('a')
also works in that caseExample:
Bug 2 [edit: #11741, solved by PR #12063]
I also correct a little bug I mentioned in the talk on issue #11640:
.groupby
failed to raise aKeyError
for a non-existing column when there was only one row:Now we do have: