-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: iterating on a subset of columns in a GroupBy object (#44821) #44947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…#44821) Fixes issue pandas-dev#44821. When trying to iterate on a subset of columns in a GroupBy object, it returned all columns, instead of the selected subset. GroupBy.__iter__ used self.obj instead of self._selected_obj (see PR pandas-dev#6570).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please run the tests locally first. This broke lots of tests
Hi. Only one test is broken, which I mentioned here: #44821 (comment) As mentioned there, I'm not sure how to continue, as users may rely on current behavior. By the way, should I could continue the discussion in this PR? |
Fixes test due to changes in GroupBy.__iter__ (see pandas-dev#44821). As the column `c` wasn't selected on the manual computation, it failed when trying to set it as an index.
Well, I decided to fix that test. While this PR changes So, although this PR might break somebody's code, this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok this is reasonable, can you add a whatsnew note in 1.4, groupby bug fix section. you may want to give a mini-example of what the change is
cc @rhshadrach for comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small nitpick, otherwise I think this is good.
There is an oddity with _selected_obj, but I think this change is definitely an improvement over current behavior and it is really independent, so still +1. Once you subset the columns of a groupby (e.g. ['b']
or [['b']]
), the selected_obj will not include the grouper. However, before you do this, it will unless the context is active. I.e.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
gb = df.groupby('a')
print(list(gb.grouper.get_iterator(gb._selected_obj, axis=gb.axis)))
print()
with gb._group_selection_context():
print(list(gb.grouper.get_iterator(gb._selected_obj, axis=gb.axis)))
print()
selected_gb = gb[['b']]
print(list(selected_gb.grouper.get_iterator(selected_gb._selected_obj, axis=gb.axis)))
produces
[(1, a b c
0 1 3 5), (2, a b c
1 2 4 6)]
[(1, b c
0 3 5), (2, b c
1 4 6)]
[(1, b
0 3), (2, b
1 4)]
In the first print, the grouper 'a' is present. In the 2nd and 3rd it isn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
thanks @maurosilber |
Fixes issue #44821.
When trying to iterate on a subset of columns in a GroupBy object,
it returned all columns, instead of the selected subset.
GroupBy.iter used self.obj instead of self._selected_obj (see
PR #6570).