Skip to content

BUG: iterating on a subset of columns in a GroupBy object (#44821) #44947

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 23, 2021

Conversation

maurosilber
Copy link
Contributor

@maurosilber maurosilber commented Dec 17, 2021

Fixes issue #44821.

When trying to iterate on a subset of columns in a GroupBy object,
it returned all columns, instead of the selected subset.

GroupBy.iter used self.obj instead of self._selected_obj (see
PR #6570).

…#44821)

Fixes issue pandas-dev#44821.

When trying to iterate on a subset of columns in a GroupBy object,
it returned all columns, instead of the selected subset.

GroupBy.__iter__ used self.obj instead of self._selected_obj (see
PR pandas-dev#6570).
Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run the tests locally first. This broke lots of tests

@maurosilber
Copy link
Contributor Author

maurosilber commented Dec 19, 2021

Hi.

Only one test is broken, which I mentioned here: #44821 (comment)

As mentioned there, I'm not sure how to continue, as users may rely on current behavior.

By the way, should I could continue the discussion in this PR?

Fixes test due to changes in GroupBy.__iter__ (see pandas-dev#44821).

As the column `c` wasn't selected on the manual computation,
it failed when trying to set it as an index.
@maurosilber
Copy link
Contributor Author

Well, I decided to fix that test. While this PR changes GroupBy.__iter__, it didn't actually broke any test that tested its behavior. The test that was previously failing concerned the behaviour of GroupBy.rolling, and the fix was in the "manual computation" to compare against.

So, although this PR might break somebody's code, this GroupBy behavior wasn't "promised" anywhere, and it could have been changed as part of #6570.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is reasonable, can you add a whatsnew note in 1.4, groupby bug fix section. you may want to give a mini-example of what the change is

@jreback
Copy link
Contributor

jreback commented Dec 19, 2021

cc @rhshadrach for comments

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nitpick, otherwise I think this is good.

There is an oddity with _selected_obj, but I think this change is definitely an improvement over current behavior and it is really independent, so still +1. Once you subset the columns of a groupby (e.g. ['b'] or [['b']]), the selected_obj will not include the grouper. However, before you do this, it will unless the context is active. I.e.

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
gb = df.groupby('a')
print(list(gb.grouper.get_iterator(gb._selected_obj, axis=gb.axis)))
print()
with gb._group_selection_context():
    print(list(gb.grouper.get_iterator(gb._selected_obj, axis=gb.axis)))
    print()
selected_gb = gb[['b']]
print(list(selected_gb.grouper.get_iterator(selected_gb._selected_obj, axis=gb.axis)))

produces

[(1,    a  b  c
0  1  3  5), (2,    a  b  c
1  2  4  6)]

[(1,    b  c
0  3  5), (2,    b  c
1  4  6)]

[(1,    b
0  3), (2,    b
1  4)]

In the first print, the grouper 'a' is present. In the 2nd and 3rd it isn't.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback added this to the 1.4 milestone Dec 23, 2021
@jreback jreback merged commit 010989d into pandas-dev:master Dec 23, 2021
@jreback
Copy link
Contributor

jreback commented Dec 23, 2021

thanks @maurosilber

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Inconsistent behaviour in DataFrameGroupBy when selecting a subset of columns
4 participants