Skip to content

REF: de-duplicate iterate_slices #51186

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Feb 6, 2023 · 2 comments · Fixed by #51398
Closed

REF: de-duplicate iterate_slices #51186

jbrockmendel opened this issue Feb 6, 2023 · 2 comments · Fixed by #51398
Assignees
Labels
Clean Groupby Internals Related to non-user accessible pandas implementation
Milestone

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Feb 6, 2023

# From DataFrameGroupBy
    def _iterate_slices(self) -> Iterable[Series]:
        obj = self._selected_obj
        if self.axis == 1:
            obj = obj.T

        if isinstance(obj, Series) and obj.name not in self.exclusions:
            # Occurs when doing DataFrameGroupBy(...)["X"]
            yield obj
        else:
            for label, values in obj.items():
                if label in self.exclusions:
                    continue

                yield values

Iterating over self._selected_obj but then excluding self.exclusons seems a lot like just iterating over self._obj_with_exclusions. I checked that using obj_with_exclusions does break things, but didn't figure out why.

it'd be nice if we could simplify this, or at least document why we need a third thing. cc @rhshadrach

If it were just obj_with_exclusions, then in DataFrameGroupBy._indexed_output_to_ndframe we could just use self._obj_with_exclusions.columns instead of re-constructing an Index (i think this re-construction is largely leftover from a time when we dropped nuisance columns)

Expected Behavior

NA

Installed Versions

Replace this line with the output of pd.show_versions()

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 6, 2023
@rhshadrach rhshadrach added Groupby Internals Related to non-user accessible pandas implementation Clean and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 6, 2023
@rhshadrach rhshadrach self-assigned this Feb 9, 2023
@rhshadrach
Copy link
Member

rhshadrach commented Feb 15, 2023

I'm seeing 1 test in groupby fail due to this behavior

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 3, 4], 'c': [5, 6, 7]})
gb = df.groupby(['a', 'b'])
result = gb.sum()
print(result)
#       c
# a b    
# 1 3  11
# 2 4   7

gb2 = gb[['a', 'b', 'c']]
result2 = gb2.sum()
print(result2)
#      a  b   c
# a b          
# 1 3  2  6  11
# 2 4  2  4   7

In the 2nd example, 'a' and 'b' are in the exclusions but this is ignored since _obj_with_exclusions is determined via _selection in this case:

pandas/pandas/core/base.py

Lines 214 to 228 in 74d5c19

def _obj_with_exclusions(self):
if isinstance(self.obj, ABCSeries):
return self.obj
if self._selection is not None:
return self.obj._getitem_nocopy(self._selection_list)
if len(self.exclusions) > 0:
# equivalent to `self.obj.drop(self.exclusions, axis=1)
# but this avoids consolidating and making a copy
# TODO: following GH#45287 can we now use .drop directly without
# making a copy?
return self.obj._drop_axis(self.exclusions, axis=1, only_slice=True)
else:
return self.obj

When using .sum(), we never hit _iterate_slices. Within agg, _iterate_slices causes inconsistent behavior:

result3 = gb2.agg(lambda x: x.sum())
print(result3)
#       c
# a b
# 1 3  11
# 2 4   7

To get consistent behavior, I think we can just replace _selected_obj with _obj_with_exclusions and remove any use of exclusions.

@jbrockmendel
Copy link
Member Author

nice! thanks for tracking this down

@rhshadrach rhshadrach added this to the 2.1 milestone Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Groupby Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants