CLN: _wrap_applied_output #35412

Closed
wants to merge 5 commits into from

rhshadrach
Member

The majority of this change is the result of three operations:

  1. Move code that returns up top, to avoid the heavily nested structure.
  2. Move variable creation as close as possible to where each variable is used. Previously, in certain cases, computations were performed and their results went unused.
  3. Combine duplicated code.
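As a rough illustration of the first two operations, here is a minimal sketch in plain Python. The function names and the data are hypothetical and are not taken from pandas; the sketch only shows the general shape of the change (early returns instead of nesting, and creating a variable next to its use).

```python
def summarize_nested(values):
    """Nested shape: work is done eagerly and the returns sit deep inside."""
    # Computed up front, even on paths that never use it.
    total = sum(v for v in values if v is not None)
    if values:
        first = next((v for v in values if v is not None), None)
        if first is not None:
            return {"first": first, "total": total}
        else:
            return {}
    else:
        return {}


def summarize_flat(values):
    """Flat shape: return early, then compute only what the remaining path needs."""
    if not values:
        return {}
    first = next((v for v in values if v is not None), None)
    if first is None:
        return {}
    # Created right before its only use.
    total = sum(v for v in values if v is not None)
    return {"first": first, "total": total}
```

Both functions return the same results; the flat version simply avoids the unused computation on the early-exit paths and keeps the main path at a single indentation level.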

There are two sections that I was able to entirely remove:

ping = self.grouper.groupings[0]
if len(keys) == ping.ngroups:
    key_index = ping.group_index
    key_index.name = key_names[0]

    key_lookup = Index(keys)
    indexer = key_lookup.get_indexer(key_index)

    # reorder the values
    values = [values[i] for i in indexer]

    # update due to the potential reorder
    first_not_none = next(com.not_none(*values), None)

and

# GH5788 instead of stacking; concat gets the
# dtypes correct
from pandas.core.reshape.concat import concat

result = concat(
    values,
    keys=key_index,
    names=key_index.names,
    axis=self.axis,
).unstack()
result.columns = index
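For readers unfamiliar with the indexer pattern in the first removed section above, the following is a stdlib-only analogue, not pandas' implementation. The helper `get_positions` and the sample data are hypothetical; the helper mimics the idea of looking up each target's position in a list of keys, with -1 marking a missing key.

```python
def get_positions(lookup, targets):
    """Return, for each target, its position in lookup (or -1 if absent)."""
    position_of = {key: i for i, key in enumerate(lookup)}
    return [position_of.get(target, -1) for target in targets]


# Hypothetical data: results were produced in the order of `keys`,
# but the grouping wants them in `group_order`.
keys = ["b", "c", "a"]
values = [20, 30, 10]
group_order = ["a", "b", "c"]

indexer = get_positions(keys, group_order)

# Reorder the values to follow group_order. Note that a -1 entry would
# silently pick the last element of a Python list, so this step is only
# safe when every target is present in keys.
reordered = [values[i] for i in indexer]
```

In the removed code, the reorder only ran when `len(keys) == ping.ngroups`, and `first_not_none` was recomputed afterwards because the first non-None value can change once the order changes.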

The single test that I touched was checking that the categorical dtype of an index was dropped after a groupby. I don't believe that is the correct behavior; I think the categorical dtype should remain. I checked groupby/categorical bugs and didn't find any issues that this closes.
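To make the distinction concrete, here is a small sketch of the two kinds of index the test could expect. It assumes pandas is installed and only constructs the indexes directly; it does not run a groupby, so it says nothing about what any particular pandas version returns from `apply`.

```python
import pandas as pd

# Index that keeps the categorical dtype, including the unused category 3.
categorical_index = pd.CategoricalIndex([1, 2], categories=[1, 2, 3], name="B")

# Index with the categorical dtype dropped.
plain_index = pd.Index([1, 2], name="B")

is_categorical = isinstance(categorical_index.dtype, pd.CategoricalDtype)
plain_is_categorical = isinstance(plain_index.dtype, pd.CategoricalDtype)
kept_categories = list(categorical_index.categories)
```

Both indexes hold the same labels, but only the first remembers the full set of categories, which is the information lost when the dtype is dropped.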

@rhshadrach added the Apply (Apply, Aggregate, Transform, Map), Clean, and Groupby labels on Jul 30, 2020
# update due to the potential reorder
first_not_none = next(com.not_none(*values), None)
else:
if not isinstance(v, (np.ndarray, Index, Series)):
Member

Can this just be an else statement? Or are there more types we handle than these + NDFrame?

Member Author

I think you're suggesting something like:

if isinstance(v, (np.ndarray, Index, Series)):
    ...
else:
    ...

The reason I have opted not to do this is that the if-block is exceedingly long, whereas the else-block is quite short. Doing it this way would result in a more nested rather than flat structure.

Member Author

Ack - sorry, I see what you're saying now. Ignore my previous response, will investigate.

Member

@WillAyd left a comment

Thanks for refactoring this. These code paths, as I'm sure you've noticed, are thorny, so cleanups here are much appreciated.

@@ -868,13 +868,15 @@ def test_apply_multi_level_name(category):
    b = [1, 2] * 5
    if category:
        b = pd.Categorical(b, categories=[1, 2, 3])
        expected_index = pd.CategoricalIndex([1, 2], categories=[1, 2, 3], name="B")
Member

Is there an open issue for this?

Member Author

I don't believe so. There is only one issue tagged with categorical, groupby, and apply, and it is not relevant. I also looked through the issues tagged with categorical and groupby and didn't see anything there either.

Member

Can you add a whatsnew entry for this? Something along the lines of "groupby apply will now maintain a CategoricalIndex" (assuming that is now the case).

@rhshadrach
Member Author

Codecov is finding a lot in this method that isn't hit. I'm going to see what tests can be added or branches that can be removed.

@rhshadrach marked this pull request as draft on August 14, 2020 15:48
@rhshadrach force-pushed the apply_output_cleanup branch 2 times, most recently from c871a9f to 5efa563, on August 15, 2020 12:35
@rhshadrach force-pushed the apply_output_cleanup branch from 5efa563 to 059405c on August 17, 2020 20:21
Member

@WillAyd left a comment

Cool, I think this is a nice refactor. @jreback

@jbrockmendel
Member

> The majority of this change is the result of three operations:

To the extent that you can split this into independent pieces, it will be easier to review.

@rhshadrach closed this on Aug 18, 2020
@rhshadrach mentioned this pull request on Sep 1, 2020
5 tasks
@rhshadrach deleted the apply_output_cleanup branch on September 10, 2020 00:07
Labels
Apply (Apply, Aggregate, Transform, Map), Clean, Groupby
3 participants