Skip to content

API: Clarify difference between agg and apply for Series / DataFrame #49672

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from

Conversation

rhshadrach
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Copy link
Contributor

@topper-123 topper-123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the changed tests imply changed behavior of apply or agg? If it does, this needs a whatsnew entry.


# caller can react
return None

def agg_udf(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "udf" here a shorthand for? Can you rename to something more explicit and/or add a doc string & type hints to help better understand this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

User Defined Function; I'd prefer sticking with the name but completely agree it should be expanding upon in a docstring.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@topper-123
Copy link
Contributor

topper-123 commented Nov 13, 2022

Is DataFrame.agg able to have a return type hint now? I.e. return type Series|DataFrame with overloading so that if func is listlike a DataFrame is returned else a Series is returned?

@rhshadrach
Copy link
Member Author

Does the changed tests imply changed behavior of apply or agg? If it does, this needs a whatsnew entry.

Yes - it very much changes their behavior in various cases. I think one could make the argument that the inconsistency in the current behavior rises to the level of a bug, but because the difference between agg / apply was never made clear before, I think this behavior needs to be deprecated first before changing. This PR was meant just to demonstrate what the implementation of the proposal in #49673 would look like.

Is DataFrame.agg able to have a return type hint now? I.e. return type Series|DataFrame with overloading so that if func is listlike a DataFrame is returned else a Series is returned?

Yes - we can guarantee list-likes will return DataFrame and strings / callables will return Series.

@topper-123
Copy link
Contributor

Ok great, do you have a plan how to do the deprecation? Seems a bit difficult to do cleanly.

@rhshadrach
Copy link
Member Author

Ok great, do you have a plan how to do the deprecation? Seems a bit difficult to do cleanly.

Yes - I've thought about this and it is quite difficult. So I wouldn't be surprised if what I have come up with missing some aspect(s).

For a single argument:

  • For DataFrame.agg, we need to warn precisely when the result is a DataFrame. When the result is a Series, it's because the provided function is a aggregation, in which case agg and apply behave the same (now and in the future).
  • For Series.agg, we only need to warn with a UDF and in that case only when the internal call to Series.apply is successful.
  • For DataFrame.apply and Series.apply, there is no behavior change.

For multiple arguments, we will need to use warnings.catchwarnings(record=True) and interrogate what warnings have been generated for each function / column combination, summarizing them appropriately.

  • For Series.agg and DataFrame.agg, we warn precisely when there is at least one warning generated by the single function case. These methods currently use agg on each function / column combination, and that will remain unchanged, so only changes in each individual results will surface an overall change in the future.
  • For DataFrame.apply, we warn for a UDF.
    • Non-UDF case: this is currently used with agg and in the future will be used with apply. Behavior of non-UDFs with agg today will match behavior of non-UDFs with apply in the future.
    • UDF case: Currently apply calls agg, which proceeds to split up the DataFrame into Series, calling agg with each argument. With a UDF, agg then calls apply on the Series, which applies the UDF to each row of the Series separately. In the future, apply will apply the UDF to each column of the DataFrame. While for certain ops these can coincide, in general they will be different.
  • For Series.apply, we never warn.

If DataFrame.agg or Series.agg warns a user, we can tell them to switch to apply to keep current behavior. For multiples, it may be necessary for the user to split up their desired operation to avoid warnings. To keep the behavior of DataFrame.apply in the multiples case, the user will need to use DataFrame.applymap.

@topper-123
Copy link
Contributor

Ok, yes it is indeed difficult. You're saying that "For DataFrame.agg, we need to warn precisely when the result is a DataFrame...". Does this imply that things will keep working the old way, and users will just get a warning? How will users get to use the new (IMO better) way? A flag somewhere?

@rhshadrach
Copy link
Member Author

rhshadrach commented Nov 15, 2022

How will users get to use the new (IMO better) way? A flag somewhere?

I think for all ops, users will have a way of changing their code to retain current behavior that won't break when we make the change and we'll be able to warn them. I think that's sufficient for deprecation from a backwards compatibility point of view. Without a flag, they wouldn't have a way to get the new behavior. We could implement it with a flag - I'm not opposed to this, but also don't see it at necessary (might still be nice to do).

@mroeschke mroeschke added the Apply Apply, Aggregate, Transform, Map label Nov 16, 2022
@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Dec 17, 2022

from pandas import Series

result = Series(results, index=keys)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we be using something._constructor here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could be using selected_obj._constructor_sliced

@simonjayhawkins
Copy link
Member

@rhshadrach is this still draft or ready for review. also some conflicts to resolve.

@rhshadrach
Copy link
Member Author

@simonjayhawkins - this was just to demonstrate what the changes look like for #49673; will need to deprecate first. I'll close this and put up one for the deprecation.

@rhshadrach rhshadrach closed this Feb 24, 2023
@rhshadrach rhshadrach deleted the aggs_all_the_way_down branch November 26, 2023 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants