API: Clarify difference between agg and apply for Series / DataFrame #49673

Open
rhshadrach opened this issue Nov 12, 2022 · 2 comments
Labels
API Design; Apply (Apply, Aggregate, Transform, Map); DataFrame (DataFrame data structure); Deprecate (Functionality to remove in pandas); Needs Discussion (Requires discussion from core team before further action); Series (Series data structure)

Comments

@rhshadrach
Member

rhshadrach commented Nov 12, 2022

Currently the use of Series.apply / Series.agg and DataFrame.apply / DataFrame.agg is confusing. In particular, sometimes the user calls apply and gets the results of agg or vice-versa:

  • apply with list- or dict-like arguments calls agg.
  • DataFrame.agg with a UDF calls DataFrame.apply.
  • Series.agg with a UDF calls Series.apply, and if this fails, attempts to pass the Series to the UDF.
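A minimal sketch of the first bullet, assuming a pandas version where apply still hands list-like arguments to the agg code path (the behavior described above). The variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A list-like argument: apply is routed through agg, so both calls
# produce one row per function name and one column per original column.
applied = df.apply(["sum", "max"])
aggregated = df.agg(["sum", "max"])

# Compare the contents rather than relying on any particular repr.
same_result = applied.to_dict() == aggregated.to_dict()
print(same_result)
```

The UDF cases in the other two bullets depend on the pandas version in use, so they are left out of this sketch.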

If we are to change the current behavior, it will need to go through deprecation. This will be a bit tricky with the way the code paths switch between agg and apply, but I believe it can be done (see #49672 (comment)).

In order to clarify the difference between agg and apply for users, I propose the following for a single argument:

  • (unchanged) DataFrame.apply will apply the function to each Series, the result shape will be inferred from the output.
  • (unchanged) DataFrame.applymap will apply the function to each cell.
  • (unchanged) Series.apply will apply the function to each row.
  • (changed) DataFrame.agg will act on each Series that makes up the DataFrame, the result will always be a Series. Currently the result shape is inferred from the output.
  • (changed) Series.agg will act on the Series, and the result will be whatever the function returns. Currently apply is tried first and only when that fails will agg act on the Series.
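The two (changed) bullets can be sketched with plain helper functions. This is only an illustration of the proposed semantics, not the implementation in #49672; `proposed_dataframe_agg` and `proposed_series_agg` are hypothetical names that do not exist in pandas:

```python
import numpy as np
import pandas as pd


def proposed_dataframe_agg(df, func):
    """Hypothetical helper: call func once per column, always return a Series."""
    values = np.empty(len(df.columns), dtype=object)
    for i, (_, column) in enumerate(df.items()):
        # Whatever func returns (scalar, list, Series) becomes a single entry.
        values[i] = func(column)
    return pd.Series(values, index=df.columns)


def proposed_series_agg(ser, func):
    """Hypothetical helper: pass the whole Series to func, return its result as-is."""
    return func(ser)


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

as_lists = proposed_dataframe_agg(df, list)           # entries are lists
sums = proposed_dataframe_agg(df, lambda s: s.sum())  # entries are scalars
seen_type = proposed_series_agg(df["a"], lambda s: type(s).__name__)
```

The object-dtype array is used here so that a list or Series returned by `func` is stored as one entry instead of being expanded; the result dtype of the real proposal may differ.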

And for multiples:

  • (changed) When given a list-like or dict-like, agg will call agg for each argument and apply will call apply. Currently apply will call agg in this case.

I've put up #49672 to show the implementation and the impact on our tests. Some examples:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# For reducers, apply and agg act the same on a DataFrame

print(df.apply(str))
# a    0    1\n1    2\n2    3\nName: a, dtype: int64
# b    0    4\n1    5\n2    6\nName: b, dtype: int64
# dtype: object

print(df.agg(str))
# a    0    1\n1    2\n2    3\nName: a, dtype: int64
# b    0    4\n1    5\n2    6\nName: b, dtype: int64
# dtype: object

# apply sees a Series output as not being a reducer, combines results with `concat(..., axis=1)`
# agg treats everything as a reducer. The result is a Series whose entries are themselves Series.

print(df.apply(lambda x: pd.concat([x, x])))
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6
# 0  1  4
# 1  2  5
# 2  3  6

print(df.agg(lambda x: pd.concat([x, x])))
# a    0    1
# 1    2
# 2    3
# 0    1
# 1    2
# 2    3
# Name...
# b    0    4
# 1    5
# 2    6
# 0    4
# 1    5
# 2    6
# Name...
# dtype: object

# apply sees list output as not being a reducer, makes them into columns of the result (a no-op in this case)
# agg treats everything as a reducer

print(df.apply(lambda x: list(x)))
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6

print(df.agg(lambda x: list(x)))
# a    [1, 2, 3]
# b    [4, 5, 6]
# dtype: object
rhshadrach added the API Design, Needs Discussion, Apply, DataFrame, Series, and Deprecate labels on Nov 12, 2022
@topper-123
Contributor

topper-123 commented Nov 13, 2022

I very much agree. Having df.agg always return a Series if func is a single argument would be a win in terms of users understanding their code, and it makes the internal organization of the pandas code base clearer; that's a double win :-).

@jorisvandenbossche
Member

One tangential note: with DataFrame.apply working on each column/Series, and DataFrame.applymap working on each element of each column/Series, you could also say that the equivalent of applymap for Series is Series.map and not Series.apply.

(In any case, I have always found the difference between Series.apply and Series.map confusing as well.)
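A small sketch of that correspondence, assuming a plain scalar function. `double` is an illustrative name; the `hasattr` check is there because DataFrame.applymap was renamed to DataFrame.map in pandas 2.1:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


def double(value):
    # Illustrative element-wise function.
    return value * 2


# Use DataFrame.map where it exists (pandas >= 2.1), else DataFrame.applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
by_cell = elementwise(double)

# The same result built column by column with Series.map.
by_column_map = df.apply(lambda column: column.map(double))

# For a plain scalar function, Series.apply and Series.map agree,
# which is part of why the two are easy to confuse.
apply_matches_map = df["a"].apply(double).equals(df["a"].map(double))
```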
