Skip to content

API: Should apply be smart? #39209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rhshadrach opened this issue Jan 16, 2021 · 10 comments
Open

API: Should apply be smart? #39209

rhshadrach opened this issue Jan 16, 2021 · 10 comments
Labels
API Design Apply Apply, Aggregate, Transform, Map Refactor Internal refactoring of code

Comments

@rhshadrach
Copy link
Member

From #34998 (comment)

..., this got me a bit thinking about the general issue we are tackling here: should apply() be smart and try to infer what kind of function is being applied? Or should it not try to be smart and always do the same (eg always prepend groups by default).

For the specific case where the applied function is a transformation (preserving identical index), the consequence of the changes we decided to do in this PR, is that we want apply to no longer be smart.

But currently (and this is long-time behaviour), apply tries to be smart and this is also documented. In the "flexible apply" section in the docs, there is a note that says:

apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it.
So depending on the path taken, and exactly what you are grouping. Thus the grouped column(s) may be included in
the output as well as set the indices.

(So if we go forward as is with this PR, that note should actually be updated.)

The consequence of this PR is that apply will no longer try to detect if it acted as a transformer or not (at least by default with group_keys=True, as group_keys=False still gives this reindexing behaviour for a transformer as discussed above).

But if we no longer detect the transformer case, should we then also stop detecting the reducer case?

IMO, groupby.apply should not be a one-stop-shop for all your UDF needs. If users want to use a reducer, use agg; if they want to use a transformer, use transform; if they want to use a filter, use filter. The role of apply, to me, should be for UDFs that don't fit into these roles. In this sense, I think apply should not go out of its way to generate more convenient/sensible output for these classes of UDFs.

The result of a groupby is a mapping from each group to a python object. Whatever behavior is decided for apply, it should be well-defined and entire with no more assumptions than that. Here is a naive attempt at doing so.

  • If all constituent results are Series/DataFrames, concatenate results together.
  • Otherwise, the result is a Series where each value is what was returned from the UDF; dtype is inferred.
  • keyword-only arg on apply (not groupby) to control if the groups are added to the index, columns, or neither.

Perhaps numpy-specific support should be added in. Probably things need to be added to support UDFs that reach groupby.apply via resample/window, but I'm not familiar enough with that code to reason about it.

@rhshadrach rhshadrach added Bug Needs Triage Issue that has not been reviewed by a pandas team member API Design Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 16, 2021
@mroeschke
Copy link
Member

Just for reference, rolling.apply and groupby.rolling.apply only accepts reducers (but in general the rolling operation acts as transform)

+1 in general for making groupby.apply be "less smart". Maybe to aid the discussion, might be good to list the behaviors of groupby.apply that should be deprecated because its redundant with other methods.

@TomAugspurger
Copy link
Contributor

And it's worth discussing if we should provide alternative APIs for one of .apply's strength: that it operates on DataFrames rather than columns. Would we be interested in something like DataFrameGroupBy.agg_table, where the callable takes a DataFrame an aggregated result? And a .transform_table?

@jreback
Copy link
Contributor

jreback commented Jan 17, 2021

-1 on any additional api
we already have too much
+1 on stricter apply

@rhshadrach
Copy link
Member Author

@TomAugspurger DataFrame.transform may also work on frames:

    # Two possible ways to use a UDF - apply or call directly
    try:
        return obj.apply(func, args=args, **kwargs)
    except Exception:
        return func(obj, *args, **kwargs)

It seems to me support for column and/or frame ops in apply/agg/transform/filter should be consistent. If we are to support both (and I think we should), it should be through user-passed arguments (e.g. by_column=True/False) or separate methods instead of inference (try-A-then-try-B as in the transform code above). This would be expanding the API, but my intent here is to make maintenance easier. IMO, try-A-then-try-B can be a source of regressions, performance degradation, makes it harder to reason about the code, and makes errors messages less helpful.

@mroeschke: Great idea - here are the two that I see:

  • Try operating on all columns, if this fails, try again after removing the grouping columns.
  • Shaping the result differently based on mutated/not_indexed_same (e.g. in groupby.ops.apply / groupby.generic._wrap_applied_output)

@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map Refactor Internal refactoring of code and removed Needs Discussion Requires discussion from core team before further action labels Aug 15, 2021
@kwhkim
Copy link
Contributor

kwhkim commented Mar 31, 2022

I'd like to add my opinion in this discussion.

I am familiar with R and I understand pandas was originated from R(? or at least inspired by it).

R has function apply() which applies a function to a matrix and the function does not need to be a reducer.

The function can be reducer, transformer, extender... It generated confusion from novice and it was hard to debug.

So a package developer named Hadley Wickham came along and developed all sort of function that does essentially

the same functionality but throw out errors if the output is not the same format as specified...He made something like apply_for_reducer(), apply_for_transformer(), etc.

I'd like to give +1 for making .apply() dumber but I think you might consider introducing all sorts of .apply_reducer(), .apply_transformer() for debugging convenience and not touching anything for .apply() itself. (Anyhow I think .apply() itself needs some modification)

+1 for explicitly specifying whether the function will be applied to a DataFrame or to a Series. it is quite frustrating to predict it!

@rhshadrach
Copy link
Member Author

I'd like to give +1 for making .apply() dumber but I think you might consider introducing all sorts of .apply_reducer(), .apply_transformer() for debugging convenience and not touching anything for .apply() itself. (Anyhow I think .apply() itself needs some modification)

pandas already has .aggregate and .transform, do these capture what you're suggesting?

@jreback
Copy link
Contributor

jreback commented Mar 31, 2022

these already exist

.agg and .transform

@kwhkim
Copy link
Contributor

kwhkim commented Mar 31, 2022

Yes, if usage of .agg() is restricted to reducers...

Currently, no. we can use .agg() for transformers.

@kwhkim
Copy link
Contributor

kwhkim commented Apr 2, 2022

Here is an example that .agg() works for transformers too

df = pandas.DataFrame(
    {
        "A": ['a', 'a', 'b', 'b', 'b'],
        "B": [0,1,1,1,0],
        "C": [1, 2, 3, 4, 5],
        "D": [1, 2, 3, 4, 5],
    }
)
df.agg(lambda x: x)
##    A  B  C  D
## 0  a  0  1  1
## 1  a  1  2  2
## 2  b  1  3  3
## 3  b  1  4  4
## 4  b  0  5  5

Compare the above with the result below

df.groupby(["A", "B"]).agg(lambda x: x)
##           C       D
## A B                
## a 0       1       1
##   1       2       2
## b 0       5       5
##   1  [3, 4]  [3, 4]

One thing I do not understand is why DataFrame.method() and DataFrame.groupby().method() differs so much...

Isn't it just that applying DataFrame.method() for each group resultsin DataFrame.groupby().method()?

@rhshadrach
Copy link
Member Author

@kwhkim - see #35725

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Apply Apply, Aggregate, Transform, Map Refactor Internal refactoring of code
Projects
None yet
Development

No branches or pull requests

5 participants