API: Should apply be smart? #39209

rhshadrach · 2021-01-16T14:53:09Z

..., this got me a bit thinking about the general issue we are tackling here: should apply() be smart and try to infer what kind of function is being applied? Or should it not try to be smart and always do the same (eg always prepend groups by default).

For the specific case where the applied function is a transformation (preserving identical index), the consequence of the changes we decided to do in this PR, is that we want apply to no longer be smart.

But currently (and this is long-time behaviour), apply tries to be smart and this is also documented. In the "flexible apply" section in the docs, there is a note that says:

apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it.
So depending on the path taken, and exactly what you are grouping. Thus the grouped column(s) may be included in
the output as well as set the indices.

(So if we go forward as is with this PR, that note should actually be updated.)

The consequence of this PR is that apply will no longer try to detect if it acted as a transformer or not (at least by default with group_keys=True, as group_keys=False still gives this reindexing behaviour for a transformer as discussed above).

But if we no longer detect the transformer case, should we then also stop detecting the reducer case?

IMO, groupby.apply should not be a one-stop-shop for all your UDF needs. If users want to use a reducer, use agg; if they want to use a transformer, use transform; if they want to use a filter, use filter. The role of apply, to me, should be for UDFs that don't fit into these roles. In this sense, I think apply should not go out of its way to generate more convenient/sensible output for these classes of UDFs.

The result of a groupby is a mapping from each group to a python object. Whatever behavior is decided for apply, it should be well-defined and entire with no more assumptions than that. Here is a naive attempt at doing so.

If all constituent results are Series/DataFrames, concatenate results together.
Otherwise, the result is a Series where each value is what was returned from the UDF; dtype is inferred.
keyword-only arg on apply (not groupby) to control if the groups are added to the index, columns, or neither.

Perhaps numpy-specific support should be added in. Probably things need to be added to support UDFs that reach groupby.apply via resample/window, but I'm not familiar enough with that code to reason about it.

The text was updated successfully, but these errors were encountered:

mroeschke · 2021-01-16T23:21:50Z

Just for reference, rolling.apply and groupby.rolling.apply only accepts reducers (but in general the rolling operation acts as transform)

+1 in general for making groupby.apply be "less smart". Maybe to aid the discussion, might be good to list the behaviors of groupby.apply that should be deprecated because its redundant with other methods.

TomAugspurger · 2021-01-17T01:53:36Z

And it's worth discussing if we should provide alternative APIs for one of .apply's strength: that it operates on DataFrames rather than columns. Would we be interested in something like DataFrameGroupBy.agg_table, where the callable takes a DataFrame an aggregated result? And a .transform_table?

jreback · 2021-01-17T03:01:49Z

-1 on any additional api
we already have too much
+1 on stricter apply

rhshadrach · 2021-01-28T22:55:54Z

@TomAugspurger DataFrame.transform may also work on frames:

    # Two possible ways to use a UDF - apply or call directly
    try:
        return obj.apply(func, args=args, **kwargs)
    except Exception:
        return func(obj, *args, **kwargs)

It seems to me support for column and/or frame ops in apply/agg/transform/filter should be consistent. If we are to support both (and I think we should), it should be through user-passed arguments (e.g. by_column=True/False) or separate methods instead of inference (try-A-then-try-B as in the transform code above). This would be expanding the API, but my intent here is to make maintenance easier. IMO, try-A-then-try-B can be a source of regressions, performance degradation, makes it harder to reason about the code, and makes errors messages less helpful.

@mroeschke: Great idea - here are the two that I see:

Try operating on all columns, if this fails, try again after removing the grouping columns.
Shaping the result differently based on mutated/not_indexed_same (e.g. in groupby.ops.apply / groupby.generic._wrap_applied_output)

kwhkim · 2022-03-31T16:16:00Z

I'd like to add my opinion in this discussion.

I am familiar with R and I understand pandas was originated from R(? or at least inspired by it).

R has function apply() which applies a function to a matrix and the function does not need to be a reducer.

The function can be reducer, transformer, extender... It generated confusion from novice and it was hard to debug.

So a package developer named Hadley Wickham came along and developed all sort of function that does essentially

the same functionality but throw out errors if the output is not the same format as specified...He made something like apply_for_reducer(), apply_for_transformer(), etc.

I'd like to give +1 for making .apply() dumber but I think you might consider introducing all sorts of .apply_reducer(), .apply_transformer() for debugging convenience and not touching anything for .apply() itself. (Anyhow I think .apply() itself needs some modification)

+1 for explicitly specifying whether the function will be applied to a DataFrame or to a Series. it is quite frustrating to predict it!

rhshadrach · 2022-03-31T17:12:51Z

I'd like to give +1 for making .apply() dumber but I think you might consider introducing all sorts of .apply_reducer(), .apply_transformer() for debugging convenience and not touching anything for .apply() itself. (Anyhow I think .apply() itself needs some modification)

pandas already has .aggregate and .transform, do these capture what you're suggesting?

jreback · 2022-03-31T17:54:29Z

these already exist

.agg and .transform

kwhkim · 2022-03-31T18:53:44Z

Yes, if usage of .agg() is restricted to reducers...

Currently, no. we can use .agg() for transformers.

kwhkim · 2022-04-02T05:29:56Z

Here is an example that .agg() works for transformers too

df = pandas.DataFrame(
    {
        "A": ['a', 'a', 'b', 'b', 'b'],
        "B": [0,1,1,1,0],
        "C": [1, 2, 3, 4, 5],
        "D": [1, 2, 3, 4, 5],
    }
)
df.agg(lambda x: x)
##    A  B  C  D
## 0  a  0  1  1
## 1  a  1  2  2
## 2  b  1  3  3
## 3  b  1  4  4
## 4  b  0  5  5

Compare the above with the result below

df.groupby(["A", "B"]).agg(lambda x: x)
##           C       D
## A B                
## a 0       1       1
##   1       2       2
## b 0       5       5
##   1  [3, 4]  [3, 4]

One thing I do not understand is why DataFrame.method() and DataFrame.groupby().method() differs so much...

Isn't it just that applying DataFrame.method() for each group resultsin DataFrame.groupby().method()?

rhshadrach · 2022-04-02T16:27:34Z

@kwhkim - see #35725

rhshadrach added Bug Needs Triage Issue that has not been reviewed by a pandas team member API Design Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 16, 2021

rhshadrach mentioned this issue Jan 16, 2021

API: User-control of result keys in GroupBy.apply #34998

Merged

rhshadrach mentioned this issue Apr 23, 2021

Higher Order Methods API #41112

Open

mroeschke added Apply Apply, Aggregate, Transform, Map Refactor Internal refactoring of code and removed Needs Discussion Requires discussion from core team before further action labels Aug 15, 2021

jorisvandenbossche mentioned this issue Nov 8, 2022

API: Consolidate groupby as_index and group_keys #49543

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: Should apply be smart? #39209

API: Should apply be smart? #39209

rhshadrach commented Jan 16, 2021

mroeschke commented Jan 16, 2021

TomAugspurger commented Jan 17, 2021

jreback commented Jan 17, 2021

rhshadrach commented Jan 28, 2021

kwhkim commented Mar 31, 2022

rhshadrach commented Mar 31, 2022

jreback commented Mar 31, 2022

kwhkim commented Mar 31, 2022

kwhkim commented Apr 2, 2022 •

edited

Loading

rhshadrach commented Apr 2, 2022

API: Should apply be smart? #39209

API: Should apply be smart? #39209

Comments

rhshadrach commented Jan 16, 2021

mroeschke commented Jan 16, 2021

TomAugspurger commented Jan 17, 2021

jreback commented Jan 17, 2021

rhshadrach commented Jan 28, 2021

kwhkim commented Mar 31, 2022

rhshadrach commented Mar 31, 2022

jreback commented Mar 31, 2022

kwhkim commented Mar 31, 2022

kwhkim commented Apr 2, 2022 • edited Loading

rhshadrach commented Apr 2, 2022

kwhkim commented Apr 2, 2022 •

edited

Loading