BUG: Groupby.agg with non-reducing UDF #51445

jbrockmendel · 2023-02-17T02:14:37Z

In GroupBy._python_agg_general we have a check

        if self.ngroups == 0:
            # e.g. test_evaluate_with_empty_groups different path gets different
            #  result dtype in empty case.
            return self._python_apply_general(f, self._selected_obj, is_agg=True)

The only test that fails if we disable this check is test_evaluate_with_empty_groups, which is passing lambda x: x as its ostensibly-aggregating function. Without this special-casing, we return object dtype instead of float64 dtype in this case. I'm not sure what our policy is on requiring actually-aggregating functions in agg (cc @rhshadrach?) but this feels sketchy.
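For readers who want to see the case being discussed, here is a minimal sketch modeled on what test_evaluate_with_empty_groups is described as doing above: an identity UDF passed to agg on a frame with zero groups. The frame layout, the column names 1 and 2, and the explicit float64 input dtype are assumptions made for illustration, not copied from the test. The resulting dtype is printed rather than asserted because it depends on which code path a given pandas version takes.

```python
import pandas as pd

# Empty frame with explicit float64 columns (assumed layout, for illustration).
df = pd.DataFrame({1: [], 2: []}, dtype="float64")
g = df.groupby(1, group_keys=False)

# An "ostensibly-aggregating" function that does not actually reduce.
result = g[2].agg(lambda x: x)

# Whether this is float64 or object is the question raised in this issue.
print(type(result).__name__, result.dtype)
```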


rhshadrach · 2023-02-20T04:30:37Z

> I'm not sure what our policy is on requiring actually-aggregating functions in agg (cc @rhshadrach?)

Because apply and agg share code paths in various cases, and apply is allowed any UDF, we currently allow it in agg in some places as well. This should be addressed by #49673. My current view is that we should allow UDFs that don't return scalars in agg, but still treat it as an aggregation. An example where this is useful:

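from pandas import DataFrame

df = DataFrame({"a": [1, 1, 2, 2, 2], "b": range(5)})
gb = df.groupby("a")
# list does not return a scalar, but each group still maps to one value
result = gb.agg(list)
print(result)
#            b
# a
# 1     [0, 1]
# 2  [2, 3, 4]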

Replacing list with np.array might also be useful, but we currently raise ValueError: Must produce aggregated value there.
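A small sketch for checking the np.array behavior on whatever pandas version is installed. It records the outcome instead of assuming the ValueError quoted above, since the behavior may differ between versions. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 2], "b": range(5)})
gb = df.groupby("a")

# The list case from the comment above: one list per group.
as_lists = gb.agg(list)

# Try the np.array variant and record what happens rather than assuming it.
try:
    gb.agg(np.array)
    outcome = "returned a result"
except Exception as err:  # broad on purpose: the error type may vary by version
    outcome = f"raised {type(err).__name__}: {err}"

print(outcome)
```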

jbrockmendel · 2023-02-20T21:38:44Z

Thanks for clarifying. Should we close this, classify it as "closed by #49673", or something else?

rhshadrach · 2023-02-22T03:54:22Z

Removing the block highlighted in the OP, I'm not seeing any tests fail on main. This logic seems to be repeated after the for loop iterating over slices:

pandas/pandas/core/groupby/generic.py, lines 1354 to 1356 in 3f3102b:

        if not output:
            # e.g. test_margins_no_values_no_cols
            return self._python_apply_general(f, self._selected_obj)

The only difference is the lack of passing is_agg=True, but this argument is not utilized in _python_apply_general.

I'm guessing the check after the for loop might have been useful when failure was allowed, but it should no longer be necessary. It still makes sense to me to keep one of the two checks so that apply and agg return the same result in the empty case.
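One way to check the "same result in the empty case" property by hand is to run the same non-reducing UDF through both methods on a frame with no groups. This is a sketch under assumptions: the column names key and val and the helper identity are hypothetical, and the dtypes are printed rather than asserted because they depend on the pandas version.

```python
import pandas as pd


def identity(x):
    # Non-reducing UDF: returns its input unchanged.
    return x


# Empty frame with explicit float64 columns (assumed layout, for illustration).
df = pd.DataFrame({"key": [], "val": []}, dtype="float64")
g = df.groupby("key", group_keys=False)["val"]

via_agg = g.agg(identity)
via_apply = g.apply(identity)

# If the empty-case check does its job, these should agree.
print("agg:  ", via_agg.dtype)
print("apply:", via_apply.dtype)
```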

jbrockmendel · 2023-02-22T04:04:25Z

The code in the OP was shared by SeriesGroupBy._python_agg_general and DataFrameGroupBy._python_agg_general until last week; now it is only used by DFGB. The SGB case actually did need it, but it got refactored out to be more explicit. So removing it from DFGB._python_agg_general doesn't break any tests, but plausibly could. I wouldn't mind ripping it out anyway.

The if not output check should become unnecessary after your PR changing DFGB._iterate_slices.

jbrockmendel added the Bug, Needs Triage, Groupby, and Apply labels and removed the Needs Triage label on Feb 17, 2023

jbrockmendel mentioned this issue Mar 16, 2023

BUG (2.0rc0): groupby apply with a UDF changes dtype unexpectedly from double[pyarrow] to float64 #51991


rhshadrach mentioned this issue Apr 13, 2023

BUG: "unique" and pd.Series.unique produce different results with aggregation #39920

