Skip to content

BUG: Groupby.agg with non-reducing UDF #51445

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jbrockmendel opened this issue Feb 17, 2023 · 4 comments
Open

BUG: Groupby.agg with non-reducing UDF #51445

jbrockmendel opened this issue Feb 17, 2023 · 4 comments
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Feb 17, 2023

In GroupBy._python_agg_general we have a check

        if self.ngroups == 0:
            # e.g. test_evaluate_with_empty_groups different path gets different
            #  result dtype in empty case.
            return self._python_apply_general(f, self._selected_obj, is_agg=True)

The only test that fails if we disable this check is test_evaluate_with_empty_groups, which is passing lambda x: x as its ostensibly-aggregating function. Without this special-casing, we return object dtype instead of float64 dtype in this case. I'm not sure what our policy is on requiring actually-aggregating functions in agg (cc @rhshadrach?) but this feels sketchy.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member Groupby Apply Apply, Aggregate, Transform, Map and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 17, 2023
@rhshadrach
Copy link
Member

rhshadrach commented Feb 20, 2023

I'm not sure what our policy is on requiring actually-aggregating functions in agg (cc @rhshadrach?)

Because apply and agg share code paths in various cases, and apply is allowed any UDF, we currently allow it in agg in some places as well. This should be addressed by #49673. My current view is that we should allow UDFs that don't return scalars in agg, but still treat it as an aggregation. An example where this is useful:

df = DataFrame({"a": [1, 1, 2, 2, 2], "b": range(5)})
gb = df.groupby("a")
result = gb.agg(list)
print(result)
#            b
# a           
# 1     [0, 1]
# 2  [2, 3, 4]

Replacing list with np.array might also be useful, but we currently raise ValueError: Must produce aggregated value there.

@jbrockmendel
Copy link
Member Author

thanks for clarifying. should we close this or classify it as "closed by #49673" or something else?

@rhshadrach
Copy link
Member

Removing the block highlighted in the OP, I'm not seeing any tests fail on main. This logic seems to be repeated after the for loop iterating over slices:

if not output:
# e.g. test_margins_no_values_no_cols
return self._python_apply_general(f, self._selected_obj)

The only difference is the lack of passing is_agg=True, but this argument is not utilized in _python_apply_general.

I'm guessing the case after the for loop might have been useful when failure was allowed, but should not longer be necessary. But it makes sense to me to have one of the checks remain to have apply and agg return the same result in the empty case.

@jbrockmendel
Copy link
Member Author

The code in the OP used to be shared by SeriesGroupBy._python_agg_general and DataFrameGroupBy._python_agg_general until last week, now is only used by DFGB. The SGB case actually did need it, but it got refactored out to be more explicit. So removing it from DFGB._python_agg_general doesn't break any tests, but plausibly could. I wouldn't mind ripping it out anyway.

The if not output check should become unnecessary after your PR changing DFGB._iterate_slices.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby
Projects
None yet
Development

No branches or pull requests

2 participants