-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: GroupBy.agg() numeric_only deprecation with custom function #50538
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
If I turn the numpy function into an actual UDF by using a small wrapper:
then we actually already do get a different (and clearer) deprecation warning:
So at least we should update the warning for a numpy function to match that? Although, what's the reason the numpy functions are taking a different code path? I seem to remember that numpy functions were actually mapped to our built-in functions, but if that's the case, then it should be easy to support numeric_only for those as well? |
To illustrate, with this small (and very ugly) patch special casing the --- a/pandas/core/apply.py
+++ b/pandas/core/apply.py
@@ -122,7 +122,7 @@ class Apply(metaclass=abc.ABCMeta):
# curry if needed
if (
- (kwargs or args)
+ ((kwargs and not (len(kwargs) == 1 and "numeric_only" in kwargs)) or args)
and not isinstance(func, (np.ufunc, str))
and not is_list_like(func)
):
@@ -165,8 +167,9 @@ class Apply(metaclass=abc.ABCMeta):
if callable(arg):
f = com.get_cython_func(arg)
- if f and not args and not kwargs:
- return getattr(obj, f)()
+
+ if f and not args and (not kwargs or (len(kwargs) == 1 and "numeric_only" in kwargs)):
+ return getattr(obj, f)(**kwargs)
# caller can react
return None then the following works (as expected?): >>> df.groupby("key").agg(np.mean, numeric_only=True)
col2
key
a 2.0
b 2.0 |
The intention was to pass through kwargs so that
Only in some code paths do we use Related: #49534. There I advocated for special logic to apply the numeric_only argument (when provided) in the case of string aliases because the user shouldn't have to know that internally I don't see any other options than those given in the OP. Support numeric_only directlyTo support
Doing this will break anyone's code that is supplying a UDF with a Fix the warningFor this approach, we'd need to special case string arguments as in #49534 and fix the warning here. This is not a breaking change and doesn't involving adding to the API. The implementation is slightly tricky, but I believe doable. The real downside is it feels like playing whack-a-mole and I suspect there may be more moles to whack. |
I vaguely remember being frustrated by this a while back, clearly didn't find a good solution at the time.
Agreed that consistency is a priority. I'd lean towards not supporting numeric_only in agg etc at all. More trouble than its worth. |
Yep, I agree, that's confusing (the numpy function might have different algorithmic / floating point characteristics, or even different default parameter values, for example compare
That's indeed a related issue, thanks for the link.
We could initially inspect the UDF and raise a (deprecation) warning if they have It's indeed an additional keyword that would be added, but to be honest, in my mind we already had this keyword, since you could specify it in the common case .. (not realizing this only worked because I don't have a very strong opinion on whether we should support To describe the context where I encountered this: in geopandas we have a |
I believe this is already the case within a given code path. The issue here is that when no kwargs are provided we take the code path that maps to the pandas method, and when kwargs are provided we take the code path that doesn't map to the pandas method. Because of this inconsistency, I don't think there is a way to fix this where I would be comfortable to put it in a patch release. However I'm definitely open to suggestions here.
This would not catch the case that a user provides a UDF accepting kwargs - but perhaps that is a small enough issue that it's okay.
In your example we are just treating "mean" like the In short, I think these are significant enough behavior differences that I would not agree we already have this keyword (but that is very much an opinion).
I'm leaning toward not adding |
For #49352, I failed to realize that when passed a list, agg ignores all args / kwargs. This is a bug I plan to fix in 2.x, but it makes my solution above not viable. Instead, we can follow a similar pattern of rewording the warning that is raised. Edit: Actually, the warning message for #49352 is already okay; I think it can be closed without fixing. I made a comment there. |
We should still try to fix this inconsistency, though. It is very strange that depending on passing an argument or not, you get a different function that is applied in practice. But long term we probably want to remove this special casing altogether? (do you know if there is already an issue about this?)
We could also check if the UDF accepts kwargs based on the signature .. But yeah, I can certainly live with a decision to not generally support |
Agreed - consolidating the paths through groupby is on my roadmap (has been for a while), but it's admittedly been slow going. There are many issues about inconsistencies in output from a user perspective, but not about specific internal paths.
That isn't sufficient though - we'd need to know if it is somehow utilizing
I don't have much opposition or support for it. Some thoughts
Similar to the above, I'm +0. It seems rather straightforward to support but I don't see it enhancing the API much either. One can do |
I might add my opinion on this. In R, those meta functions(I don't know if this is the right term for functionals) can accpet two distinct parameter names. Let's look at the function For instance, the parameter names for sweep are all capitals... and any parameters that follow are in small letters which are the arguments for UDF func This can be one way to go.... (In R, using parameter names in all small letters are something like convention, I think.) Or we might use parameter names for meta functions that start with Then sweep would be like ``sweep(data_matrix, _margin=2, _stats=col_means, _func=func, x=, y=)` These are options we can take or using signature like In this way, we can keep Either way, making compatible with old versions of pandas would be impossible... One additional note, unlike R, Python needs non-keyword arguments come before keyword arguments. This makes the situtation difficult to read for UDF with positional-only and keyword arguments... Let's say UDF func with signature (a, b, /, c, d), call func(3,2,c=5, d=6) should go into .agg() like
So I would go for which make the example above to be #=== EDIT then again, if we make all parameters for it would work fine i guess... as in like even though we need to change the parameter name |
Another inconsistencies or complaints about the overall Let's say
but Finally, |
@kwhkim - please open up separate issues. It is preferable to have issues as narrowly scoped as possible. |
Case: using
groupby().agg()
with a custom numpy funtion object (np.mean
) instead of a built-in function string name ("mean"
).When using pandas 1.5, we get a future warning about numeric_only going to change (#46072), and that you can specify the keyword:
However, if you then try to specify the
numeric_only
keyword as suggested, you get an error:I know this is because when passing a custom function to
agg
, we pass through any kwargs, sonumeric_only
is also passed tonp.mean
, which of course doesn't know that keyword.But this seems a bit confusing with the current warning message. So I am wondering if we should do at least either of both:
numeric_only
only working for built-in functions specified by string name), then the warning message could be updated to be clearernumeric_only=True
for generic numpy functions / UDFs as well? (was this ever discussed before? Only searched briefly, but didn't directly find something)cc @rhshadrach @jbrockmendel
The text was updated successfully, but these errors were encountered: