-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API/DEPR: numeric_only kwarg for apply/reductions #28900
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This is in need of some clean up and could certainly be refactored; might be a tough sell to do away with the implicit dropping of "nuisance" columns though I know the OP mentions DataFrame ops, but worth mention this is very prevalent in GroupBy as well |
Agreed with this. Ignoring object-dtype for the moment, can we determine the valid dtypes to include / drop before doing the operation? |
@TomAugspurger can you provide some context for the block quote? It looks like a dask thing that might be unrelated (or could be related in a really interesting way).
I think so. Will become more certain as the exception-catching PRs in groupby progress. |
Copy-paste fail. Fixed. |
Related: what would you expect the behavior to be if you did Motivated by a case I'm troubleshooting in tests.groupby.test_function.test_arg_passthru where it looks like it is trying the op on non-numeric (in this case categorical[str]) columns and then dropping those. |
@jorisvandenbossche you might be interested in weighing in here. The current behavior of suppressing exceptions and excluding columns is a big part of the reason why groupby is so hard to reason about |
I agree that it seems a big breakage to go away from the automatic dropping of "nuisance" columns.
I would expect it to raise, like it does for |
@jreback on the call today you seemed positive on this idea, asked how many tests rely on current behavior. Looks like 50 total, roughly a third in DataFrame reductions (mostly trying to do math on strings), a third in groupby.apply/agg (some strings, some Categorical), and the rest window (mostly dt64) I've spent a big part of the afternoon trying to get the deprecation for #36076 right and its a PITA, would be a lot easier following this. |
Note that datetime/timedelta are numeric dtypes (correct?) but still don't support any/all, so I don't think it's fully related? (for categorical it's of course different) |
Thinking about how to actually implement this deprecation, it would look something like:
and then I guess after that is enforced in 2.0, we then do another deprecation to get rid of the kwarg altogether? |
#don't we normally just change the default to None; then can easily tell if it's user changed and 2.0 just drop it entirely |
numeric_only=None is the default, and its the most problem-causing of the options (True, False, None) |
ahh then can we do lib.no_default? we want to warn if it's explicitly passed |
Yes, but the behavior is going to change even if it is not explicitly passed, so we need to warn in that case too. |
Personally, I am not yet fully convinced we actually want to deprecate this. Automatically dropping columns that are not relevant for the aggregation is also quite convenient, and I am sure a ton of code relies on this behaviour. But that doesn't mean we can't try to improve the complexity of the situation. For example echoing @TomAugspurger's comment above (#28900 (comment)), would it be possible to more deterministically know which columns will be included?
If we deprecate, I think we should ensure we only raise a warning when actually something would change for the user. So eg when having only numerical columns, the exact value of |
AFAICT there are two options to achieve internally consistent behavior:
My main preference is to fix as much as possible of these without waiting for 2.0.
For sufficiently simple cases this is doable, but things get pretty FUBAR pretty quickly. |
A lot of complexity in DataFrame._reduce/apply/groupby ops is driven by the numeric_only=None. In these cases we try to apply the reduction to non-numeric dtypes and if it fails, exclude them. This hugely complicated our exception handling.
We should consider removing that option and require users to subset the appropriate columns before doing their operations.
The text was updated successfully, but these errors were encountered: