DEPR: DataFrameGroupBy numeric_only defaulting to True #46072
Comments
Nice analysis. The case that bit my team was (5) above - we had a DF with a single column that contained lists, and used One thing to consider - the parameter
Good call - I do plan to look into this as well, but would be delighted if someone else wants to :)
@rhshadrach finally caught up on the thread, not clear what input you're looking for
@jbrockmendel - In the bottom half (points 1-6) I've outlined what the behavior is in 1.4 and will be in 2.0 (first bullet point) and what the behavior should be in 1.5 (second bullet point). Just looking for a thumbs up or down (and why).
In 2.0 we're going to have numeric_only=False be the default, and when the user specifies numeric_only=True we're actually going to fully respect that right? And be consistent across the board? If so then thumbs up.
From Pandas changelog: "Changed default of numeric_only in various DataFrameGroupBy methods; all methods now default to numeric_only=False (GH46072)." See pandas-dev/pandas#46072
Why was this changed? numeric_only=False should be the edge case, not the default.
As I understand, using numeric_only=False would ignore timedelta columns. This thread started it all I think: #42395 (comment)
Pandas groupby calls now default to `numeric_only=False`. See pandas-dev/pandas#46072. This change breaks the `get_filter_stats` call when `data.obs` contains `str` and `category` dtype columns. The fix is to call the `median` function explicitly with `numeric_only=True`.
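A minimal sketch of that kind of fix, assuming a small stand-in frame for `data.obs`; the column names (`Channel`, `n_genes`, `barcode`, `cluster`) are hypothetical and not taken from the downstream project:

```python
import pandas as pd

# Hypothetical stand-in for `data.obs`: one numeric column plus
# str and category columns that cannot be passed to median().
obs = pd.DataFrame(
    {
        "Channel": ["s1", "s1", "s2", "s2"],
        "n_genes": [500, 700, 300, 900],
        "barcode": ["AAAC", "AAAG", "AACA", "AACC"],
        "cluster": pd.Categorical(["0", "1", "0", "1"]),
    }
)

# Passing numeric_only=True explicitly restricts the reduction to
# numeric columns, regardless of the default in the installed pandas.
medians = obs.groupby("Channel").median(numeric_only=True)
print(medians)
```

Being explicit here keeps the call independent of which default the installed pandas version uses.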
Context
A summary of this behavior and the consensus thus far that DataFrameGroupBy will have `numeric_only` default to False in 2.0 can be found here: #42395 (comment).

In #41475, the silent dropping of nuisance columns was deprecated.

In #43154, the behavior was changed so that when a DataFrame has `numeric_only` unspecified and subsetting to numeric only columns would leave the DataFrame empty, internally pandas treats `numeric_only` as `False`.

Even though there is consensus that `numeric_only` should default to False, because of the above changes I wanted to make sure there is a consensus on how to go about doing so before proceeding.

For the discussion below, it is useful to have three types of columns in mind:
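- Columns that are kept with `numeric_only=True`.
- Columns that are dropped with `numeric_only=True` but can still be successfully aggregated; e.g. strings with `sum`.
- Columns that are dropped with `numeric_only=True` and cannot be successfully aggregated; e.g. `object`.

Code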
To investigate this on 1.4.x, I have been using the following code. In this code, I am using `.sum()`. However the results for any reduction or transform, whether it be string or callable, should have the same behavior (though that is not the case today). This includes apply and using axis=1 (for which you may want to tilt your head 90 degrees to the left).
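A minimal sketch of a frame with one column of each type. The data is illustrative and an assumption: `A` is taken as the grouping key, `B` as numeric, `C` as strings stored with object dtype, and `D` as plain `object()` instances that cannot be summed.

```python
import pandas as pd

# Illustrative data (an assumption, not the original snippet).
df = pd.DataFrame(
    {
        "A": [1, 1, 2],                                  # grouping key
        "B": [3, 4, 5],                                  # numeric
        "C": pd.Series(["a", "b", "c"], dtype=object),   # strings: summable
        "D": [object(), object(), object()],             # not summable
    }
)

gb = df.groupby("A")

# numeric_only=True keeps only the numeric column B.
numeric_sum = gb.sum(numeric_only=True)

# Strings are not numeric, but sum() still works on them (concatenation)
# when numeric_only=False is passed explicitly.
string_sum = gb[["B", "C"]].sum(numeric_only=False)

print(numeric_sum)
print(string_sum)
```

Selecting subsets such as `gb[["B", "C"]]` is one way to reproduce the individual column combinations discussed below.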
Current and Future behavior
numeric_only=True
Current behavior appears entirely correct and will go unchanged in 1.5/2.0. In particular, when there are no numeric columns in the input, the output is empty as well.
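As a small illustration of the empty-output case, assuming a frame whose only non-key column holds strings (the names `A` and `C` are illustrative):

```python
import pandas as pd

# Only the grouping key A and a string column C: nothing numeric to reduce.
strings_only = pd.DataFrame(
    {"A": [1, 1, 2], "C": pd.Series(["a", "b", "c"], dtype=object)}
)

result = strings_only.groupby("A").sum(numeric_only=True)

# The result has no columns left after the numeric-only subsetting.
n_cols = result.shape[1]
print(n_cols)
```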
numeric_only=False
Current behavior appears entirely correct, in that if there are to be any behavior changes in 2.0, we already emit the appropriate FutureWarning today. The only case where there will be a behavior change from 1.4.x to 2.0 is if the frame contains a nonnumeric column that can't be aggregated. 1.4.x drops the column whereas 2.0 will raise a TypeError.
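The version difference described above can be probed with a sketch like the following. The frame is illustrative (column `D` holds plain `object()` instances, an assumption), and the outcome depends on the installed pandas version:

```python
import pandas as pd

# Column D holds objects that do not support `+`, so it cannot be summed.
df = pd.DataFrame(
    {
        "A": [1, 1, 2],
        "B": [3, 4, 5],
        "D": [object(), object(), object()],
    }
)

try:
    result = df.groupby("A").sum(numeric_only=False)
    # Per the description above, 1.4.x drops the column that fails to aggregate.
    outcome = "dropped" if "D" not in result.columns else "kept"
except TypeError:
    # Per the description above, 2.0 raises instead of dropping.
    outcome = "raised"

print(outcome)
```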
`numeric_only` unspecified (`lib.no_default`)
I'll refer to the columns as in the code above:
1. Columns `['B', 'C', 'D']`: … `numeric_only` defaulting to False in 2.0.
2. Columns `['B', 'C']`: … `numeric_only` defaulting to False in 2.0.
3. Columns `['B', 'D']`: … `numeric_only` defaulting to False in 2.0.
4. Columns `['C', 'D']`
5. Columns `['C']`: … `numeric_only` as True.
6. Columns `['D']`

cc @jreback @jbrockmendel @jorisvandenbossche @simonjayhawkins @Dr-Irv