DEPR: Enforce deprecation of numeric_only=None in DataFrame aggregations #49551


Merged: 13 commits, Nov 9, 2022
19 changes: 12 additions & 7 deletions doc/source/whatsnew/v1.2.0.rst
@@ -383,12 +383,17 @@ this pathological behavior (:issue:`37827`):

*New behavior*:

.. ipython:: python
:okwarning:
.. code-block:: ipython

df.mean()
In [3]: df.mean()
Out[3]:
A 1.0
dtype: float64

df[["A"]].mean()
In [4]: df[["A"]].mean()
Out[4]:
A 1.0
dtype: float64

Moreover, DataFrame reductions with ``numeric_only=None`` will now be
consistent with their Series counterparts. In particular, for
@@ -415,10 +420,10 @@ instead of casting to a NumPy array which may have different semantics (:issue:`

*New behavior*:

.. ipython:: python
:okwarning:
.. code-block:: ipython

df.any()
In [5]: df.any()
Out[5]: Series([], dtype: bool)


.. _whatsnew_120.api_breaking.python:
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -500,6 +500,7 @@ Removal of prior version deprecations/changes
- Changed behavior of :meth:`Series.__setitem__` with an integer key and a :class:`Float64Index` when the key is not present in the index; previously we treated the key as positional (behaving like ``series.iloc[key] = val``), now we treat it as a label (behaving like ``series.loc[key] = val``), consistent with :meth:`Series.__getitem__` behavior (:issue:`33469`)
- Removed ``na_sentinel`` argument from :func:`factorize`, :meth:`.Index.factorize`, and :meth:`.ExtensionArray.factorize` (:issue:`47157`)
- Changed behavior of :meth:`DataFrameGroupBy.apply` and :meth:`SeriesGroupBy.apply` so that ``group_keys`` is respected even if a transformer is detected (:issue:`34998`)
- Enforced deprecation of ``numeric_only=None`` (the default) in DataFrame reductions that would silently drop columns that raised; ``numeric_only`` now defaults to ``False`` (:issue:`41480`)
-
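
A minimal sketch of the enforced default described in the removal note above (the frame and column names here are invented for illustration):

```python
import pandas as pd

# Mixed-dtype frame; "B" is a non-numeric (object) column.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# With the deprecation enforced, numeric_only defaults to False, so
# reductions no longer drop the string column silently; selecting the
# numeric data has to be explicit.
numeric_mean = df.mean(numeric_only=True)
print(numeric_mean)
```

On pandas 2.0 and later, a bare ``df.mean()`` on this frame raises ``TypeError`` instead of warning and dropping ``"B"``.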

.. ---------------------------------------------------------------------------
@@ -569,6 +570,7 @@ Timezones
Numeric
^^^^^^^
- Bug in :meth:`DataFrame.add` cannot apply ufunc when inputs contain mixed DataFrame type and Series type (:issue:`39853`)
- Bug in DataFrame reduction methods (e.g. :meth:`DataFrame.sum`) with object dtype, ``axis=1`` and ``numeric_only=False``, where the result would not be coerced to float (:issue:`49551`)
-
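
The ``axis=1`` object-dtype case in the note above can be reproduced with a short sketch (the frame mirrors the example from this PR's review discussion; the result dtype is version-dependent, so only the values are shown):

```python
import pandas as pd

# All-object frame, as in the example from this PR's discussion.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, dtype=object)

# Row-wise reduction with an explicit numeric_only=False; the whatsnew
# note above tracks how the dtype of this result is coerced.
res = df.sum(axis=1, numeric_only=False)
print(list(res))  # row sums: [5, 7, 9]
```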

Conversion
47 changes: 9 additions & 38 deletions pandas/core/frame.py
@@ -266,9 +266,8 @@
you to specify a location to update with some value.""",
}

_numeric_only_doc = """numeric_only : bool or None, default None
Include only float, int, boolean data. If None, will attempt to use
everything, then use only numeric data
_numeric_only_doc = """numeric_only : bool, default False
Include only float, int, boolean data.
"""

_merge_doc = """
@@ -10489,7 +10488,7 @@ def _reduce(
*,
axis: Axis = 0,
skipna: bool = True,
numeric_only: bool | None = None,
numeric_only: bool = False,
filter_type=None,
**kwds,
):
@@ -10498,7 +10497,6 @@ def _reduce(

# TODO: Make other agg func handle axis=None properly GH#21597
axis = self._get_axis_number(axis)
labels = self._get_agg_axis(axis)
assert axis in [0, 1]

def func(values: np.ndarray):
@@ -10524,25 +10522,22 @@ def _get_data() -> DataFrame:
data = self._get_bool_data()
return data

numeric_only_bool = com.resolve_numeric_only(numeric_only)
if numeric_only is not None or axis == 0:
if numeric_only or axis == 0:
Member: why can't we go through this path unconditionally?

rhshadrach (Member, Author), Nov 8, 2022: Ah, I don't think what I wrote was all too clear, but I tried to explain this in #49551 (comment). I think we should take this path unconditionally, but there would be a behavior change for numeric_only=False, axis=1 and object dtype. I plan to open an issue and do a followup here.

rhshadrach (Member, Author): Here is an example to make this more explicit (on main):

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype=object)
print(df.sum(axis=1, numeric_only=None))
print(df.sum(axis=1, numeric_only=False))
0    5.0
1    7.0
2    9.0
dtype: float64
0    5
1    7
2    9
dtype: object

Taking this path unconditionally would mean always getting object dtype. I think that's the right thing to do, but it would be a change in the default (numeric_only=None) behavior, so I plan to handle it in a follow-up.

Member: makes sense, thanks

rhshadrach (Member, Author): I opened #49603

# For numeric_only non-None and axis non-None, we know
# which blocks to use and no try/except is needed.
# For numeric_only=None, only the case with axis==0 and no object
# dtypes is unambiguous and can be handled with BlockManager.reduce
# Case with EAs see GH#35881
df = self
if numeric_only_bool:
if numeric_only:
df = _get_data()
if axis == 1:
df = df.T
axis = 0

ignore_failures = numeric_only is None

# After possibly _get_data and transposing, we are now in the
# simple case where we can use BlockManager.reduce
res, _ = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
res, _ = df._mgr.reduce(blk_func, ignore_failures=False)
out = df._constructor(res).iloc[0]
if out_dtype is not None:
out = out.astype(out_dtype)
@@ -10559,36 +10554,11 @@ def _get_data() -> DataFrame:

return out

assert numeric_only is None
assert not numeric_only and axis == 1

data = self
values = data.values

try:
result = func(values)

except TypeError:
# e.g. in nanops trying to convert strs to float

data = _get_data()
labels = data._get_agg_axis(axis)

values = data.values
with np.errstate(all="ignore"):
result = func(values)

# columns have been dropped GH#41480
arg_name = "numeric_only"
if name in ["all", "any"]:
arg_name = "bool_only"
warnings.warn(
"Dropping of nuisance columns in DataFrame reductions "
f"(with '{arg_name}=None') is deprecated; in a future "
"version this will raise TypeError. Select only valid "
"columns before calling the reduction.",
FutureWarning,
stacklevel=find_stack_level(),
)
result = func(values)

if hasattr(result, "dtype"):
if filter_type == "bool" and notna(result).all():
@@ -10600,6 +10570,7 @@ def _get_data() -> DataFrame:
# try to coerce to the original dtypes item by item if we can
pass

labels = self._get_agg_axis(axis)
result = self._constructor_sliced(result, index=labels)
return result
