Skip to content

BUG: any/all not returning booleans for object type #41102

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 6, 2021
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -696,6 +696,7 @@ Numeric
- Bug in :meth:`DataFrame.transform` would raise ``SpecificationError`` when passed a dictionary and columns were missing; will now raise a ``KeyError`` instead (:issue:`40004`)
- Bug in :meth:`DataFrameGroupBy.rank` giving incorrect results with ``pct=True`` and equal values between consecutive groups (:issue:`40518`)
- Bug in :meth:`Series.count` would result in an ``int32`` result on 32-bit platforms when argument ``level=None`` (:issue:`40908`)
- Bug in :class:`Series` and :class:`DataFrame` reductions with methods ``any`` and ``all`` not returning boolean results for object data (:issue:`12863`, :issue:`35450`, :issue:`27709`)

Conversion
^^^^^^^^^^
Expand Down
12 changes: 12 additions & 0 deletions pandas/core/nanops.py
Original file line number Diff line number Diff line change
Expand Up @@ -486,6 +486,12 @@ def nanany(
False
"""
values, _, _, _, _ = _get_values(values, skipna, fill_value=False, mask=mask)

# For object type, any won't necessarily return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this in _get_values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now _get_values is unaware of the nanop it is being used for, so would need slight refactoring to allow that.

While the current solution duplicates the workaround, I like it better than adding in _get_values since this workaround only affects any/all. I think keeping this weird condition out of a path all other nanops hit makes sense.

# boolean values (numpy/numpy#4352)
if is_object_dtype(values):
values = values.astype(bool)

# error: Incompatible return value type (got "Union[bool_, ndarray]", expected
# "bool")
return values.any(axis) # type: ignore[return-value]
Expand Down Expand Up @@ -526,6 +532,12 @@ def nanall(
False
"""
values, _, _, _, _ = _get_values(values, skipna, fill_value=True, mask=mask)

# For object type, all won't necessarily return
# boolean values (numpy/numpy#4352)
if is_object_dtype(values):
values = values.astype(bool)

# error: Incompatible return value type (got "Union[bool_, ndarray]", expected
# "bool")
return values.all(axis) # type: ignore[return-value]
Expand Down
4 changes: 2 additions & 2 deletions pandas/tests/apply/test_series_apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -441,8 +441,8 @@ def test_non_callable_aggregates(how):
("sum", "abc"),
("max", "c"),
("min", "a"),
("all", "c"), # see GH12863
("any", "a"),
("all", True),
("any", True),
],
),
),
Expand Down
27 changes: 24 additions & 3 deletions pandas/tests/frame/test_reductions.py
Original file line number Diff line number Diff line change
Expand Up @@ -1068,13 +1068,17 @@ def test_idxmax_dt64_multicolumn_axis1(self):

@pytest.mark.parametrize("opname", ["any", "all"])
def test_any_all(self, opname, bool_frame_with_na, float_string_frame):
assert_bool_op_calc(
opname, getattr(np, opname), bool_frame_with_na, has_skipna=True
)
assert_bool_op_api(
opname, bool_frame_with_na, float_string_frame, has_bool_only=True
)

@pytest.mark.xfail(reason="GH12863: numpy result won't match for object type")
@pytest.mark.parametrize("opname", ["any", "all"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this is trying to match numpy? do we need this anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt the original purpose was matching numpy, probably more for adding tests without having to hardcode an expected result. For now I modified this slightly to remove the need for the xfail, but probably could be removed (there's coverage, but it's scattered).

def test_any_all_matches_numpy(self, opname, bool_frame_with_na):
assert_bool_op_calc(
opname, getattr(np, opname), bool_frame_with_na, has_skipna=True
)

def test_any_all_extra(self):
df = DataFrame(
{
Expand Down Expand Up @@ -1108,6 +1112,23 @@ def test_any_all_extra(self):
result = df[["C"]].all(axis=None).item()
assert result is True

@pytest.mark.parametrize("axis", [0, 1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have testing of the skipna=True

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some scattered testing elsewhere, added skipna parameterization in this test

@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
def test_any_all_object_dtype(self, axis, bool_agg_func):
# GH#35450
df = DataFrame(
data=[
[1, np.nan, np.nan, True],
[np.nan, 2, np.nan, True],
[np.nan, np.nan, np.nan, True],
[np.nan, np.nan, np.nan, np.nan],
]
)

result = getattr(df, bool_agg_func)(axis=axis, skipna=False)
expected = Series([True, True, True, True])
tm.assert_series_equal(result, expected)

def test_any_datetime(self):

# GH 23070
Expand Down
27 changes: 25 additions & 2 deletions pandas/tests/reductions/test_reductions.py
Original file line number Diff line number Diff line change
Expand Up @@ -896,7 +896,7 @@ def test_all_any(self):

# Alternative types, with implicit 'object' dtype.
s = Series(["abc", True])
assert "abc" == s.any() # 'abc' || True => 'abc'
assert s.any()

@pytest.mark.parametrize("klass", [Index, Series])
def test_numpy_all_any(self, klass):
Expand All @@ -913,7 +913,7 @@ def test_all_any_params(self):
s2 = Series([np.nan, False])
assert s1.all(skipna=False) # nan && True => True
assert s1.all(skipna=True)
assert np.isnan(s2.any(skipna=False)) # nan || False => nan
assert s2.any(skipna=False)
assert not s2.any(skipna=True)

# Check level.
Expand Down Expand Up @@ -941,6 +941,29 @@ def test_all_any_params(self):
with pytest.raises(NotImplementedError, match=msg):
s.all(bool_only=True)

@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
@pytest.mark.parametrize("skipna", [True, False])
def test_any_all_object_dtype(self, bool_agg_func, skipna):
# GH#12863
ser = Series(["a", "b", "c", "d", "e"], dtype=object)
result = getattr(ser, bool_agg_func)(skipna=skipna)
expected = True

assert result == expected

@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
@pytest.mark.parametrize(
"data", [[False, None], [None, False], [False, np.nan], [np.nan, False]]
)
def test_any_all_object_dtype_missing(self, data, bool_agg_func):
# GH#27709
ser = Series(data)
result = getattr(ser, bool_agg_func)(skipna=False)

# None is treated is False, but np.nan is treated as True
expected = bool_agg_func == "any" and None not in data
assert result == expected

@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
@pytest.mark.parametrize("skipna", [True, False])
@pytest.mark.parametrize(
Expand Down
1 change: 1 addition & 0 deletions pandas/tests/test_nanops.py
Original file line number Diff line number Diff line change
Expand Up @@ -270,6 +270,7 @@ def _badobj_wrap(self, value, func, allow_complex=True, **kwargs):
value = value.astype("f8")
return func(value, **kwargs)

@pytest.mark.xfail(reason="GH12863: numpy result won't match for object type")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this ? (same reason above)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar response - think more for coverage than matching numpy. Since the nanop gets indirectly tested for all other any/all reductions, I think it could be removed

@pytest.mark.parametrize(
"nan_op,np_op", [(nanops.nanany, np.any), (nanops.nanall, np.all)]
)
Expand Down