BUG: groupby any/all raising with pd.NA object data #42085

Merged 4 commits on Jun 21, 2021
Changes from 3 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -268,6 +268,7 @@ Other enhancements
 - :meth:`read_csv` and :meth:`read_json` expose the argument ``encoding_errors`` to control how encoding errors are handled (:issue:`39450`)
 - :meth:`.GroupBy.any` and :meth:`.GroupBy.all` use Kleene logic with nullable data types (:issue:`37506`)
 - :meth:`.GroupBy.any` and :meth:`.GroupBy.all` return a ``BooleanDtype`` for columns with nullable data types (:issue:`33449`)
+- :meth:`.GroupBy.any` and :meth:`.GroupBy.all` raising with ``object`` data containing ``pd.NA`` even when ``skipna=True`` (:issue:`37501`)
 - :meth:`.GroupBy.rank` now supports object-dtype data (:issue:`38278`)
 - Constructing a :class:`DataFrame` or :class:`Series` with the ``data`` argument being a Python iterable that is *not* a NumPy ``ndarray`` consisting of NumPy scalars will now result in a dtype with a precision the maximum of the NumPy scalars; this was already the case when ``data`` is a NumPy ``ndarray`` (:issue:`40908`)
 - Add keyword ``sort`` to :func:`pivot_table` to allow non-sorting of the result (:issue:`39143`)
5 changes: 4 additions & 1 deletion pandas/core/groupby/groupby.py
@@ -1519,7 +1519,10 @@ def _bool_agg(self, val_test, skipna):

         def objs_to_bool(vals: ArrayLike) -> tuple[np.ndarray, type]:
             if is_object_dtype(vals):
-                vals = np.array([bool(x) for x in vals])
+                # GH#37501: don't raise on pd.NA when skipna=True
+                vals = np.array(
+                    [bool(x) if not (skipna and isna(x)) else True for x in vals]
+                )
Member

While slightly more verbose, it seems better to me to avoid evaluating skipna at each iteration:

if skipna:
    vals = np.array(
        [bool(x) if not isna(x) else True for x in vals]
    )
else:
    vals = np.array([bool(x) for x in vals])

Member Author

Sounds good, have changed

             elif isinstance(vals, BaseMaskedArray):
                 vals = vals._data.astype(bool, copy=False)
             else:
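
The suggestion in the review thread above (branch on `skipna` once, outside the comprehension) can be sketched without pandas. This is only an illustration: `_FakeNA`, `is_missing`, and `objs_to_bool_list` are hypothetical stand-ins invented for this example, not pandas names, and the placeholder `True` is assumed to be ignored downstream because the entry is flagged as missing.

```python
class _FakeNA:
    """Hypothetical stand-in for pd.NA: truth-testing it raises TypeError."""

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


NA = _FakeNA()


def is_missing(x):
    # Simplified stand-in for pandas' isna(): the fake NA, None, or a float NaN.
    return x is NA or x is None or (isinstance(x, float) and x != x)


def objs_to_bool_list(vals, skipna):
    # Decide on skipna once, rather than re-checking it for every element.
    if skipna:
        # Missing entries get a placeholder True so bool() is never called on them.
        return [True if is_missing(x) else bool(x) for x in vals]
    # skipna=False: bool(NA) is allowed to raise, matching the tested behavior.
    return [bool(x) for x in vals]


converted = objs_to_bool_list([NA, 0, 2, float("nan")], skipna=True)
```

With `skipna=False` the same input raises `TypeError` on the first element, which is the behavior the second test in this PR checks for.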
26 changes: 26 additions & 0 deletions pandas/tests/groupby/test_any_all.py
@@ -152,3 +152,29 @@ def test_masked_bool_aggs_skipna(bool_agg_func, dtype, skipna, frame_or_series):

     result = obj.groupby([1, 1]).agg(bool_agg_func, skipna=skipna)
     tm.assert_equal(result, expected)
+
+
+@pytest.mark.parametrize(
+    "bool_agg_func,data,expected_res",
+    [
+        ("any", [pd.NA, np.nan], False),
+        ("any", [pd.NA, 1, np.nan], True),
+        ("all", [pd.NA, pd.NaT], True),
+        ("all", [pd.NA, False, pd.NaT], False),
+    ],
+)
+def test_object_type_missing_vals(bool_agg_func, data, expected_res, frame_or_series):
+    # GH#37501
+    obj = frame_or_series(data, dtype=object)
+    result = obj.groupby([1] * len(data)).agg(bool_agg_func)
+    expected = frame_or_series([expected_res], index=[1], dtype="bool")
+    tm.assert_equal(result, expected)
+
+
+@pytest.mark.filterwarnings("ignore:Dropping invalid columns:FutureWarning")
+@pytest.mark.parametrize("bool_agg_func", ["any", "all"])
+def test_object_NA_raises_with_skipna_false(bool_agg_func):
+    # GH#37501
+    ser = Series([pd.NA], dtype=object)
+    with pytest.raises(TypeError, match="boolean value of NA is ambiguous"):
+        ser.groupby([1]).agg(bool_agg_func, skipna=False)
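
For readers who want to try the behavior outside the test suite, the parametrized cases above translate roughly into the standalone snippet below. It assumes a pandas version that includes this fix (1.3.0 or later) and uses the default `skipna=True`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Object-dtype data mixing pd.NA with other missing markers and real values.
mixed = pd.Series([pd.NA, 1, np.nan], dtype=object)
any_mixed = mixed.groupby([1, 1, 1]).any()  # one truthy value, so True is expected

with_false = pd.Series([pd.NA, False, pd.NaT], dtype=object)
all_with_false = with_false.groupby([1, 1, 1]).all()  # one False, so False is expected

only_missing = pd.Series([pd.NA, np.nan], dtype=object)
any_only_missing = only_missing.groupby([1, 1]).any()  # nothing left after skipping NAs
```

Each result is a one-row Series indexed by the group key `1`. Before this change, the same calls raised `TypeError` because `bool(pd.NA)` was evaluated during the object-to-bool conversion.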