Skip to content

BUG: Avoid ambiguous condition in GroupBy.first / last #32124

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Feb 23, 2020
Merged
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.0.2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ Bug fixes

- Fix bug in :meth:`DataFrame.convert_dtypes` for columns that were already using the ``"string"`` dtype (:issue:`31731`).
- Fixed bug in setting values using a slice indexer with string dtype (:issue:`31772`)
- Fixed bug where :meth:`GroupBy.first` and :meth:`GroupBy.last` would raise a ``TypeError`` when groups contained ``pd.NA`` in a column of object dtype (:issue:`32123`)

.. ---------------------------------------------------------------------------

Expand Down
6 changes: 4 additions & 2 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,8 @@ from pandas._libs.algos cimport (swap, TiebreakEnumType, TIEBREAK_AVERAGE,
from pandas._libs.algos import (take_2d_axis1_float64_float64,
groupsort_indexer, tiebreakers)

from pandas._libs.missing import NA

cdef int64_t NPY_NAT = get_nat()
_int64_max = np.iinfo(np.int64).max

Expand Down Expand Up @@ -887,7 +889,7 @@ def group_last(rank_t[:, :] out,
for j in range(K):
val = values[i, j]

if val == val:
if (val is not NA) and (val == val):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think can use the checknull function from missing instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just not checknull(val) for the whole condition? I think there was an issue with checknull not catching decimal.Decimal("nan") but I could try to add that here

# NB: use _treat_as_na here once
# conditional-nogil is available.
nobs[lab, j] += 1
Expand Down Expand Up @@ -976,7 +978,7 @@ def group_nth(rank_t[:, :] out,
for j in range(K):
val = values[i, j]

if val == val:
if (val is not NA) and (val == val):
# NB: use _treat_as_na here once
# conditional-nogil is available.
nobs[lab, j] += 1
Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/groupby/test_nth.py
Original file line number Diff line number Diff line change
Expand Up @@ -530,3 +530,20 @@ def test_nth_nan_in_grouper(dropna):
)

tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("method", ["first", "last"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test .nth(0) and .nth(-1) which are the same results (except the nth -1 will have the null as the result

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for nth with nulls; I gave it its own test rather than putting it with first / last to avoid too much awkward if / else branching inside the test

def test_first_last_with_na_object(method):
# https://github.com/pandas-dev/pandas/issues/32123
groups = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, pd.NA]}).groupby("a")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also use nulls_fixture here after #31799

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge master? This should be available now

result = getattr(groups, method)()

if method == "first":
values = {"b": [1, 3]}
else:
values = {"b": [2, 3]}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises a question (that perhaps should be handled in another issue).
GroupBy.last() returns [2, 3] however GroupBy.tail(1) returns [2, pd.NA].
Is this intended behaviour? it is consistent with pandas 0.25, but undocumented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that last gives the last non-null value but that doesn't seem to be well-documented like you say (probably does warrant a separate issue)


idx = pd.Index([1, 2], name="a")
expected = pd.DataFrame(values, index=idx)

tm.assert_frame_equal(result, expected)