Skip to content

BUG: Avoid ambiguous condition in GroupBy.first / last #32124

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Feb 23, 2020
Merged

BUG: Avoid ambiguous condition in GroupBy.first / last #32124

merged 10 commits into from
Feb 23, 2020

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Feb 20, 2020

This comparison val == val happens in a lot of these groupby operations but only seems to raise here in the presence of NA. Are we just always / mostly converting to numpy beforehand in the other cases?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool thanks for PR

@@ -887,7 +889,7 @@ def group_last(rank_t[:, :] out,
for j in range(K):
val = values[i, j]

if val == val:
if (val is not NA) and (val == val):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think can use the checknull function from missing instead

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just not checknull(val) for the whole condition? I think there was an issue with checknull not catching decimal.Decimal("nan") but I could try to add that here

@pytest.mark.parametrize("method", ["first", "last"])
def test_first_last_with_na_object(method):
# https://github.com/pandas-dev/pandas/issues/32123
groups = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, pd.NA]}).groupby("a")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should also use nulls_fixture here after #31799

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you merge master? This should be available now

@jbrockmendel
Copy link
Member

#31227 turned up a bunch of other places where having pd.NA in columns/index is going to break the world. the is_matching_na implemented there might be useful here

if method == "first":
values = {"b": [1, 3]}
else:
values = {"b": [2, 3]}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raises a question (that perhaps should be handled in another issue).
GroupBy.last() returns [2, 3] however GroupBy.tail(1) returns [2, pd.NA].
Is this intended behaviour? it is consistent with pandas 0.25, but undocumented.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that last gives the last non-null value but that doesn't seem to be well-documented like you say (probably does warrant a separate issue)

@simonjayhawkins simonjayhawkins added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Feb 21, 2020
@simonjayhawkins simonjayhawkins added this to the 1.0.2 milestone Feb 21, 2020
@simonjayhawkins
Copy link
Member

@dsaxton can you merge master to resolve conflicts

@@ -530,3 +530,23 @@ def test_nth_nan_in_grouper(dropna):
)

tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("method", ["first", "last"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test .nth(0) and .nth(-1) which are the same results (except the nth -1 will have the null as the result

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for nth with nulls; I gave it its own test rather than putting it with first / last to avoid too much awkward if / else branching inside the test

@pytest.mark.parametrize("method", ["first", "last"])
def test_first_last_with_na_object(method, nulls_fixture):
# https://github.com/pandas-dev/pandas/issues/32123
groups = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, 2, 3, nulls_fixture]}).groupby(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also move this belowt est_first_last_nth (it may actually be simpler to modify that test though)

@jreback jreback merged commit 20a84a5 into pandas-dev:master Feb 23, 2020
@jreback
Copy link
Contributor

jreback commented Feb 23, 2020

thanks @dsaxton

@jreback
Copy link
Contributor

jreback commented Feb 23, 2020

@meeseeksdev backport to 1.0.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 23, 2020
@dsaxton dsaxton deleted the grpby-nth branch February 23, 2020 15:07
jreback pushed a commit that referenced this pull request Feb 23, 2020
roberthdevries pushed a commit to roberthdevries/pandas that referenced this pull request Mar 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: GroupBy.first fails with pd.NA on Series with object dtype
6 participants