Skip to content

TST(string dtype): Resolve xfail in groupby.test_size #60711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 24, 2025
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 3 additions & 7 deletions pandas/tests/groupby/methods/test_size.py
Original file line number Diff line number Diff line change
@@ -1,8 +1,6 @@
import numpy as np
import pytest

from pandas._config import using_string_dtype

from pandas import (
DataFrame,
Index,
Expand Down Expand Up @@ -76,18 +74,16 @@ def test_size_series_masked_type_returns_Int64(dtype):
tm.assert_series_equal(result, expected)


# TODO(infer_string) in case the column is object dtype, it should preserve that dtype
# for the result's index
@pytest.mark.xfail(using_string_dtype(), reason="TODO(infer_string)", strict=False)
def test_size_strings(any_string_dtype):
def test_size_strings(any_string_dtype, using_infer_string):
# GH#55627
dtype = any_string_dtype
df = DataFrame({"a": ["a", "a", "b"], "b": "a"}, dtype=dtype)
result = df.groupby("a")["b"].size()
exp_dtype = "Int64" if dtype == "string[pyarrow]" else "int64"
exp_index_dtype = "str" if using_infer_string and dtype == "object" else dtype
expected = Series(
[2, 1],
index=Index(["a", "b"], name="a", dtype=dtype),
index=Index(["a", "b"], name="a", dtype=exp_index_dtype),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't it work if you just remove the dtype argument and let the constructor infer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question - this was introduced in #55627 but I do not see why if the values are string[pyarrow] that the result would be Int64.

cc @phofl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach the Int64 is for exp_dtype on the line below, not for the dtype of the Index being constructed on this line, so I am not entirely understanding your comment/question ?
(the construction of exp_dtype is not being touched in this PR)

Copy link
Member Author

@rhshadrach rhshadrach Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks! When the grouping column is StringDtype, we preserve this even when infer_string is False. In the groupby code, the uniques that go into creating the index is a string array. When the input is object dtype and infer_string=True, we do inference on the values and coerce to dtype str.

So in the object case we're doing inference, whereas in the non-object case we are not. It seems reasonable to me, thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK thanks for the explanation. I am not sure how I feel yet, but at first I wasn't expecting the action of grouping to perform any inference. Is that not a performance hit?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, if this is not easy to "fix" (avoid the inference), then I am personally fine with the current behaviour for now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great - I think we are all leaning in that direction

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line in question is

levels = [Index._with_infer(ping.uniques) for ping in self.groupings]

While it's been moved around recently, I believe that's long standing behavior. I can investigate what impact removing that (so just using standard Index init) would have, but seems independent.

name="b",
dtype=exp_dtype,
)
Expand Down
Loading