TST(string dtype): Resolve xfail in groupby.test_size #60711
Merged: WillAyd merged 1 commit into pandas-dev:main from rhshadrach:str_xfail_groupby_inference on Jan 24, 2025
Why doesn't it work if you just remove the `dtype` argument and let the constructor infer?
Good question - this was introduced in #55627, but I do not see why, if the values are `string[pyarrow]`, the result would be `Int64`. cc @phofl
@rhshadrach the `Int64` is for `exp_dtype` on the line below, not for the dtype of the Index being constructed on this line, so I do not entirely understand your comment/question. (The construction of `exp_dtype` is not being touched in this PR.)
Ah, thanks! When the grouping column is StringDtype, we preserve this even when infer_string is False. In the groupby code, the uniques that go into creating the index are a string array. When the input is object dtype and `infer_string=True`, we do inference on the values and coerce to dtype `str`.

So in the object case we're doing inference, whereas in the non-object case we are not. It seems reasonable to me, thoughts?
Ah OK thanks for the explanation. I am not sure how I feel yet, but at first I wasn't expecting the action of grouping to perform any inference. Is that not a performance hit?
FWIW, if this is not easy to "fix" (avoid the inference), then I am personally fine with the current behaviour for now
Great - I think we are all leaning in that direction
The line in question is pandas/pandas/core/groupby/ops.py, line 755 in e3b2de8.
While it's been moved around recently, I believe that's long-standing behavior. I can investigate what impact removing that (so just using standard `Index` init) would have, but that seems independent.