BUG 58031 -- groupby aggregate dtype consistency #58258
Closed
3 commits
@@ -666,3 +666,28 @@ def weird_func(x):
 
     result = df["decimals"].groupby(df["id1"]).agg(weird_func)
     tm.assert_series_equal(result, expected, check_names=False)
+
+
+def test_groupby_agg_boolean_dype():
+    # GH Issue #58031
+    # Ensure return type of aggregate dtype has consistent behavior
+    # for 'bool' and 'boolean' because boolean not covered under numpy
+
+    df_boolean = DataFrame({"0": [1, 2, 2], "1": [True, True, None]})
+    df_boolean[1] = df_boolean["1"].astype("boolean")
+
+    df_bool = DataFrame({"0": [1, 2, 2], "1": [True, True, None]})
+    df_bool[1] = df_bool["1"].astype("bool")
+
+    boolean_return_type = (
+        df_boolean.groupby("0")
+        .aggregate(lambda s: s.fillna(False).mean())
+        .dtypes.values[0]
+    )
+    bool_return_type = (
+        df_bool.groupby("0")
+        .aggregate(lambda s: s.fillna(False).mean())
+        .dtypes.values[0]
+    )
+
+    assert boolean_return_type == bool_return_type
When writing tests, always test the full result (using
I don't think this is the right fix. While the issue raised is with `boolean`, it can occur with any dtype.

One possible solution is to take `npvalues` and convert it to the corresponding `dtype_backend`, e.g. if it starts out as a `numpy_nullable`, then we can do:

Similarly for the `pyarrow` backend. This seems to me to be the right behavior because it parallels the inference we're doing with NumPy. But I can't say I feel very confident there aren't some cases where this would still go wrong.

cc @jbrockmendel @WillAyd for any thoughts.
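As an illustration only (an assumption, not the reviewer's code and not the internal pandas code path), converting NumPy-inferred values to a nullable backend can be sketched with the public `convert_dtypes` method. The variable name `npvalues` is borrowed from the comment above; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Values as NumPy inference would leave them after a pointwise aggregation
# (illustrative data, e.g. per-group means).
npvalues = np.array([1.0, 0.5])

# Move the NumPy result into the "numpy_nullable" family using public API.
# The internal fix discussed in this thread may look quite different.
nullable = pd.Series(npvalues).convert_dtypes(dtype_backend="numpy_nullable")
print(nullable.dtype)
```

Passing `dtype_backend="pyarrow"` would be the analogous call for the pyarrow family, assuming pyarrow is installed.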
Not sure what the best alternative would be, but it's unfortunate that we'd have to convert pyarrow boolean types to numpy.
Still reading, but my first thought is that instead of comparing to the string "boolean", the check should be `isinstance(dtype, BooleanDtype)`.
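A minimal sketch of the type-based check suggested above, using the public `pd.BooleanDtype` class. The helper name `is_nullable_boolean` is hypothetical and exists only for this example.

```python
import pandas as pd


def is_nullable_boolean(dtype) -> bool:
    """Hypothetical helper: True only for the masked (nullable) boolean dtype."""
    return isinstance(dtype, pd.BooleanDtype)


masked = pd.Series([True, False, None], dtype="boolean")
plain = pd.Series([True, False], dtype="bool")

print(is_nullable_boolean(masked.dtype))
print(is_nullable_boolean(plain.dtype))
```

The `isinstance` check depends on the dtype's class rather than its string alias, so it distinguishes the masked `boolean` dtype from NumPy `bool` without string matching.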
OK, looking at the test makes it clear what the underlying issue is, and it's a tough one. maybe_cast_pointwise_result goes through _from_scalars, which it turns out was a mis-feature (xref #56430). What we need is an EA method that does family-preserving inference (i.e. if you start with pyarrow/masked/numpy/sparse, you end up with pyarrow/masked/numpy/sparse).
Shorter-term, it looks like BooleanArray._from_scalars is insufficiently strict; it should raise when it sees floats.
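The shorter-term idea can be sketched in plain Python. This is a stand-in under stated assumptions, not the pandas implementation: `strict_bool_from_scalars` is a hypothetical name, and it only handles Python `bool` and `None`, not NumPy scalars.

```python
def strict_bool_from_scalars(scalars):
    """Hypothetical stand-in for a stricter boolean `_from_scalars`.

    Accepts only Python bools and None; anything else (e.g. floats) raises,
    so the caller can fall back to generic inference instead of silently
    coercing the values to boolean.
    """
    out = []
    for value in scalars:
        if value is None or isinstance(value, bool):
            out.append(value)
        else:
            raise ValueError(
                f"cannot hold {value!r} ({type(value).__name__}) as boolean"
            )
    return out


try:
    strict_bool_from_scalars([1.0, 0.5])
except ValueError as err:
    print("falling back to generic inference:", err)
```

Raising rather than coercing gives the caller a clear signal that the aggregated values no longer fit the original dtype.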
`maybe_convert_objects` already has a `convert_to_nullable_dtype` - but only for bool/int. Could this be expanded upon - both for other nullable dtypes as well as pyarrow? It's already quite large and complex, but it seems to me there is an advantage of having a single function to call when you are in the situation of "I have an object NumPy array, and need to determine how to do type inference".

That being said, if the EA method @jbrockmendel is proposing completely supplants `maybe_convert_objects` in the long term, that sounds like a good route to me.
Knee-jerk: that might work in this case, but I'm wary of introducing even more ways of accomplishing family-retention across the code base. They're bound to accumulate small inconsistencies which will then be a hassle to smooth out.
No disagreement there. It just seems like your proposed EA method from #58258 (comment) needs to go through all the values, determine what is seen, come up with the output dtype, and then make it so. The first two steps are done in `maybe_convert_objects`, and I'd be wary of duplicating that logic elsewhere.
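The steps listed above (scan the values, record what is seen, pick an output dtype within the starting family) can be sketched in plain Python. Everything here is a hypothetical illustration: the function names, the `FAMILY_DTYPES` table, and the simplified promotion rules are assumptions and do not reproduce `maybe_convert_objects`. The final "make it so" step (building the array) is left out.

```python
# Dtype names per family; this table is an illustrative assumption.
FAMILY_DTYPES = {
    "numpy": {"bool": "bool", "int": "int64", "float": "float64"},
    "masked": {"bool": "boolean", "int": "Int64", "float": "Float64"},
    "pyarrow": {
        "bool": "bool[pyarrow]",
        "int": "int64[pyarrow]",
        "float": "double[pyarrow]",
    },
}


def scan_kinds(values):
    """Steps 1-2: walk all values and record which kinds were seen."""
    seen = set()
    for v in values:
        if v is None:
            seen.add("null")
        elif isinstance(v, bool):  # check bool before int: bool subclasses int
            seen.add("bool")
        elif isinstance(v, int):
            seen.add("int")
        elif isinstance(v, float):
            seen.add("float")
        else:
            seen.add("object")
    return seen


def infer_dtype_in_family(values, family):
    """Step 3: choose an output dtype name that stays inside `family`."""
    seen = scan_kinds(values)
    has_null = "null" in seen
    kinds = seen - {"null"}
    if kinds == {"bool"}:
        kind = "bool"
    elif kinds == {"int"}:
        kind = "int"
    elif kinds and kinds <= {"int", "float"}:
        kind = "float"
    else:
        return "object"
    if family == "numpy" and has_null:
        # NumPy bool/int cannot hold missing values.
        if kind == "bool":
            return "object"
        kind = "float"
    return FAMILY_DTYPES[family][kind]


print(infer_dtype_in_family([1.0, 0.5], "masked"))
print(infer_dtype_in_family([1.0, 0.5], "numpy"))
```

Keeping the scan separate from the family lookup reflects the concern raised in the thread: the "what was seen" logic lives in one place, and only the final mapping varies by family.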
I suspect that what you're describing is likely to be the basis of the MaskedArray._whatever implementation.