BUG 58031 -- groupby aggregate dtype consistency #58258

longovin · 2024-04-15T01:32:25Z

closes Inconsistent behaviour of GroupBy for BooleanArray series #58031
Tests added and passed: test_groupby_agg_boolean_dtype() added in this directory.
All code checks passed (https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#pre-commit): YES
Added type annotations to new arguments/methods/functions: N/A
Added an entry in the latest doc/source/whatsnew/v3.0.0.rst file.

rhshadrach

Thanks for the PR!

rhshadrach · 2024-04-15T21:13:02Z

pandas/core/groupby/ops.py

@@ -914,7 +914,8 @@ def agg_series(
        np.ndarray or ExtensionArray
        """

-        if not isinstance(obj._values, np.ndarray):
+        # if objtype is not in np.dtypes, type is preserved
+        if not isinstance(obj._values, np.ndarray) and obj.dtype != "boolean":


I don't think this is the right fix. While the issue raised is with boolean, it can occur with any dtype.

One possible solution is to take npvalues and convert it to the corresponding dtype_backend - e.g. if it starts out as a numpy_nullable, then we can do:

    from pandas.core.dtypes.cast import convert_dtypes

    dtype = convert_dtypes(npvalues, dtype_backend="numpy_nullable")
    # We don't actually want Series here - should be returning a NumPy or E array
    out = Series(npvalues, dtype=dtype)

Similarly for the pyarrow backend. This seems to me to be the right behavior because it parallels the inference we're doing with NumPy. But I can't say I feel very confident there aren't some cases where this would still go wrong.

cc @jbrockmendel @WillAyd for any thoughts.
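The backend-aware inference suggested above can be approximated with public API. The following is only a sketch under stated assumptions: it uses the public `Series.convert_dtypes` (with the `dtype_backend` argument available in pandas 2.0 and later) instead of the internal `pandas.core.dtypes.cast.convert_dtypes`, and the helper name `infer_in_backend` is hypothetical, not something from the PR.

```python
import numpy as np
import pandas as pd


def infer_in_backend(npvalues, dtype_backend="numpy_nullable"):
    """Hypothetical helper: infer a dtype for object values within one backend."""
    # Series.convert_dtypes runs type inference and picks the matching
    # extension dtype for the requested backend (assumes pandas >= 2.0).
    converted = pd.Series(npvalues).convert_dtypes(dtype_backend=dtype_backend)
    # Hand back the underlying array rather than the Series wrapper.
    return converted.array


# Object arrays stand in for the raw result of a pointwise aggregation.
bool_result = infer_in_backend(np.array([True, False], dtype=object))
float_result = infer_in_backend(np.array([0.5, 1.5], dtype=object))
```

With this approach a boolean input that aggregates to floats would come back as a nullable float array instead of being forced back to boolean.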

WillAyd · Apr 15, 2024

Not sure what the best alternative would be, but it's unfortunate that we'd have to convert pyarrow boolean types to NumPy.

jbrockmendel · Apr 16, 2024

Still reading, but my first thought is that instead of comparing to the string "boolean", the check should be isinstance(dtype, BooleanDtype).
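A minimal sketch of the isinstance-based check, using the public `pd.BooleanDtype` class. The helper name `is_masked_boolean` is hypothetical and only serves the illustration.

```python
import pandas as pd


def is_masked_boolean(dtype) -> bool:
    """Hypothetical helper: True only for pandas' nullable BooleanDtype."""
    # An isinstance check names the dtype class explicitly instead of
    # relying on how a dtype compares against the string "boolean".
    return isinstance(dtype, pd.BooleanDtype)


nullable = pd.Series([True, False, None], dtype="boolean")
plain = pd.Series([True, False])  # NumPy-backed bool
```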

jbrockmendel · Apr 16, 2024

OK, looking at the test makes it clear what the underlying issue is, and it's a tough one. maybe_cast_pointwise_result goes through _from_scalars, which it turns out was a mis-feature (xref #56430). What we need is an EA method that does family-preserving inference (i.e. if you start with pyarrow/masked/numpy/sparse, you end up with pyarrow/masked/numpy/sparse).

Shorter-term, it looks like BooleanArray._from_scalars is insufficiently strict; it should raise when it sees floats.
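To illustrate what "raise when it sees floats" could look like, here is a rough sketch of a strict constructor. It is not the pandas implementation: `strict_boolean_from_scalars` is a hypothetical stand-in, and the choice of `ValueError` is an assumption made for the example.

```python
import numpy as np
import pandas as pd


def strict_boolean_from_scalars(scalars):
    """Hypothetical strict constructor: accept only booleans and missing values."""
    scalars = list(scalars)
    for value in scalars:
        if isinstance(value, (bool, np.bool_)):
            continue
        if value is None or value is pd.NA:
            continue
        # Floats, ints, strings, etc. are rejected instead of being coerced.
        raise ValueError(f"cannot build a boolean array from {value!r}")
    return pd.array(scalars, dtype="boolean")


ok = strict_boolean_from_scalars([True, None, False])
```

A caller that catches the error can then fall back to a different dtype in the same family rather than silently getting booleans.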

rhshadrach · Apr 16, 2024

maybe_convert_objects already has a convert_to_nullable_dtype - but only for bool/int. Could this be expanded upon - both for other nullable dtypes as well as pyarrow? It's already quite large and complex, but it seems to me there is an advantage of having a single function to call when you are in the situation of "I have an object NumPy array, and need to determine how to do type inference".

That being said, if the EA method @jbrockmendel is proposing completely supplants maybe_convert_objects in the long term, that sounds like a good route to me.

jbrockmendel · Apr 16, 2024

knee-jerk: that might work in this case, but I'm wary of introducing even more ways of accomplishing family-retention across the code base. They're bound to accumulate small inconsistencies which will then be a hassle to smooth out.

rhshadrach · Apr 16, 2024

> but I'm wary of introducing even more ways

No disagreement there. It just seems like your proposed EA method from #58258 (comment) needs to go through all the values, determine what is seen, come up with the output dtype, and then make it so. The first two steps are done in maybe_convert_objects, and I'd be wary of duplicating that logic elsewhere.

jbrockmendel · Apr 16, 2024

I suspect that what you're describing is likely to be the basis of the MaskedArray._whatever implementation.
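The steps discussed in this thread (go through the values, record what is seen, choose an output dtype, then build the array) can be sketched for the masked family as follows. This is an illustration under assumptions, not the logic of maybe_convert_objects or of any existing EA method: the function name `infer_masked` and the fallback to an object ndarray are hypothetical.

```python
import numpy as np
import pandas as pd


def infer_masked(values):
    """Hypothetical sketch of scan -> decide -> construct for masked dtypes."""
    values = list(values)
    seen = set()
    # Step 1: go through the values and record which kinds are present.
    for value in values:
        if value is None or value is pd.NA:
            continue  # missing values do not influence the chosen dtype
        if isinstance(value, (bool, np.bool_)):
            seen.add("bool")  # checked before int because bool subclasses int
        elif isinstance(value, (int, np.integer)):
            seen.add("int")
        elif isinstance(value, (float, np.floating)):
            seen.add("float")
        else:
            seen.add("other")
    # Step 2: choose an output dtype that stays in the masked family.
    if seen == {"bool"}:
        dtype = "boolean"
    elif seen == {"int"}:
        dtype = "Int64"
    elif seen and seen <= {"int", "float"}:
        dtype = "Float64"
    else:
        # Mixed, unsupported, or all-missing input: keep a plain object array.
        return np.array(values, dtype=object)
    # Step 3: build the array with the chosen dtype.
    return pd.array(values, dtype=dtype)
```

Keeping the scan and the dtype decision in one place is the point raised above: duplicating them per array type invites small inconsistencies.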

rhshadrach · 2024-04-15T21:14:03Z

pandas/tests/groupby/aggregate/test_other.py

+        .dtypes.values[0]
+    )
+
+    assert boolean_return_type == bool_return_type


When writing tests, always test the full result (using tm.assert_frame_equal here) instead of just one part of it. See the tests above for examples.
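As an illustration of testing the full result, the sketch below builds an explicit expected frame and compares everything at once (values, index, and dtypes). It assumes the public `pandas.testing` module, aliased as `tm` to mirror the test suite's convention (where `tm` is the internal `pandas._testing`). The data and the `Int64` column are made up for the example and are not the test from this PR.

```python
import pandas as pd
import pandas.testing as tm  # public counterpart of the internal pandas._testing

df = pd.DataFrame(
    {"key": ["a", "a", "b"], "val": pd.array([1, 2, 3], dtype="Int64")}
)
result = df.groupby("key").agg({"val": "sum"})

# Spell out the complete expected frame, including index name and dtype.
expected = pd.DataFrame(
    {"val": pd.array([3, 3], dtype="Int64")},
    index=pd.Index(["a", "b"], name="key"),
)
tm.assert_frame_equal(result, expected)
```

Because dtypes are checked by default, a result that silently changed dtype would fail this comparison even if the values matched.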

github-actions · 2024-05-17T00:05:41Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-06-03T18:36:40Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

longovin added 3 commits April 14, 2024 20:06

made tests and changes for issue 58031 on GH

4f6ee28

fixed all of the pre-commit style errors for GH issue pandas-dev#58031

d71448b

documented our BUG fix for pull request

4059236

longovin requested a review from rhshadrach as a code owner April 15, 2024 01:32

mroeschke added the Groupby, Dtype Conversions (Unexpected or buggy dtype conversions), and Apply (Apply, Aggregate, Transform, Map) labels Apr 15, 2024

rhshadrach requested changes Apr 15, 2024

jbrockmendel mentioned this pull request Apr 16, 2024

REF/EA-API: EA constructor without dtype specified #56430

Open

rhshadrach mentioned this pull request Apr 22, 2024

Convert result of group by agg to pyarrow if input is pyarrow #58129

Closed

5 tasks

github-actions bot added the Stale label May 17, 2024

mroeschke closed this Jun 3, 2024


WillAyd Apr 15, 2024

jbrockmendel Apr 16, 2024

jbrockmendel Apr 16, 2024

rhshadrach Apr 16, 2024

jbrockmendel Apr 16, 2024

rhshadrach Apr 16, 2024

jbrockmendel Apr 16, 2024
