BUG 58031 -- groupby aggregate dtype consistency #58258

Closed
wants to merge 3 commits into from
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -329,10 +329,12 @@ Bug fixes
- Fixed bug in :meth:`DataFrame.to_string` that raised ``StopIteration`` with nested DataFrames. (:issue:`16098`)
- Fixed bug in :meth:`DataFrame.transform` that was returning the wrong order unless the index was monotonically increasing. (:issue:`57069`)
- Fixed bug in :meth:`DataFrame.update` bool dtype being converted to object (:issue:`55509`)
- Fixed bug in :meth:`DataFrameGroupBy.aggregate` that had inconsistent ``dtype`` behavior for ``BooleanArray`` (:issue:`58031`)
- Fixed bug in :meth:`DataFrameGroupBy.apply` that was returning a completely empty DataFrame when all return values of ``func`` were ``None`` instead of returning an empty DataFrame with the original columns and dtypes. (:issue:`57775`)
- Fixed bug in :meth:`Series.diff` allowing non-integer values for the ``periods`` argument. (:issue:`56607`)
- Fixed bug in :meth:`Series.rank` that doesn't preserve missing values for nullable integers when ``na_option='keep'``. (:issue:`56976`)
- Fixed bug in :meth:`Series.replace` and :meth:`DataFrame.replace` inconsistently replacing matching instances when ``regex=True`` and missing values are present. (:issue:`56599`)
- Fixed bug in :meth:`read_csv raising` :meth:`TypeError` when ``index_col`` is specified and ``na_values`` is a dict containing the key ``None``. (:issue:`57547`)

Categorical
^^^^^^^^^^^
3 changes: 2 additions & 1 deletion pandas/core/groupby/ops.py
@@ -914,7 +914,8 @@ def agg_series(
np.ndarray or ExtensionArray
"""

- if not isinstance(obj._values, np.ndarray):
+ # if objtype is not in np.dtypes, type is preserved
+ if not isinstance(obj._values, np.ndarray) and obj.dtype != "boolean":

@rhshadrach (Member), Apr 15, 2024

I don't think this is the right fix. While the issue raised is with boolean, it can occur with any dtype.

One possible solution is to take npvalues and convert it to the corresponding dtype_backend - e.g. if it starts out as a numpy_nullable, then we can do:

from pandas.core.dtypes.cast import convert_dtypes
            
dtype = convert_dtypes(npvalues, dtype_backend="numpy_nullable")
# We don't actually want a Series here; this should return a NumPy array or ExtensionArray
out = Series(npvalues, dtype=dtype)

Similarly for the pyarrow backend. This seems to me to be the right behavior because it parallels the inference we're doing with NumPy. But I can't say I feel very confident that there aren't cases where this would still go wrong.

cc @jbrockmendel @WillAyd for any thoughts.
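
A minimal sketch of the conversion suggested above, assuming the public `Series.convert_dtypes` method is an acceptable stand-in for the internal `convert_dtypes` helper named in the comment. The function name `to_backend` is hypothetical, and this is one reading of the proposal rather than the fix pandas adopted.

```python
import numpy as np
import pandas as pd


def to_backend(npvalues: np.ndarray, dtype_backend: str = "numpy_nullable"):
    """Hypothetical helper: re-infer a dtype for aggregated NumPy values
    within the requested dtype backend and return the backing array."""
    converted = pd.Series(npvalues).convert_dtypes(dtype_backend=dtype_backend)
    # .array is the ExtensionArray behind the Series, not a Series itself
    return converted.array


# Group means of a boolean column come back as plain NumPy floats.
npvalues = np.array([1.0, 0.5])
out = to_backend(npvalues)
print(out.dtype)  # Float64
```

One caveat with this route: by default `convert_dtypes` turns whole-number floats into `Int64`, so a result such as `[1.0, 0.0]` would not stay floating unless `convert_integer=False` is passed.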

Member

Not sure what the best alternative would be, but it's unfortunate that we'd have to convert pyarrow boolean types to NumPy.

Member

still reading, but first thought is that instead of comparing to the string "boolean" the check should be isinstance(dtype, BooleanDtype)
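
To illustrate the suggestion above, here is a small sketch that contrasts the string comparison from the diff with a type check against `pd.BooleanDtype`. The variable names are illustrative only.

```python
import pandas as pd

nullable = pd.Series([True, True, None], dtype="boolean")
plain = pd.Series([True, True, False], dtype="bool")

# String comparison, as written in the diff
is_boolean_by_name = nullable.dtype == "boolean"

# Type check, as suggested in the review
is_boolean_by_type = isinstance(nullable.dtype, pd.BooleanDtype)
is_plain_boolean = isinstance(plain.dtype, pd.BooleanDtype)

print(is_boolean_by_name, is_boolean_by_type, is_plain_boolean)  # True True False
```

The type check makes the intent explicit and does not rely on the dtype's string alias.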

Member

OK, looking at the test makes it clear what the underlying issue is, and it's a tough one. maybe_cast_pointwise_result goes through _from_scalars, which it turns out was a mis-feature (xref #56430). What we need is an EA method that does family-preserving inference (i.e. if you start with pyarrow/masked/numpy/sparse, you end up with pyarrow/masked/numpy/sparse).

Shorter-term, it looks like BooleanArray._from_scalars is insufficiently strict; it should raise when it sees floats.
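
The strictness point can be sketched with public API only. The example below assumes that `pd.array(..., dtype="boolean")` accepts floats equal to 0 or 1, which is the leniency described above. `strict_boolean_from_scalars` is a hypothetical helper, not an existing pandas method.

```python
import pandas as pd
from pandas.api.types import is_bool


def strict_boolean_from_scalars(scalars):
    """Hypothetical strict constructor: accept only bools and missing values."""
    for value in scalars:
        if not (is_bool(value) or pd.isna(value)):
            raise TypeError(f"expected bool or NA, got {type(value).__name__}")
    return pd.array(scalars, dtype="boolean")


# The public constructor accepts floats that happen to equal 0 or 1 ...
lenient = pd.array([1.0, 0.0], dtype="boolean")

# ... whereas the strict version rejects them.
try:
    strict_boolean_from_scalars([1.0, 0.0])
    rejected = False
except TypeError:
    rejected = True

print(lenient.dtype, rejected)
```

With a strict constructor, aggregated floats such as `1.0` and `0.0` would no longer be cast back to `boolean`, so the result dtype would not depend on the particular values produced.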

Member

maybe_convert_objects already has a convert_to_nullable_dtype - but only for bool/int. Could this be expanded upon - both for other nullable dtypes as well as pyarrow? It's already quite large and complex, but it seems to me there is an advantage to having a single function to call when you are in the situation of "I have an object NumPy array, and need to determine how to do type inference".

That being said, if the EA method @jbrockmendel is proposing completely supplants maybe_convert_objects in the long term, that sounds like a good route to me.
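
For readers unfamiliar with the option mentioned above, the following sketch calls `maybe_convert_objects` with `convert_to_nullable_dtype=True`. It imports from `pandas._libs.lib`, a private module whose signature may differ between releases, so treat it as an illustration of current behavior as I understand it rather than a supported usage.

```python
import numpy as np
import pandas as pd
from pandas._libs import lib  # private module; may change between releases

bools = np.array([True, None], dtype=object)
ints = np.array([1, None], dtype=object)
floats = np.array([1.5, None], dtype=object)

bool_result = lib.maybe_convert_objects(bools, convert_to_nullable_dtype=True)
int_result = lib.maybe_convert_objects(ints, convert_to_nullable_dtype=True)
float_result = lib.maybe_convert_objects(floats, convert_to_nullable_dtype=True)

# Inspect which container each input ended up in
for name, result in [("bool", bool_result), ("int", int_result), ("float", float_result)]:
    print(name, type(result).__name__, result.dtype)
```

The bool and int inputs come back as masked arrays, while the float input stays a NumPy array. That gap is what the comment proposes to close.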

Member

knee-jerk: that might work in this case, but im wary of introducing even more ways of accomplishing family-retention across the code base. they're bound to accumulate small inconsistencies which will then be a hassle to smooth out

Member

> but im wary of introducing even more ways

No disagreement there. It just seems like your proposed EA method from #58258 (comment) needs to go through all the values, determine what is seen, come up with the output dtype, and then make it so. The first two steps are done in maybe_convert_objects, and I'd be wary of duplicating that logic elsewhere.
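
The four steps listed above (scan the values, record what was seen, choose the output dtype, build the array) can be sketched in pure Python for the masked family. `infer_masked` is a hypothetical name, and the fallback rules are assumptions made for illustration, not pandas' actual inference rules.

```python
import numpy as np
import pandas as pd


def infer_masked(values):
    """Hypothetical sketch: pick a masked (nullable) dtype from scalars.

    Falls back to an object ndarray when no masked dtype fits.
    """
    seen_bool = seen_int = seen_float = seen_other = False

    # Steps 1 and 2: go through the values and record what is seen
    for value in values:
        if value is None or value is pd.NA:
            continue
        if isinstance(value, (bool, np.bool_)):
            seen_bool = True
        elif isinstance(value, (int, np.integer)):
            seen_int = True
        elif isinstance(value, (float, np.floating)):
            seen_float = True
        else:
            seen_other = True

    # Step 3: come up with the output dtype
    if seen_other or (seen_bool and (seen_int or seen_float)):
        return np.asarray(values, dtype=object)
    if seen_bool:
        dtype = "boolean"
    elif seen_float:
        dtype = "Float64"
    elif seen_int:
        dtype = "Int64"
    else:
        # Nothing but missing values: no basis for choosing a dtype
        return np.asarray(values, dtype=object)

    # Step 4: make it so
    return pd.array(values, dtype=dtype)


print(infer_masked([1.0, 0.5]).dtype)    # Float64
print(infer_masked([True, None]).dtype)  # boolean
print(infer_masked([1, None]).dtype)     # Int64
```

In pandas itself the first two steps already live in `maybe_convert_objects`, as the comment points out. The loop here only stands in for that logic.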

Member

i suspect that what you're describing is likely to be the basis of the MaskedArray._whatever implementation.

# we can preserve a little bit more aggressively with EA dtype
# because maybe_cast_pointwise_result will do a try/except
# with _from_sequence. NB we are assuming here that _from_sequence
25 changes: 25 additions & 0 deletions pandas/tests/groupby/aggregate/test_other.py
@@ -666,3 +666,28 @@ def weird_func(x):

result = df["decimals"].groupby(df["id1"]).agg(weird_func)
tm.assert_series_equal(result, expected, check_names=False)


def test_groupby_agg_boolean_dtype():
    # GH Issue #58031
    # Ensure the dtype returned by aggregate is consistent for
    # 'bool' and 'boolean', since 'boolean' is not a NumPy dtype

    df_boolean = DataFrame({"0": [1, 2, 2], "1": [True, True, None]})
    df_boolean[1] = df_boolean["1"].astype("boolean")

    df_bool = DataFrame({"0": [1, 2, 2], "1": [True, True, None]})
    df_bool[1] = df_bool["1"].astype("bool")

    boolean_return_type = (
        df_boolean.groupby("0")
        .aggregate(lambda s: s.fillna(False).mean())
        .dtypes.values[0]
    )
    bool_return_type = (
        df_bool.groupby("0")
        .aggregate(lambda s: s.fillna(False).mean())
        .dtypes.values[0]
    )

    assert boolean_return_type == bool_return_type
Member

When writing tests, always test the full result (using tm.assert_frame_equal here) instead of just one part of it. See the tests above for examples.
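
As a sketch of what testing the full result could look like, the example below builds an explicit expected frame and compares it with `tm.assert_frame_equal`. The test name is hypothetical, and only the NumPy `bool` case is asserted. The expected frame for the nullable `boolean` case depends on how the issue is resolved, so it is left out.

```python
import pandas as pd
import pandas._testing as tm


def test_groupby_agg_bool_full_result():
    # Hypothetical test name; only the NumPy "bool" case is asserted here.
    df = pd.DataFrame({"0": [1, 2, 2], "1": [True, True, False]})

    result = df.groupby("0").aggregate(lambda s: s.mean())

    expected = pd.DataFrame(
        {"1": [1.0, 0.5]},
        index=pd.Index([1, 2], name="0"),
    )
    # Compares values, dtypes, index, and column labels in one call
    tm.assert_frame_equal(result, expected)


test_groupby_agg_bool_full_result()
```

Comparing whole frames catches regressions in the index, column labels, and values that a dtype-only assertion would miss.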