Skip to content

BUG: Fix SeriesGroupBy.quantile for nullable integers #33138

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
Apr 7, 2020
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.1.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -421,7 +421,7 @@ Groupby/resample/rolling
- Bug in :meth:`GroupBy.apply` raises ``ValueError`` when the ``by`` axis is not sorted and has duplicates and the applied ``func`` does not mutate passed in objects (:issue:`30667`)
- Bug in :meth:`DataFrameGroupby.transform` produces incorrect result with transformation functions (:issue:`30918`)
- Bug in :meth:`DataFrame.groupby` and :meth:`Series.groupby` produces inconsistent type when aggregating Boolean series (:issue:`32894`)

- Bug in :meth:`SeriesGroupBy.quantile` raising on nullable integers (:issue:`33136`)

Reshaping
^^^^^^^^^
Expand Down
5 changes: 5 additions & 0 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ class providing the base-class of operations.
from pandas.core.dtypes.common import (
ensure_float,
is_datetime64_dtype,
is_extension_array_dtype,
is_integer_dtype,
is_numeric_dtype,
is_object_dtype,
Expand Down Expand Up @@ -2241,6 +2242,10 @@ def _get_cythonized_result(
for idx, obj in enumerate(self._iterate_slices()):
name = obj.name
values = obj._values
if is_extension_array_dtype(values.dtype) and is_integer_dtype(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a more generic way to handle this? @jbrockmendel

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, agree this seems like an overly specific check that probably isn't catching other bugs. I tried just casting any ExtensionArray but got bunches of test failures (I think they had to do with datetimes).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this definitely seems not-great. i think to do this right we need some EA method that gives the ndarray to use when we get here, and maybe a way to round-trip the results if the method is dtype-preserving.

Am I right in thinking that dt64tz (and maybe period?) get cast to i8 somewhere around here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this again I think it might make sense to update the pre_processor function: https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L1857 . Does that look right to you @jbrockmendel ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that seems plausible. im hoping there's a way to do this with values_for_argsort that can be shared by EAs, and by other order-based methods

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would that handle NAs in a way that would still give the right quantiles?

values.dtype
):
values = values.to_numpy(dtype=cython_dtype, na_value=np.nan)

if aggregate:
result_sz = ngroups
Expand Down
22 changes: 22 additions & 0 deletions pandas/tests/groupby/test_apply.py
Original file line number Diff line number Diff line change
Expand Up @@ -710,6 +710,28 @@ def test_apply_with_mixed_types():
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("q", [0.5, [0.25, 0.5, 0.75]])
def test_groupby_quantile_nullable_integer(q):
# https://github.com/pandas-dev/pandas/issues/33136
values = np.arange(100, dtype=float)
values[::3] = np.nan

df = pd.DataFrame(
{"a": ["x"] * 100 + ["y"] * 100, "b": pd.array(list(values) * 2, dtype="Int64")}
)
result = df.groupby("a")["b"].quantile(q)

if isinstance(q, list):
idx = pd.MultiIndex.from_product((["x", "y"], q), names=["a", None])
true_quantiles = [25.25, 49.5, 73.75]
else:
idx = pd.Index(["x", "y"], name="a")
true_quantiles = [49.5]

expected = pd.Series(true_quantiles * 2, index=idx, name="b")
tm.assert_series_equal(result, expected)


def test_func_returns_object():
# GH 28652
df = DataFrame({"a": [1, 2]}, index=pd.Int64Index([1, 2]))
Expand Down