ENH: Support mask in GroupBy.cumsum #48070


Merged: 12 commits, Aug 18, 2022
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v1.5.0.rst
@@ -1070,8 +1070,9 @@ Groupby/resample/rolling
- Bug when using ``engine="numba"`` would return the same jitted function when modifying ``engine_kwargs`` (:issue:`46086`)
- Bug in :meth:`.DataFrameGroupBy.transform` fails when ``axis=1`` and ``func`` is ``"first"`` or ``"last"`` (:issue:`45986`)
- Bug in :meth:`DataFrameGroupBy.cumsum` with ``skipna=False`` giving incorrect results (:issue:`46216`)
- Bug in :meth:`GroupBy.sum` with integer dtypes losing precision (:issue:`37493`)
- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)
Member suggested a change:
- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)
- Bug in :meth:`.GroupBy.sum` and :meth:`.GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)

(The leading dot acts as a wildcard, so that Sphinx looks for a match in the full nested pandas namespace rather than only the top-level namespace; this avoids having to write it out as pandas.core.groupby.GroupBy.cumsum.)

Member Author replied:

Thanks, I've also changed the references from the other PRs that were already merged.

- Bug in :meth:`.GroupBy.cumsum` with ``timedelta64[ns]`` dtype failing to recognize ``NaT`` as a null value (:issue:`46216`)
- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)
Member suggested a change:
- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)
- Bug in :meth:`.GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)

- Bug in :meth:`.GroupBy.cummin` and :meth:`.GroupBy.cummax` with nullable dtypes incorrectly altering the original data in place (:issue:`46220`)
- Bug in :meth:`DataFrame.groupby` raising error when ``None`` is in first level of :class:`MultiIndex` (:issue:`47348`)
- Bug in :meth:`.GroupBy.cummax` with ``int64`` dtype with leading value being the smallest possible int64 (:issue:`46382`)
6 changes: 4 additions & 2 deletions pandas/_libs/groupby.pyi
@@ -20,12 +20,14 @@ def group_cumprod_float64(
skipna: bool = ...,
) -> None: ...
def group_cumsum(
out: np.ndarray, # numeric[:, ::1]
values: np.ndarray, # ndarray[numeric, ndim=2]
out: np.ndarray, # cumsum_t[:, ::1]
values: np.ndarray, # ndarray[cumsum_t, ndim=2]
labels: np.ndarray, # const int64_t[:]
ngroups: int,
is_datetimelike: bool,
skipna: bool = ...,
mask: np.ndarray | None = ...,
result_mask: np.ndarray | None = ...,
) -> None: ...
def group_shift_indexer(
out: np.ndarray, # int64_t[::1]
61 changes: 49 additions & 12 deletions pandas/_libs/groupby.pyx
@@ -207,15 +207,24 @@ def group_cumprod_float64(
break


ctypedef fused cumsum_t:
Member commented:

What is the point of switching to this?

Member Author (@phofl, Aug 14, 2022) replied:

Keeping precision for Int dtypes with missing values. Currently these are cast to float, which loses the value for large integers.

It should also help performance for EA (extension array) dtypes.

Member Author added:

Also, this overflows pretty easily right now with int8 dtypes etc.; that is fixed here as well.

Member replied:

Might be misreading, but this is just a subset of the types that were previously covered by numeric_t, right? I'm a bit confused how restricting the types helps with performance / precision. Generally I don't think we should be creating types specific to each algorithm.

Member Author (@phofl, Aug 14, 2022) replied:

There are two things here:

  • When using int8 as a dtype, we can easily get overflows because the accumulation type is not adjusted; for example, a group with [111, 111] would overflow. Casting to int64 beforehand avoids this. This is the bugfix.

  • Secondly, EA dtypes like Int64 are currently cast to float before calling group_cumsum, which loses precision for large integers that do not fit into float64. Additionally, using the mask improves performance for extension array dtypes.

I don't want to create specific dtypes per function; I want to add mask support for each function in separate PRs. As more of them get merged, I will be able to combine the types so they can be shared across functions.
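The int8 overflow described in the first bullet can be sketched with plain NumPy (a minimal illustration, not the PR's Cython code):

```python
import numpy as np

vals = np.array([111, 111], dtype=np.int8)

# Keeping the accumulation in int8 wraps around: 111 + 111 = 222,
# which exceeds the int8 maximum of 127 and wraps to -34.
assert vals.cumsum(dtype=np.int8).tolist() == [111, -34]

# Accumulating in int64 instead, as the fix does before calling
# group_cumsum, keeps the exact sum.
assert vals.cumsum(dtype=np.int64).tolist() == [111, 222]
```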

Member Author added:

Renamed the type to make it clear that I intend to reuse it.

int64_t
uint64_t
float32_t
float64_t


@cython.boundscheck(False)
@cython.wraparound(False)
def group_cumsum(
numeric_t[:, ::1] out,
ndarray[numeric_t, ndim=2] values,
cumsum_t[:, ::1] out,
ndarray[cumsum_t, ndim=2] values,
const intp_t[::1] labels,
int ngroups,
bint is_datetimelike,
bint skipna=True,
const uint8_t[:, :] mask=None,
uint8_t[:, ::1] result_mask=None,
) -> None:
"""
Cumulative sum of columns of `values`, in row groups `labels`.
@@ -234,23 +243,33 @@
True if `values` contains datetime-like entries.
skipna : bool
If true, ignore nans in `values`.
mask : np.ndarray[uint8], optional
Mask of values
result_mask : np.ndarray[uint8], optional
Mask of out array

Notes
-----
This method modifies the `out` parameter, rather than returning an object.
"""
cdef:
Py_ssize_t i, j, N, K, size
numeric_t val, y, t, na_val
numeric_t[:, ::1] accum, compensation
cumsum_t val, y, t, na_val
cumsum_t[:, ::1] accum, compensation
uint8_t[:, ::1] accum_mask
intp_t lab
bint isna_entry, isna_prev = False
bint uses_mask = mask is not None

N, K = (<object>values).shape

if uses_mask:
accum_mask = np.zeros((ngroups, K), dtype="uint8")

accum = np.zeros((ngroups, K), dtype=np.asarray(values).dtype)
compensation = np.zeros((ngroups, K), dtype=np.asarray(values).dtype)

na_val = _get_na_val(<numeric_t>0, is_datetimelike)
na_val = _get_na_val(<cumsum_t>0, is_datetimelike)

with nogil:
for i in range(N):
@@ -261,23 +280,41 @@
for j in range(K):
val = values[i, j]

isna_entry = _treat_as_na(val, is_datetimelike)
if uses_mask:
Member commented:

Rather than repeated `if uses_mask:` checks, could we make that the outermost branch? It might help readability to keep the logic in two different branches rather than continued checks within one.

Member Author replied:

I tried that, but IMO that reduces readability. The uses_mask check is a simple if/else branch; if we move it outside, it is hard to see that the actual logic of the algorithm is the same in both branches (and we would have to keep it consistent over time).

isna_entry = mask[i, j]
else:
isna_entry = _treat_as_na(val, is_datetimelike)

if not skipna:
isna_prev = _treat_as_na(accum[lab, j], is_datetimelike)
if uses_mask:
isna_prev = accum_mask[lab, j]
else:
isna_prev = _treat_as_na(accum[lab, j], is_datetimelike)

if isna_prev:
out[i, j] = na_val
if uses_mask:
result_mask[i, j] = True
else:
out[i, j] = na_val
continue

if isna_entry:
out[i, j] = na_val

if uses_mask:
result_mask[i, j] = True
else:
out[i, j] = na_val

if not skipna:
accum[lab, j] = na_val
if uses_mask:
accum_mask[lab, j] = True
else:
accum[lab, j] = na_val

else:
# For floats, use Kahan summation to reduce floating-point
# error (https://en.wikipedia.org/wiki/Kahan_summation_algorithm)
if numeric_t == float32_t or numeric_t == float64_t:
if cumsum_t == float32_t or cumsum_t == float64_t:
y = val - compensation[lab, j]
t = accum[lab, j] + y
compensation[lab, j] = t - accum[lab, j] - y
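The Kahan compensation in the float branch can be sketched in pure Python (a simplified 1-D illustration with a hypothetical helper name; the real code works on 2-D per-group accumulators):

```python
def kahan_cumsum(xs):
    """Cumulative sum with Kahan compensation, mirroring the float
    branch above: `comp` tracks the low-order bits lost at each step."""
    total = 0.0
    comp = 0.0
    out = []
    for x in xs:
        y = x - comp              # apply the stored compensation
        t = total + y             # naive add, may lose low bits of y
        comp = (t - total) - y    # recover what was just lost
        total = t
        out.append(total)
    return out

# Naive accumulation of ten 0.1's drifts away from 1.0.
naive = 0.0
for x in [0.1] * 10:
    naive += x

# Kahan keeps the accumulated error at or below the naive loop's error.
assert abs(kahan_cumsum([0.1] * 10)[-1] - 1.0) <= abs(naive - 1.0)
```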
@@ -1041,7 +1078,7 @@ cdef numeric_t _get_na_val(numeric_t val, bint is_datetimelike):
elif numeric_t is int64_t and is_datetimelike:
na_val = NPY_NAT
else:
# Will not be used, but define to avoid uninitialized warning.
# Used in case of masks
na_val = 0
return na_val
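The masked branch added in this diff can be summarized in a pure-Python 1-D sketch (a simplified illustration that ignores Kahan compensation and the 2-D layout; the function name and signature here are hypothetical):

```python
import numpy as np

def group_cumsum_masked(values, labels, ngroups, mask, skipna=True):
    """1-D sketch of group_cumsum's masked branch: NA positions live in
    boolean masks instead of being encoded as a sentinel value."""
    out = np.zeros(len(values), dtype=np.int64)
    result_mask = np.zeros(len(values), dtype=bool)
    accum = np.zeros(ngroups, dtype=np.int64)
    accum_mask = np.zeros(ngroups, dtype=bool)
    for i, lab in enumerate(labels):
        if lab < 0:  # row does not belong to any group
            continue
        if not skipna and accum_mask[lab]:
            # the group already hit an NA: everything afterwards is NA
            result_mask[i] = True
            continue
        if mask[i]:  # this entry is NA
            result_mask[i] = True
            if not skipna:
                accum_mask[lab] = True
        else:
            accum[lab] += values[i]
            out[i] = accum[lab]
    return out, result_mask
```

With values [1, NA, 2] in a single group this yields cumulative sums [1, NA, 3] for skipna=True and [1, NA, NA] for skipna=False, matching the tests below.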

3 changes: 2 additions & 1 deletion pandas/core/groupby/ops.py
@@ -157,6 +157,7 @@ def __init__(self, kind: str, how: str, has_dropped_na: bool) -> None:
"first",
"rank",
"sum",
"cumsum",
}

_cython_arity = {"ohlc": 4} # OHLC
@@ -225,7 +226,7 @@ def _get_cython_vals(self, values: np.ndarray) -> np.ndarray:
# result may still include NaN, so we have to cast
values = ensure_float64(values)

elif how == "sum":
elif how in ["sum", "cumsum"]:
# Avoid overflow during group op
if values.dtype.kind == "i":
values = ensure_int64(values)
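The precision issue that this int64 promotion (rather than a float cast) sidesteps can be seen directly; a minimal check using the large value from the PR's tests:

```python
import numpy as np

val = 922337203685477580  # large int64 value from the PR's tests

# float64 has a 53-bit significand, so round-tripping this ~60-bit
# value through float changes it.
assert int(np.float64(val)) != val

# int64 represents it exactly.
assert int(np.int64(val)) == val
```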
25 changes: 22 additions & 3 deletions pandas/tests/groupby/test_groupby.py
@@ -2810,12 +2810,15 @@ def test_single_element_list_grouping():
values, _ = next(iter(df.groupby(["a"])))


def test_groupby_sum_avoid_casting_to_float():
@pytest.mark.parametrize("func", ["sum", "cumsum"])
def test_groupby_sum_avoid_casting_to_float(func):
# GH#37493
val = 922337203685477580
df = DataFrame({"a": 1, "b": [val]})
result = df.groupby("a").sum() - val
result = getattr(df.groupby("a"), func)() - val
expected = DataFrame({"b": [0]}, index=Index([1], name="a"))
if func == "cumsum":
expected = expected.reset_index(drop=True)
tm.assert_frame_equal(result, expected)


@@ -2832,7 +2835,7 @@ def test_groupby_sum_support_mask(any_numeric_ea_dtype):


@pytest.mark.parametrize("val, dtype", [(111, "int"), (222, "uint")])
def test_groupby_sum_overflow(val, dtype):
def test_groupby_overflow(val, dtype):
# GH#37493
df = DataFrame({"a": 1, "b": [val, val]}, dtype=f"{dtype}8")
result = df.groupby("a").sum()
@@ -2842,3 +2845,19 @@ def test_groupby_sum_overflow(val, dtype):
dtype=f"{dtype}64",
)
tm.assert_frame_equal(result, expected)

result = df.groupby("a").cumsum()
expected = DataFrame({"b": [val, val * 2]}, dtype=f"{dtype}64")
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("skipna, val", [(True, 3), (False, pd.NA)])
def test_groupby_cumsum_mask(any_numeric_ea_dtype, skipna, val):
# GH#37493
df = DataFrame({"a": 1, "b": [1, pd.NA, 2]}, dtype=any_numeric_ea_dtype)
result = df.groupby("a").cumsum(skipna=skipna)
expected = DataFrame(
{"b": [1, pd.NA, val]},
dtype=any_numeric_ea_dtype,
)
tm.assert_frame_equal(result, expected)
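The behavior these tests exercise can be shown in a short session (assuming pandas >= 1.5, where this PR landed):

```python
import pandas as pd

df = pd.DataFrame({"a": 1, "b": pd.array([1, pd.NA, 2], dtype="Int64")})

# With mask support, cumsum keeps the nullable Int64 dtype instead of
# casting to float64, and NA stays NA.
res = df.groupby("a").cumsum()
assert str(res["b"].dtype) == "Int64"
assert res["b"].isna().tolist() == [False, True, False]
assert res["b"].iloc[2] == 3

# skipna=False propagates NA to all later rows of the group.
res_na = df.groupby("a").cumsum(skipna=False)
assert res_na["b"].isna().tolist() == [False, True, True]
```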
5 changes: 3 additions & 2 deletions pandas/tests/groupby/test_libgroupby.py
@@ -183,9 +183,10 @@ def _check_cython_group_transform_cumulative(pd_op, np_op, dtype):
tm.assert_numpy_array_equal(np_op(data), answer[:, 0], check_dtype=False)


def test_cython_group_transform_cumsum(any_real_numpy_dtype):
@pytest.mark.parametrize("np_dtype", ["int64", "uint64", "float32", "float64"])
def test_cython_group_transform_cumsum(np_dtype):
# see gh-4095
dtype = np.dtype(any_real_numpy_dtype).type
dtype = np.dtype(np_dtype).type
pd_op, np_op = group_cumsum, np.cumsum
_check_cython_group_transform_cumulative(pd_op, np_op, dtype)
