ENH: Support mask in GroupBy.cumsum #48070

phofl · 2022-08-13T14:52:58Z

xref ENH: support masked arrays in groupby cython algos #37493
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This is a bit too much if/else imo, but I would like to use this as a starting point and try refactoring a bit afterwards. Keeping the cases explicit helps with reviewing and also makes it easier to remember every case.

cc @jorisvandenbossche

WillAyd · 2022-08-14T00:20:12Z

pandas/_libs/groupby.pyx

@@ -207,15 +207,24 @@ def group_cumprod_float64(
                        break


+ctypedef fused cumsum_t:


What is the point of switching to this?

Keeping precision for Int dtypes with missing values. Currently this is cast to float, which loses the value for large integers.

It should also help performance for EA (extension array) dtypes.

Also, this overflows pretty easily right now with int8 dtypes etc.; this is fixed as well.

Might be misreading, but this is just a subset of the types that were previously used in numeric_t, right? I'm a bit confused how restricting the types helps with performance / precision. Generally I don't think we should be creating types specific to each algorithm.

There are two things here:

When using int8 as a dtype, we can easily get overflows because the dtype is not widened. For example, a group with [111, 111] would overflow (111 + 111 = 222, which is above the int8 maximum of 127), so casting to int64 beforehand avoids this. This is the bugfix.
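To make the overflow concrete, here is a minimal NumPy sketch of the [111, 111] case. It only illustrates the wraparound and the effect of widening to int64; it is not the pandas code path, and the variable names are made up for the example.

```python
import numpy as np

values = np.array([111, 111], dtype=np.int8)

# Accumulating in int8 wraps around silently:
# 111 + 111 = 222 does not fit into the int8 range [-128, 127].
wrapped = np.cumsum(values, dtype=np.int8)

# Casting to int64 before accumulating keeps the true running total.
widened = np.cumsum(values.astype(np.int64))

print(wrapped.tolist())  # [111, -34]
print(widened.tolist())  # [111, 222]
```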

Secondly, EA dtypes like Int64 are currently cast to float before calling group_cumsum, which loses precision for large integers that cannot be represented exactly in float64. Additionally, using the mask improves performance for extension array dtypes.
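A small sketch of the precision point, assuming nothing beyond NumPy: an integer just above 2**53 changes when it goes through float64, while an int64 array paired with a separate boolean mask keeps the exact value. The `values`/`mask` pair here is only an illustration of the masked layout, not the internal pandas representation.

```python
import numpy as np

big = 2**53 + 1  # smallest positive integer that float64 cannot represent exactly

# Round-tripping through float64 silently changes the value.
via_float = int(np.float64(big))

# Keeping int64 values plus a boolean mask preserves the exact integers.
values = np.array([big, 0, 1], dtype=np.int64)
mask = np.array([False, True, False])  # True marks a missing entry

# Masked entries contribute nothing to the running total; the mask itself
# would be carried along separately to mark the missing positions.
running = np.cumsum(np.where(mask, 0, values))

print(via_float == big)  # False
print(running.tolist())
```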

I don't want to create specific types per function. I want to keep the mask support for every function in separate PRs. As more of them get merged, I will be able to combine the types, and then we will be able to use these types for more functions.

Renamed the type to make it clear that I intend to reuse it

WillAyd · 2022-08-14T00:25:08Z

pandas/_libs/groupby.pyx

@@ -261,23 +280,41 @@ def group_cumsum(
            for j in range(K):
                val = values[i, j]

-                isna_entry = _treat_as_na(val, is_datetimelike)
+                if uses_mask:


Rather than repeated `if uses_mask:` checks, could we make that the outermost branch? It might help readability to keep the logic in two different branches rather than continued checks within one.

I tried that, but imo that reduces readability. The uses_mask check is a simple if/else branch; if we move it outside, it is hard to see that the actual logic of the algorithm is the same in both branches (and we would have to keep the two copies consistent over time).
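As a rough illustration of the inner-branch layout being defended here, the following is a hypothetical pure-Python sketch for 1-D int64 input. `group_cumsum_sketch` and its signature are invented for this example and are not the Cython kernel in pandas/_libs/groupby.pyx; it also assumes all labels are non-negative. Only the NA check differs between the masked and unmasked paths, and the accumulation logic is written once.

```python
import numpy as np


def group_cumsum_sketch(values, labels, ngroups, mask=None, skipna=True):
    """Hypothetical sketch of a masked group cumsum (not the pandas kernel)."""
    uses_mask = mask is not None
    out = np.zeros(len(values), dtype=np.int64)
    out_mask = np.zeros(len(values), dtype=bool)
    accum = np.zeros(ngroups, dtype=np.int64)
    # Tracks groups that already hit a missing value when skipna=False.
    accum_na = np.zeros(ngroups, dtype=bool)

    for i, lab in enumerate(labels):
        # The only place where the two paths differ.
        if uses_mask:
            isna_entry = bool(mask[i])
        else:
            isna_entry = False  # plain int64 has no NA sentinel in this sketch

        # Shared logic from here on.
        if accum_na[lab] or isna_entry:
            out_mask[i] = True
            if not skipna:
                accum_na[lab] = True
            continue
        accum[lab] += values[i]
        out[i] = accum[lab]
    return out, out_mask


values = np.array([1, 2, 3, 4], dtype=np.int64)
labels = np.array([0, 0, 1, 0])
mask = np.array([False, True, False, False])

out, out_mask = group_cumsum_sketch(values, labels, 2, mask)
out_noskip, out_mask_noskip = group_cumsum_sketch(values, labels, 2, mask, skipna=False)
```

With `skipna=True` the masked row is skipped and the group total continues afterwards; with `skipna=False` every later row in that group is marked missing as well.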

jorisvandenbossche

Looks good!

jorisvandenbossche · 2022-08-18T07:25:16Z

doc/source/whatsnew/v1.5.0.rst

 - Bug in :meth:`.GroupBy.cumsum` with ``timedelta64[ns]`` dtype failing to recognize ``NaT`` as a null value (:issue:`46216`)
+- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)


Suggested change

- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)

- Bug in :meth:`.GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)

jorisvandenbossche · 2022-08-18T07:26:23Z

doc/source/whatsnew/v1.5.0.rst

@@ -1078,8 +1078,9 @@ Groupby/resample/rolling
 - Bug when using ``engine="numba"`` would return the same jitted function when modifying ``engine_kwargs`` (:issue:`46086`)
 - Bug in :meth:`.DataFrameGroupBy.transform` fails when ``axis=1`` and ``func`` is ``"first"`` or ``"last"`` (:issue:`45986`)
 - Bug in :meth:`DataFrameGroupBy.cumsum` with ``skipna=False`` giving incorrect results (:issue:`46216`)
- Bug in :meth:`GroupBy.sum` with integer dtypes losing precision (:issue:`37493`)
+- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)


Suggested change

- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)

- Bug in :meth:`.GroupBy.sum` and :meth:`.GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)

(the leading dot is kind of a wildcard * so that sphinx looks in the full nested pandas namespace for a match (and not only in the top-level namespace), to avoid writing it out as pandas.core.groupby.GroupBy.cumsum)

Thx, I've also changed the references from the other PRs that were already merged.

phofl added 4 commits August 12, 2022 20:04

ENH: Support mask in GroupBy.cumsum

0ca2233

Merge remote-tracking branch 'upstream/main' into groupby_cumsum_mask

111a55b

# Conflicts: # pandas/tests/groupby/test_groupby.py

Change compiling

0fa1f3a

Change types

d9b10f1

phofl added the Bug, Groupby, and NA - MaskedArrays (related to pd.NA and nullable extension arrays) labels Aug 13, 2022

WillAyd reviewed Aug 14, 2022

phofl added 7 commits August 14, 2022 11:40

Merge remote-tracking branch 'upstream/main' into groupby_cumsum_mask

0f09ea8

Rename type

a7d918c

Merge remote-tracking branch 'upstream/main' into groupby_cumsum_mask

df4d4f6

# Conflicts: # pandas/core/groupby/ops.py

Remove duplicated type

3ee7854

Fix annotation

ff83f13

Merge remote-tracking branch 'upstream/main' into groupby_cumsum_mask

9deceab

Add initialize

bcb5550

jorisvandenbossche approved these changes Aug 18, 2022

Fix groupby references

5609150

mroeschke added this to the 1.5 milestone Aug 18, 2022

mroeschke approved these changes Aug 18, 2022

jorisvandenbossche merged commit c19a4ad into pandas-dev:main Aug 18, 2022

phofl deleted the groupby_cumsum_mask branch August 18, 2022 20:29

jorisvandenbossche mentioned this pull request Sep 26, 2022

ENH: support masked arrays in groupby cython algos #37493

Closed

10 tasks

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022

ENH: Support mask in GroupBy.cumsum (pandas-dev#48070)

dc1cd51
