Skip to content

ENH: Support mask in GroupBy.cumsum #48070

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Aug 18, 2022

Conversation

phofl
Copy link
Member

@phofl phofl commented Aug 13, 2022

This is a bit too much if else imo, but I would like to use this as a starting point and try refactoring a bit afterwards. This helps with reviewing and also makes it easier to remember every case

cc @jorisvandenbossche

@phofl phofl added Bug Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Aug 13, 2022
@@ -207,15 +207,24 @@ def group_cumprod_float64(
break


ctypedef fused cumsum_t:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of switching to this?

Copy link
Member Author

@phofl phofl Aug 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping precision for Int dtypes with missing values. Currently this is cast to float, which loses the value for large integers

also should help performance for ea arrays

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this overflow pretty easily right now with int8 dtypes etc, this is also fixed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be misreading but this is just a subset of the types that were previously used in numeric_t right? I'm a bit confused how restricting the types helps with performance / precision. Generally don't think we should be creating types specific to each algorithm

Copy link
Member Author

@phofl phofl Aug 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two things here:

  • When using int8 as a dtype, we can easily get overflows, because the type is not adjusted, for example a group with [111, 111] would overflow, so casting to int64 beforehand avoids this. This is the bugfix

  • Secondly, currently ea dtypes like Int64 are cast to float before calling group_cumsum, which is losing precision for high integers that did not fit into float64. Additionally, using the mask improves performance for extension array dtypes.

I don't want to create specific dtypes per function. I want to keep the mask support for every function in separate prs. When more and more get merged, I will be able to combine the types and then we will be able to use these types for more functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the type to make it clear that I intend to reuse it

@@ -261,23 +280,41 @@ def group_cumsum(
for j in range(K):
val = values[i, j]

isna_entry = _treat_as_na(val, is_datetimelike)
if uses_mask:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than continue if uses_mask: checks we make that the outermost branch? Might help readability to keep the logic in two different branches rather than continued checks within one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that, but imo that reduces readability. The uses_mask is a simple if_else branch, if we move this outside, it is hard to see that the actual logic of the algorithm is the same in both branches (and we have to keep it consistent over time).

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

- Bug in :meth:`.GroupBy.cumsum` with ``timedelta64[ns]`` dtype failing to recognize ``NaT`` as a null value (:issue:`46216`)
- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)
- Bug in :meth:`.GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`)

@@ -1078,8 +1078,9 @@ Groupby/resample/rolling
- Bug when using ``engine="numba"`` would return the same jitted function when modifying ``engine_kwargs`` (:issue:`46086`)
- Bug in :meth:`.DataFrameGroupBy.transform` fails when ``axis=1`` and ``func`` is ``"first"`` or ``"last"`` (:issue:`45986`)
- Bug in :meth:`DataFrameGroupBy.cumsum` with ``skipna=False`` giving incorrect results (:issue:`46216`)
- Bug in :meth:`GroupBy.sum` with integer dtypes losing precision (:issue:`37493`)
- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)
- Bug in :meth:`.GroupBy.sum` and :meth:`.GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`)

(the leading dot is kind of a wildcard * so that sphinx looks in the full nested pandas namespace for a match (and not only in the top-level namespace), to avoid writing it out as pandas.core.groupby.GroupBy.cumsum)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx, I've also changed the references from the other pro that were already merged

@mroeschke mroeschke added this to the 1.5 milestone Aug 18, 2022
@jorisvandenbossche jorisvandenbossche merged commit c19a4ad into pandas-dev:main Aug 18, 2022
@phofl phofl deleted the groupby_cumsum_mask branch August 18, 2022 20:29
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants