-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Support mask in GroupBy.cumsum #48070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support mask in GroupBy.cumsum #48070
Conversation
# Conflicts: # pandas/tests/groupby/test_groupby.py
pandas/_libs/groupby.pyx
Outdated
@@ -207,15 +207,24 @@ def group_cumprod_float64( | |||
break | |||
|
|||
|
|||
ctypedef fused cumsum_t: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the point of switching to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping precision for Int dtypes with missing values. Currently this is cast to float, which loses the value for large integers
also should help performance for ea arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this overflow pretty easily right now with int8 dtypes etc, this is also fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be misreading but this is just a subset of the types that were previously used in numeric_t
right? I'm a bit confused how restricting the types helps with performance / precision. Generally don't think we should be creating types specific to each algorithm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two things here:
-
When using int8 as a dtype, we can easily get overflows, because the type is not adjusted, for example a group with
[111, 111]
would overflow, so casting toint64
beforehand avoids this. This is the bugfix -
Secondly, currently ea dtypes like
Int64
are cast to float before callinggroup_cumsum
, which is losing precision for high integers that did not fit intofloat64
. Additionally, using the mask improves performance for extension array dtypes.
I don't want to create specific dtypes per function. I want to keep the mask support for every function in separate prs. When more and more get merged, I will be able to combine the types and then we will be able to use these types for more functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the type to make it clear that I intend to reuse it
@@ -261,23 +280,41 @@ def group_cumsum( | |||
for j in range(K): | |||
val = values[i, j] | |||
|
|||
isna_entry = _treat_as_na(val, is_datetimelike) | |||
if uses_mask: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than continue if uses_mask:
checks we make that the outermost branch? Might help readability to keep the logic in two different branches rather than continued checks within one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that, but imo that reduces readability. The uses_mask is a simple if_else branch, if we move this outside, it is hard to see that the actual logic of the algorithm is the same in both branches (and we have to keep it consistent over time).
# Conflicts: # pandas/core/groupby/ops.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
doc/source/whatsnew/v1.5.0.rst
Outdated
- Bug in :meth:`.GroupBy.cumsum` with ``timedelta64[ns]`` dtype failing to recognize ``NaT`` as a null value (:issue:`46216`) | ||
- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Bug in :meth:`GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`) | |
- Bug in :meth:`.GroupBy.cumsum` with integer dtypes causing overflows when sum was bigger than maximum of dtype (:issue:`37493`) |
doc/source/whatsnew/v1.5.0.rst
Outdated
@@ -1078,8 +1078,9 @@ Groupby/resample/rolling | |||
- Bug when using ``engine="numba"`` would return the same jitted function when modifying ``engine_kwargs`` (:issue:`46086`) | |||
- Bug in :meth:`.DataFrameGroupBy.transform` fails when ``axis=1`` and ``func`` is ``"first"`` or ``"last"`` (:issue:`45986`) | |||
- Bug in :meth:`DataFrameGroupBy.cumsum` with ``skipna=False`` giving incorrect results (:issue:`46216`) | |||
- Bug in :meth:`GroupBy.sum` with integer dtypes losing precision (:issue:`37493`) | |||
- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Bug in :meth:`GroupBy.sum` and :meth:`GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`) | |
- Bug in :meth:`.GroupBy.sum` and :meth:`.GroupBy.cumsum` with integer dtypes losing precision (:issue:`37493`) |
(the leading dot is kind of a wildcard * so that sphinx looks in the full nested pandas namespace for a match (and not only in the top-level namespace), to avoid writing it out as pandas.core.groupby.GroupBy.cumsum)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx, I've also changed the references from the other pro that were already merged
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.This is a bit too much if else imo, but I would like to use this as a starting point and try refactoring a bit afterwards. This helps with reviewing and also makes it easier to remember every case
cc @jorisvandenbossche