PERF/BUG: use masked algo in groupby cummin and cummax #40651
File: groupby.pyx

@@ -1255,9 +1255,11 @@ def group_min(groupby_t[:, ::1] out,
 @cython.wraparound(False)
 def group_cummin_max(groupby_t[:, ::1] out,
                      ndarray[groupby_t, ndim=2] values,
+                     uint8_t[:, ::1] mask,
                      const intp_t[:] labels,
                      int ngroups,
                      bint is_datetimelike,
+                     bint use_mask,
                      bint compute_max):
     """
     Cumulative minimum/maximum of columns of `values`, in row groups `labels`.
@@ -1268,12 +1270,19 @@ def group_cummin_max(groupby_t[:, ::1] out,
         Array to store cummin/max in.
     values : array
         Values to take cummin/max of.
+    mask : array[uint8_t]
+        If `use_mask`, then indices represent missing values,
+        otherwise will be passed as a zeroed array
     labels : np.ndarray[np.intp]
         Labels to group by.
     ngroups : int
         Number of groups, larger than all entries of `labels`.
     is_datetimelike : bool
         True if `values` contains datetime-like entries.
+    use_mask : bool
+        True if the mask should be used (otherwise we continue
+        as if it is not a masked algorithm). Avoids the cost
+        of checking for a completely zeroed mask.
     compute_max : bool
         True if cumulative maximum should be computed, False
         if cumulative minimum should be computed

Review thread on `use_mask`:
- is this actually worth it? we don't do this anywhere else. can you show a head-to-head where you pass this always as True (vs your changed method)?
- Another option is to allow …
- Good call, definitely a cleaner solution. Addressing your comment @jreback, will take another look at the perf diff with forced mask usage. IIRC the main reason for the optional mask was a slowdown in …
- Doesn't seem like as much of a slowdown as feared; ASVs with forced mask usage (from 09b73fc) …
- I think this forced-mask solution is cleaner because, along with the simpler cython algo, it would eventually allow cleaning up some of the kludge around NaT/datetimelike, which can lose information on edge cases with forced casts to float (see pandas/core/groupby/ops.py, lines 673 to 677 at 4554635).
@@ -1287,6 +1296,7 @@ def group_cummin_max(groupby_t[:, ::1] out,
         groupby_t val, mval
         ndarray[groupby_t, ndim=2] accum
         intp_t lab
+        bint val_is_nan

     N, K = (<object>values).shape
     accum = np.empty((ngroups, K), dtype=np.asarray(values).dtype)
@@ -1304,11 +1314,29 @@ def group_cummin_max(groupby_t[:, ::1] out, | |||||||||||
if lab < 0: | ||||||||||||
continue | ||||||||||||
for j in range(K): | ||||||||||||
val = values[i, j] | ||||||||||||
val_is_nan = False | ||||||||||||
|
||||||||||||
if use_mask: | ||||||||||||
if mask[i, j]: | ||||||||||||
|
||||||||||||
# `out` does not need to be set since it | ||||||||||||
# will be masked anyway | ||||||||||||
val_is_nan = True | ||||||||||||
else: | ||||||||||||
|
||||||||||||
# If using the mask, we can avoid grabbing the | ||||||||||||
# value unless necessary | ||||||||||||
val = values[i, j] | ||||||||||||
|
||||||||||||
if _treat_as_na(val, is_datetimelike): | ||||||||||||
out[i, j] = val | ||||||||||||
# Otherwise, `out` must be set accordingly if the | ||||||||||||
# value is missing | ||||||||||||
else: | ||||||||||||
val = values[i, j] | ||||||||||||
if _treat_as_na(val, is_datetimelike): | ||||||||||||
val_is_nan = True | ||||||||||||
out[i, j] = val | ||||||||||||
|
||||||||||||
if not val_is_nan: | ||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. does it make sense to implement this as a separate function? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't have a strong opinion about this. The question would be the tradeoff between a bit more complexity/branching vs duplication/increased package size (if we end up adding masked support to a lot more of these grouped algos) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. any guess what the impact on package size is? potential duplication might be addressed by making e.g. refactoring L1340-1347 or L 1302-1308 into helper functions There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Based on rough estimate, binaries generated from groupby.pyx take up ~5% of total |
||||||||||||
mval = accum[lab, j] | ||||||||||||
if compute_max: | ||||||||||||
if val > mval: | ||||||||||||
|
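The masked branch above can be modeled in plain NumPy for the one-column cummin case. This is an illustrative sketch of the kernel's logic, not the pandas implementation; the function name and signature are invented for the example.

```python
import numpy as np

def masked_group_cummin(values, mask, labels, ngroups):
    """Plain-NumPy sketch of the masked grouped cumulative minimum
    (one column). Name and signature are illustrative, not pandas API."""
    accum = np.full(ngroups, np.inf)   # running minimum per group
    out = np.empty(len(values))
    for i in range(len(values)):
        lab = labels[i]
        if lab < 0:                    # entry belongs to no group
            continue
        if mask[i]:
            # Missing entry: skip the accumulator update; this slot of
            # `out` would be hidden by the result mask in pandas.
            out[i] = np.nan
            continue
        accum[lab] = min(accum[lab], values[i])
        out[i] = accum[lab]
    return out

vals = np.array([3.0, 1.0, 5.0, 2.0])
mask = np.array([False, True, False, False])
labels = np.array([0, 0, 0, 0])
res = masked_group_cummin(vals, mask, labels, 1)
print(res)  # [ 3. nan  3.  2.] -- the masked 1.0 never lowers the running min
```

As in the Cython kernel, a masked slot neither reads the underlying value nor updates the per-group accumulator.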
@@ -1323,19 +1351,41 @@ def group_cummin_max(groupby_t[:, ::1] out,
 @cython.wraparound(False)
 def group_cummin(groupby_t[:, ::1] out,
                  ndarray[groupby_t, ndim=2] values,
+                 uint8_t[:, ::1] mask,
                  const intp_t[:] labels,
                  int ngroups,
-                 bint is_datetimelike):
+                 bint is_datetimelike,
+                 bint use_mask):
     """See group_cummin_max.__doc__"""
-    group_cummin_max(out, values, labels, ngroups, is_datetimelike, compute_max=False)
+    group_cummin_max(
+        out,
+        values,
+        mask,
+        labels,
+        ngroups,
+        is_datetimelike,
+        use_mask,
+        compute_max=False
+    )


 @cython.boundscheck(False)
 @cython.wraparound(False)
 def group_cummax(groupby_t[:, ::1] out,
                  ndarray[groupby_t, ndim=2] values,
+                 uint8_t[:, ::1] mask,
                  const intp_t[:] labels,
                  int ngroups,
-                 bint is_datetimelike):
+                 bint is_datetimelike,
+                 bint use_mask):
     """See group_cummin_max.__doc__"""
-    group_cummin_max(out, values, labels, ngroups, is_datetimelike, compute_max=True)
+    group_cummin_max(
+        out,
+        values,
+        mask,
+        labels,
+        ngroups,
+        is_datetimelike,
+        use_mask,
+        compute_max=True
+    )
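At the user level, the point of routing nullable columns through these masked kernels is that a groupby cummin/cummax can operate on the integer data plus its mask instead of round-tripping through float64. A small example of the expected behavior (column and key names are made up; the result dtype depends on the pandas version):

```python
import pandas as pd

# Groupby cummin on a nullable Int64 column: missing values are
# carried through as NA and never participate in the running minimum.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "a"],
        "val": pd.array([3, pd.NA, 2, 1], dtype="Int64"),
    }
)
result = df.groupby("key")["val"].cummin()
print(result.tolist())  # [3, <NA>, 2, 1]: NA is propagated, not treated as a min
```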
File: pandas/core/groupby/ops.py

@@ -72,6 +72,10 @@
     maybe_fill,
 )

+from pandas.core.arrays.masked import (
+    BaseMaskedArray,
+    BaseMaskedDtype,
+)
 from pandas.core.base import SelectionMixin
 import pandas.core.common as com
 from pandas.core.frame import DataFrame
@@ -118,6 +122,11 @@
     },
 }

+_CYTHON_MASKED_FUNCTIONS = {
+    "cummin",
+    "cummax",
+}
+

 @functools.lru_cache(maxsize=None)
 def _get_cython_function(kind: str, how: str, dtype: np.dtype, is_numeric: bool):
@@ -155,6 +164,10 @@ def _get_cython_function(kind: str, how: str, dtype: np.dtype, is_numeric: bool)
     return func


+def cython_function_uses_mask(kind: str) -> bool:
+    return kind in _CYTHON_MASKED_FUNCTIONS
+
+
 class BaseGrouper:
     """
     This is an internal Grouper class, which actually holds
@@ -574,9 +587,52 @@ def _ea_wrap_cython_operation(
                 f"function is not implemented for this dtype: {values.dtype}"
             )

+    @final
+    def _masked_ea_wrap_cython_operation(
+        self,
+        kind: str,
+        values: BaseMaskedArray,
+        how: str,
+        axis: int,
+        min_count: int = -1,
+        **kwargs,
+    ) -> BaseMaskedArray:
+        """
+        Equivalent of `_ea_wrap_cython_operation`, but optimized for masked EA's
+        and cython algorithms which accept a mask.
+        """
+        orig_values = values
+
+        # isna just directly returns self._mask, so copy here to prevent
+        # modifying the original
+        mask = isna(values).copy()
+        arr = values._data
+
+        if is_integer_dtype(values.dtype) or is_bool_dtype(values.dtype):
+            # IntegerArray or BooleanArray
+            arr = ensure_int_or_float(arr)
+
+        res_values = self._cython_operation(
+            kind, arr, how, axis, min_count, mask=mask, **kwargs
+        )
+        dtype = maybe_cast_result_dtype(orig_values.dtype, how)
+        assert isinstance(dtype, BaseMaskedDtype)
+        cls = dtype.construct_array_type()
+
+        return cls(
+            res_values.astype(dtype.type, copy=False), mask.astype(bool, copy=False)
+        )
+

Review thread on `_masked_ea_wrap_cython_operation`:
- why is this an entirely different function? please integrate into the existing infra
- This is a different function because it's specific to MaskedArrays. Having this as a separate function is consistent with how it's currently implemented IMO (with a similar separate function for generic EAs). It could also be another …
- In an initial version I tried to fold this into …
- see my comments. This MUST integrate with the existing infrastructure (or refactor that). Duplicating is -1
- If you don't like how the added code is structured right now, please make a concrete suggestion of how you would do it differently.

Review thread on `ensure_int_or_float`:
- FWIW I'm planning to kill off this function; for EAs this is always just …
- Thanks for bringing up - realized this whole condition can be simplified since we actually have an ndarray at this point
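The wrap-up step at the end of `_masked_ea_wrap_cython_operation` rebuilds a masked array from the kernel's raw output plus the mask. A hedged sketch using the public `IntegerArray` constructor (the arrays below are invented stand-ins for `res_values` and `mask`; pandas itself goes through `dtype.construct_array_type()` rather than naming `IntegerArray` directly):

```python
import numpy as np
import pandas as pd

# Raw kernel output plus mask -> nullable result. The value under a
# True mask slot is garbage and is hidden by the mask.
res_values = np.array([3, 0, 2, 1], dtype=np.int64)   # kernel output
mask = np.array([False, True, False, False])          # True marks missing

rebuilt = pd.arrays.IntegerArray(res_values, mask)
print(rebuilt)  # elements 3, <NA>, 2, 1 with dtype Int64
```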
     @final
     def _cython_operation(
-        self, kind: str, values, how: str, axis: int, min_count: int = -1, **kwargs
+        self,
+        kind: str,
+        values,
+        how: str,
+        axis: int,
+        min_count: int = -1,
+        mask: np.ndarray | None = None,
+        **kwargs,
     ) -> ArrayLike:
         """
         Returns the values of a cython operation.
@@ -598,10 +654,16 @@ def _cython_operation(
         # if not raise NotImplementedError
         self._disallow_invalid_ops(dtype, how, is_numeric)

+        func_uses_mask = cython_function_uses_mask(how)
         if is_extension_array_dtype(dtype):
-            return self._ea_wrap_cython_operation(
-                kind, values, how, axis, min_count, **kwargs
-            )
+            if isinstance(values, BaseMaskedArray) and func_uses_mask:
+                return self._masked_ea_wrap_cython_operation(
+                    kind, values, how, axis, min_count, **kwargs
+                )
+            else:
+                return self._ea_wrap_cython_operation(
+                    kind, values, how, axis, min_count, **kwargs
+                )

         is_datetimelike = needs_i8_conversion(dtype)

Review thread on the dispatch:
- i really don't understand all of this code duplication. this is adding huge complexity. please reduce it.
- Jeff, did you actually read the previous responses to your similar comment? (https://github.com/pandas-dev/pandas/pull/40651/files#r603319910) Can you then please answer there to the concrete reasons given.
- yes, and it's a terrible pattern.
- this duplication of code is ridiculous. We have a VERY large codebase. Having this kind of separate logic is amazingly confusing & is humongous tech debt. This is heavily used code and needs to be carefully modified.
- I understand the concern about adding code complexity - my thinking was that if the goal is for nullable types to become the default in pandas, then direct support makes sense. And in that case, nullable types would need to be special-cased somewhere, and I think the separate function is cleaner than interleaving in … If direct support for nullable dtypes is not desired, we can just close this. If it is, I'll keep trying to think of ways to achieve this without adding more code, but any suggestions there would be welcome!
- Proper support for nullable dtypes is certainly desired (how to add it exactly can of course be discussed), so thanks a lot @mzeitlin11 for your efforts here. AFAIK, it's correct we need some special casing for it somewhere (that's the whole point of this PR: to add special handling for it). @jreback please try to stay constructive (e.g. answer our arguments or provide concrete suggestions on where you would put it / how you would do it differently) and please mind your language (there is no need to call the approach taken by a contributor "terrible").
- "This decision has not been made." … Agreed. … As Joris correctly pointed out, that is not viable ATM. I think a lot of this dispatch logic eventually belongs in WrappedCythonOp (which I've been vaguely planning on doing next time there aren't any open PRs touching this code), at which point we can reconsider flattening this.
- @jbrockmendel if you plan further refactoring of this code, I'm happy to just mothball this PR for now. The real benefit won't come until more groupby algos allow a mask on this path anyway, so it's not worth adding now if it's just going to cause more pain in future refactoring. I also like the idea of approach 5 instead of this - could start looking into that if you think it's a promising direction.
- From today's call, I think the plan is to move forward with this first. Long-term I think this is the right way to go to get the general case right, so I'd encourage you, if you're interested, to try implementing this on the EA in separate PR(s).
@@ -613,7 +675,7 @@ def _cython_operation(
         elif is_integer_dtype(dtype):
             # we use iNaT for the missing value on ints
             # so pre-convert to guard this condition
-            if (values == iNaT).any():
+            if mask is None and (values == iNaT).any():
                 values = ensure_float64(values)
             else:
                 values = ensure_int_or_float(values)
@@ -628,6 +690,9 @@ def _cython_operation(
         swapped = False
         if vdim == 1:
             values = values[:, None]
+            if mask is not None:
+                mask = mask[:, None]
+
             out_shape = (self.ngroups, arity)
         else:
             if axis > 0:
@@ -641,6 +706,10 @@ def _cython_operation(
             out_shape = (self.ngroups,) + values.shape[1:]

         func, values = self._get_cython_func_and_vals(kind, how, values, is_numeric)
+        use_mask = mask is not None
+        if func_uses_mask:
+            if mask is None:
+                mask = np.zeros_like(values, dtype=np.uint8, order="C")

         if how == "rank":
             out_dtype = "float"
@@ -650,25 +719,23 @@ def _cython_operation(
         else:
             out_dtype = "object"

-        codes, _, _ = self.group_info
-
         if kind == "aggregate":
+            codes, _, _ = self.group_info
             result = maybe_fill(np.empty(out_shape, dtype=out_dtype))
             counts = np.zeros(self.ngroups, dtype=np.int64)
             result = self._aggregate(result, counts, values, codes, func, min_count)
         elif kind == "transform":
             result = maybe_fill(np.empty(values.shape, dtype=out_dtype))

             # TODO: min_count
             result = self._transform(
-                result, values, codes, func, is_datetimelike, **kwargs
+                result, values, func, is_datetimelike, use_mask, mask, **kwargs
             )

-        if is_integer_dtype(result.dtype) and not is_datetimelike:
-            mask = result == iNaT
-            if mask.any():
+        if not use_mask and is_integer_dtype(result.dtype) and not is_datetimelike:
+            result_mask = result == iNaT
+            if result_mask.any():
                 result = result.astype("float64")
-                result[mask] = np.nan
+                result[result_mask] = np.nan

         if kind == "aggregate" and self._filter_empty_groups and not counts.all():
             assert result.ndim != 2
@@ -704,11 +771,30 @@ def _aggregate(

     @final
     def _transform(
-        self, result, values, comp_ids, transform_func, is_datetimelike: bool, **kwargs
-    ):
+        self,
+        result: np.ndarray,
+        values: np.ndarray,
+        transform_func,
+        is_datetimelike: bool,
+        use_mask: bool,
+        mask: np.ndarray | None,
+        **kwargs,
+    ) -> np.ndarray:

-        _, _, ngroups = self.group_info
-        transform_func(result, values, comp_ids, ngroups, is_datetimelike, **kwargs)
+        comp_ids, _, ngroups = self.group_info
+        if mask is not None:
+            transform_func(
+                result,
+                values,
+                mask,
+                comp_ids,
+                ngroups,
+                is_datetimelike,
+                use_mask,
+                **kwargs,
+            )
+        else:
+            transform_func(result, values, comp_ids, ngroups, is_datetimelike, **kwargs)

         return result
Review thread on the benchmark setup:
- is this costly? worth using setup_cache?
- Looks to take ~0.25s. So might be worth caching, but it appears setup_cache can't be parameterized, so it would have to ugly up the benchmark a bit.
- can we just make N // 10?
- yep, have shrunk the benchmark