-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG/ENH: group cummin/max handle skipna #41854
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG/ENH: group cummin/max handle skipna #41854
Conversation
if axis != 0: | ||
return self.apply(lambda x: np.minimum.accumulate(x, axis)) | ||
|
||
return self._cython_transform("cummin", numeric_only=False) | ||
return self._cython_transform("cummin", numeric_only=False, skipna=skipna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally cummin/max
could handle kwargs
the same way as cumsum
does, but the problem is the forced numeric_only=False
.
As a followup change, it would be nice to remove this inconsistency (especially since numeric_only=True
would just be ignored). Would this require a deprecation cycle, or could current behavior of ignoring numeric_only=True
be considered a bug?
Making this change could enable actually exposing skipna
and numeric_only
in the signatures of these cumulative functions, instead of having docs that just show *args
, **kwargs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or could current behavior of ignoring numeric_only=True be considered a bug?
I think so, yes.
Making this change could enable actually exposing skipna and numeric_only in the signatures of these cumulative functions, instead of having docs that just show *args, **kwargs
Yah, looks like we currently silently ignore args/kwargs, which isnt great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok sounds good, will plan to leave as followon since should be an independent improvement from these changes
does this close #34047 ? (i guess cumsum missing?) |
|
def test_cummin_max_skipna(method, dtype, groups, expected_data): | ||
# GH-34047 | ||
df = DataFrame({"a": Series([1, None, 2], dtype=dtype)}) | ||
result = getattr(df.groupby(groups)["a"], method)(skipna=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: can you do this on multiple lines, e.g. gb = df.groupby(...)
? it makes it more obvious to me what behavior is actually being tested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, agreed that is clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@jbrockmendel - sounds like you're good here? @mzeitlin11 - can you merge master. |
@@ -2784,10 +2784,11 @@ def cummin(self, axis=0, **kwargs): | |||
------- | |||
Series or DataFrame | |||
""" | |||
skipna = kwargs.get("skipna", True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this go in the signature explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some context in #41854 (comment).
Happy to make that change here, was planning on doing in followup to make skipna
, numeric_only
consistent for cumsum
, cummin/max
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i agree we want this in the signature, can be a followup is ok.
pandas/_libs/groupby.pyx
Outdated
intp_t lab | ||
|
||
N, K = (<object>values).shape | ||
seen_na = np.zeros((<object>accum).shape[0], dtype=np.uint8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid initializing this in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure here since we can't tell if a missing value is present without looking through the whole array. I suppose we could only allocate as soon as a missing value is seen, but that would add some complexity since algo is wrapped in a nogil
.
Hopefully this is not a big memory addition since it is ngroups * ncols
bytes vs the nrow * ncols * 4
for the values (since groupby_t
only covers (int64/float64/32)
pandas/_libs/groupby.pyx
Outdated
if compute_max: | ||
if val > mval: | ||
accum[lab, j] = mval = val | ||
if not skipna and seen_na[lab]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticing duplication b/t this and the non-masked version above. in a recent PR to add masked support for min/max i found that combining the two functions (and so doing a check for mask is None
inside the loop) didn't make a noticeable perf difference)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed about the duplication not being great - split these in #41853 because when I originally tried to make this change with one combined algo, the logic ended up getting pretty convoluted, so I thought the tradeoff was worth it (but agreed about perf, saw no perf improvement from avoiding the mask is None
check in #41853)
I'd hope that at some point when more masked algos are supported, it'll make sense to keep only the masked kernels (and compute the mask outside, pass it in regardless of type). This would help avoid some of ugliness around determining nulls with stuff like datetimelike
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, better to de-duplicate in a dedicated pass anyway
pandas/_libs/groupby.pyx
Outdated
na_val = 0 | ||
|
||
N, K = (<object>values).shape | ||
seen_na = np.zeros((<object>accum).shape[0], dtype=np.uint8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we avoid allocating in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have changed to only allocate if a missing value is possible. Could avoid more aggressively perhaps (eg only as soon as a missing value is seen), see comment for the masked case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks fine, conflict and can you rebase and ping on green just to be sure.
@@ -2784,10 +2784,11 @@ def cummin(self, axis=0, **kwargs): | |||
------- | |||
Series or DataFrame | |||
""" | |||
skipna = kwargs.get("skipna", True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i agree we want this in the signature, can be a followup is ok.
thanks @mzeitlin11 very nice! |
Built on #41853
Previously
skipna
was just ignored, so there was a behavior difference between cummin/max and other cumulative functions egBenchmarks unaffected: