BUG/ENH: group cummin/max handle skipna #41854

mzeitlin11 · 2021-06-07T15:38:49Z

xref BUG: cummin, max, sum not valid when used with agg + NaN not skipped. #34047
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Built on #41853

Previously skipna was just ignored, so there was a behavior difference between cummin/max and other cumulative functions eg

>>> df = pd.DataFrame({"a": [1, np.nan, 1]})
>>> df.groupby([1, 1, 1]).cumsum(skipna=False)
     a
0  1.0
1  NaN
2  NaN
>>> df.groupby([1, 1, 1]).cummax(skipna=False)
     a
0  1.0
1  NaN
2  1.0

Benchmarks unaffected:

       before           after         ratio
     [3f67dc33]       [3c869ce4]
     <master>         <group_cummin_max_skipna>
         21.9±1ms         21.1±2ms     0.96  groupby.CumminMax.time_frame_transform('Float64', 'cummax')
         22.8±2ms         23.4±2ms     1.02  groupby.CumminMax.time_frame_transform('Float64', 'cummin')
         19.3±4ms         18.1±1ms     0.94  groupby.CumminMax.time_frame_transform('Int64', 'cummax')
         26.3±2ms         22.2±3ms    ~0.84  groupby.CumminMax.time_frame_transform('Int64', 'cummin')
       19.7±0.7ms         13.7±1ms    ~0.70  groupby.CumminMax.time_frame_transform('float64', 'cummax')
         16.4±1ms       13.9±0.2ms    ~0.84  groupby.CumminMax.time_frame_transform('float64', 'cummin')
         34.6±3ms         31.1±4ms    ~0.90  groupby.CumminMax.time_frame_transform('int64', 'cummax')
       30.5±0.6ms       29.1±0.4ms     0.95  groupby.CumminMax.time_frame_transform('int64', 'cummin')
         23.3±2ms         21.0±3ms    ~0.90  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummax')
         21.1±3ms         21.3±3ms     1.01  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummin')
         21.1±1ms         22.2±2ms     1.06  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummax')
         19.7±2ms         20.1±3ms     1.02  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummin')
       16.1±0.8ms         13.6±1ms    ~0.85  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummax')
         13.9±1ms       14.0±0.8ms     1.01  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummin')
         15.8±1ms       14.2±0.2ms    ~0.90  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummax')
       15.0±0.6ms       16.6±0.9ms    ~1.11  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummin')

mzeitlin11 · 2021-06-07T15:43:09Z

pandas/core/groupby/groupby.py

        if axis != 0:
            return self.apply(lambda x: np.minimum.accumulate(x, axis))

-        return self._cython_transform("cummin", numeric_only=False)
+        return self._cython_transform("cummin", numeric_only=False, skipna=skipna)


Ideally cummin/max could handle kwargs the same way as cumsum does, but the problem is the forced numeric_only=False.

As a followup change, it would be nice to remove this inconsistency (especially since numeric_only=True would just be ignored). Would this require a deprecation cycle, or could current behavior of ignoring numeric_only=True be considered a bug?

Making this change could enable actually exposing skipna and numeric_only in the signatures of these cumulative functions, instead of having docs that just show *args, **kwargs

or could current behavior of ignoring numeric_only=True be considered a bug?

I think so, yes.

Making this change could enable actually exposing skipna and numeric_only in the signatures of these cumulative functions, instead of having docs that just show *args, **kwargs

Yah, looks like we currently silently ignore args/kwargs, which isnt great.

Ok sounds good, will plan to leave as followon since should be an independent improvement from these changes

…skipna

jreback · 2021-06-18T01:59:08Z

does this close #34047 ? (i guess cumsum missing?)

jreback · 2021-06-18T01:59:17Z

cc @rhshadrach @jbrockmendel

mzeitlin11 · 2021-06-18T17:01:50Z

does this close #34047 ? (i guess cumsum missing?)

cumsum respects skipna (at least the groupby version). I don't think it closes the issue because the issue was primarily about using agg with these functions

jbrockmendel · 2021-06-21T23:52:29Z

pandas/tests/groupby/test_function.py

+def test_cummin_max_skipna(method, dtype, groups, expected_data):
+    # GH-34047
+    df = DataFrame({"a": Series([1, None, 2], dtype=dtype)})
+    result = getattr(df.groupby(groups)["a"], method)(skipna=False)


nitpick: can you do this on multiple lines, e.g. gb = df.groupby(...)? it makes it more obvious to me what behavior is actually being tested

Thanks, agreed that is clearer

…skipna

rhshadrach

lgtm

rhshadrach · 2021-07-23T02:08:01Z

@jbrockmendel - sounds like you're good here?

@mzeitlin11 - can you merge master.

…skipna

jbrockmendel · 2021-07-24T20:12:36Z

pandas/core/groupby/groupby.py

@@ -2784,10 +2784,11 @@ def cummin(self, axis=0, **kwargs):
        -------
        Series or DataFrame
        """
+        skipna = kwargs.get("skipna", True)


can this go in the signature explicitly?

Some context in #41854 (comment).

Happy to make that change here, was planning on doing in followup to make skipna, numeric_only consistent for cumsum, cummin/max

yeah i agree we want this in the signature, can be a followup is ok.

jbrockmendel · 2021-07-24T20:13:20Z

pandas/_libs/groupby.pyx

        intp_t lab

    N, K = (<object>values).shape
+    seen_na = np.zeros((<object>accum).shape[0], dtype=np.uint8)


can we avoid initializing this in some cases?

Not sure here since we can't tell if a missing value is present without looking through the whole array. I suppose we could only allocate as soon as a missing value is seen, but that would add some complexity since algo is wrapped in a nogil.

Hopefully this is not a big memory addition since it is ngroups * ncols bytes vs the nrow * ncols * 4 for the values (since groupby_t only covers (int64/float64/32)

jbrockmendel · 2021-07-24T20:14:37Z

pandas/_libs/groupby.pyx

-                    if compute_max:
-                        if val > mval:
-                            accum[lab, j] = mval = val
+                if not skipna and seen_na[lab]:


Noticing duplication b/t this and the non-masked version above. in a recent PR to add masked support for min/max i found that combining the two functions (and so doing a check for mask is None inside the loop) didn't make a noticeable perf difference)

Agreed about the duplication not being great - split these in #41853 because when I originally tried to make this change with one combined algo, the logic ended up getting pretty convoluted, so I thought the tradeoff was worth it (but agreed about perf, saw no perf improvement from avoiding the mask is None check in #41853)

I'd hope that at some point when more masked algos are supported, it'll make sense to keep only the masked kernels (and compute the mask outside, pass it in regardless of type). This would help avoid some of ugliness around determining nulls with stuff like datetimelike.

makes sense, better to de-duplicate in a dedicated pass anyway

jbrockmendel · 2021-07-24T20:14:54Z

pandas/_libs/groupby.pyx

+        na_val = 0
+
+    N, K = (<object>values).shape
+    seen_na = np.zeros((<object>accum).shape[0], dtype=np.uint8)


can we avoid allocating in some cases?

Have changed to only allocate if a missing value is possible. Could avoid more aggressively perhaps (eg only as soon as a missing value is seen), see comment for the masked case

jreback

looks fine, conflict and can you rebase and ping on green just to be sure.

jreback · 2021-07-28T01:56:02Z

pandas/core/groupby/groupby.py

@@ -2784,10 +2784,11 @@ def cummin(self, axis=0, **kwargs):
        -------
        Series or DataFrame
        """
+        skipna = kwargs.get("skipna", True)


yeah i agree we want this in the signature, can be a followup is ok.

…skipna

jreback · 2021-07-31T00:58:22Z

thanks @mzeitlin11 very nice!

mzeitlin11 added 7 commits June 6, 2021 16:56

precommit fixup

97d7e11

precommit fixup

ee7a7ef

wip

885f8ae

Remove unused

51f8f9c

Merge ref branch

d7cf5b5

Add multigroup tests

21e5157

Cleaner kwargs handling

7858fc6

mzeitlin11 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jun 7, 2021

mzeitlin11 commented Jun 7, 2021

View reviewed changes

mzeitlin11 added 5 commits June 7, 2021 11:51

Fix build warning

3c869ce

Merge remote-tracking branch 'upstream/master' into group_cummin_max_…

e8a08bf

…skipna

Fixup merge conflict and move whatsnew

c448837

Shrink diff

9333d58

Merge remote-tracking branch 'upstream/master' into group_cummin_max_…

1aa3586

…skipna

jreback added this to the 1.4 milestone Jun 18, 2021

jbrockmendel reviewed Jun 21, 2021

View reviewed changes

mzeitlin11 added 3 commits June 25, 2021 21:26

Merge remote-tracking branch 'upstream/master' into group_cummin_max_…

856a1b2

…skipna

Clean up test

dfe8b73

Merge remote-tracking branch 'upstream/master' into group_cummin_max_…

22bef11

…skipna

rhshadrach approved these changes Jul 23, 2021

View reviewed changes

Merge remote-tracking branch 'upstream/master' into group_cummin_max_…

0d57fc0

…skipna

jbrockmendel reviewed Jul 24, 2021

View reviewed changes

Avoid seen na alloc, fix multicol case

5ccdf94

jreback approved these changes Jul 28, 2021

View reviewed changes

mzeitlin11 added 2 commits July 27, 2021 22:03

Merge remote-tracking branch 'upstream/master' into group_cummin_max_…

fb76640

…skipna

Merge remote-tracking branch 'upstream/master' into group_cummin_max_…

a3db4de

…skipna

jreback merged commit d0eba0a into pandas-dev:master Jul 31, 2021

mzeitlin11 deleted the group_cummin_max_skipna branch July 31, 2021 01:25

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

BUG/ENH: group cummin/max handle skipna (pandas-dev#41854)

0b35035

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG/ENH: group cummin/max handle skipna #41854

BUG/ENH: group cummin/max handle skipna #41854

mzeitlin11 commented Jun 7, 2021 •

edited

Loading

mzeitlin11 Jun 7, 2021

jbrockmendel Jun 21, 2021

mzeitlin11 Jun 26, 2021

jreback commented Jun 18, 2021

jreback commented Jun 18, 2021

mzeitlin11 commented Jun 18, 2021

jbrockmendel Jun 21, 2021

mzeitlin11 Jun 26, 2021

rhshadrach left a comment

rhshadrach commented Jul 23, 2021

jbrockmendel Jul 24, 2021

mzeitlin11 Jul 25, 2021

jreback Jul 28, 2021

jbrockmendel Jul 24, 2021

mzeitlin11 Jul 25, 2021

jbrockmendel Jul 24, 2021

mzeitlin11 Jul 25, 2021

jbrockmendel Jul 25, 2021

jbrockmendel Jul 24, 2021

mzeitlin11 Jul 25, 2021

jreback left a comment

jreback Jul 28, 2021

jreback commented Jul 31, 2021

BUG/ENH: group cummin/max handle skipna #41854

BUG/ENH: group cummin/max handle skipna #41854

Conversation

mzeitlin11 commented Jun 7, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jun 18, 2021

jreback commented Jun 18, 2021

mzeitlin11 commented Jun 18, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rhshadrach left a comment

Choose a reason for hiding this comment

rhshadrach commented Jul 23, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jul 31, 2021

mzeitlin11 commented Jun 7, 2021 •

edited

Loading