Skip to content

BUG/ENH: group cummin/max handle skipna #41854

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jul 31, 2021

Conversation

mzeitlin11
Copy link
Member

@mzeitlin11 mzeitlin11 commented Jun 7, 2021

Built on #41853

Previously skipna was just ignored, so there was a behavior difference between cummin/max and other cumulative functions eg

>>> df = pd.DataFrame({"a": [1, np.nan, 1]})
>>> df.groupby([1, 1, 1]).cumsum(skipna=False)
     a
0  1.0
1  NaN
2  NaN
>>> df.groupby([1, 1, 1]).cummax(skipna=False)
     a
0  1.0
1  NaN
2  1.0

Benchmarks unaffected:

       before           after         ratio
     [3f67dc33]       [3c869ce4]
     <master>         <group_cummin_max_skipna>
         21.9±1ms         21.1±2ms     0.96  groupby.CumminMax.time_frame_transform('Float64', 'cummax')
         22.8±2ms         23.4±2ms     1.02  groupby.CumminMax.time_frame_transform('Float64', 'cummin')
         19.3±4ms         18.1±1ms     0.94  groupby.CumminMax.time_frame_transform('Int64', 'cummax')
         26.3±2ms         22.2±3ms    ~0.84  groupby.CumminMax.time_frame_transform('Int64', 'cummin')
       19.7±0.7ms         13.7±1ms    ~0.70  groupby.CumminMax.time_frame_transform('float64', 'cummax')
         16.4±1ms       13.9±0.2ms    ~0.84  groupby.CumminMax.time_frame_transform('float64', 'cummin')
         34.6±3ms         31.1±4ms    ~0.90  groupby.CumminMax.time_frame_transform('int64', 'cummax')
       30.5±0.6ms       29.1±0.4ms     0.95  groupby.CumminMax.time_frame_transform('int64', 'cummin')
         23.3±2ms         21.0±3ms    ~0.90  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummax')
         21.1±3ms         21.3±3ms     1.01  groupby.CumminMax.time_frame_transform_many_nulls('Float64', 'cummin')
         21.1±1ms         22.2±2ms     1.06  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummax')
         19.7±2ms         20.1±3ms     1.02  groupby.CumminMax.time_frame_transform_many_nulls('Int64', 'cummin')
       16.1±0.8ms         13.6±1ms    ~0.85  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummax')
         13.9±1ms       14.0±0.8ms     1.01  groupby.CumminMax.time_frame_transform_many_nulls('float64', 'cummin')
         15.8±1ms       14.2±0.2ms    ~0.90  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummax')
       15.0±0.6ms       16.6±0.9ms    ~1.11  groupby.CumminMax.time_frame_transform_many_nulls('int64', 'cummin')

@mzeitlin11 mzeitlin11 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jun 7, 2021
if axis != 0:
return self.apply(lambda x: np.minimum.accumulate(x, axis))

return self._cython_transform("cummin", numeric_only=False)
return self._cython_transform("cummin", numeric_only=False, skipna=skipna)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally cummin/max could handle kwargs the same way as cumsum does, but the problem is the forced numeric_only=False.

As a followup change, it would be nice to remove this inconsistency (especially since numeric_only=True would just be ignored). Would this require a deprecation cycle, or could current behavior of ignoring numeric_only=True be considered a bug?

Making this change could enable actually exposing skipna and numeric_only in the signatures of these cumulative functions, instead of having docs that just show *args, **kwargs

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or could current behavior of ignoring numeric_only=True be considered a bug?

I think so, yes.

Making this change could enable actually exposing skipna and numeric_only in the signatures of these cumulative functions, instead of having docs that just show *args, **kwargs

Yah, looks like we currently silently ignore args/kwargs, which isnt great.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok sounds good, will plan to leave as followon since should be an independent improvement from these changes

@jreback jreback added this to the 1.4 milestone Jun 18, 2021
@jreback
Copy link
Contributor

jreback commented Jun 18, 2021

does this close #34047 ? (i guess cumsum missing?)

@jreback
Copy link
Contributor

jreback commented Jun 18, 2021

cc @rhshadrach @jbrockmendel

@mzeitlin11
Copy link
Member Author

does this close #34047 ? (i guess cumsum missing?)

cumsum respects skipna (at least the groupby version). I don't think it closes the issue because the issue was primarily about using agg with these functions

def test_cummin_max_skipna(method, dtype, groups, expected_data):
# GH-34047
df = DataFrame({"a": Series([1, None, 2], dtype=dtype)})
result = getattr(df.groupby(groups)["a"], method)(skipna=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: can you do this on multiple lines, e.g. gb = df.groupby(...)? it makes it more obvious to me what behavior is actually being tested

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, agreed that is clearer

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach
Copy link
Member

@jbrockmendel - sounds like you're good here?

@mzeitlin11 - can you merge master.

@@ -2784,10 +2784,11 @@ def cummin(self, axis=0, **kwargs):
-------
Series or DataFrame
"""
skipna = kwargs.get("skipna", True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this go in the signature explicitly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some context in #41854 (comment).

Happy to make that change here, was planning on doing in followup to make skipna, numeric_only consistent for cumsum, cummin/max

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i agree we want this in the signature, can be a followup is ok.

intp_t lab

N, K = (<object>values).shape
seen_na = np.zeros((<object>accum).shape[0], dtype=np.uint8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid initializing this in some cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure here since we can't tell if a missing value is present without looking through the whole array. I suppose we could only allocate as soon as a missing value is seen, but that would add some complexity since algo is wrapped in a nogil.

Hopefully this is not a big memory addition since it is ngroups * ncols bytes vs the nrow * ncols * 4 for the values (since groupby_t only covers (int64/float64/32)

if compute_max:
if val > mval:
accum[lab, j] = mval = val
if not skipna and seen_na[lab]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticing duplication b/t this and the non-masked version above. in a recent PR to add masked support for min/max i found that combining the two functions (and so doing a check for mask is None inside the loop) didn't make a noticeable perf difference)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed about the duplication not being great - split these in #41853 because when I originally tried to make this change with one combined algo, the logic ended up getting pretty convoluted, so I thought the tradeoff was worth it (but agreed about perf, saw no perf improvement from avoiding the mask is None check in #41853)

I'd hope that at some point when more masked algos are supported, it'll make sense to keep only the masked kernels (and compute the mask outside, pass it in regardless of type). This would help avoid some of ugliness around determining nulls with stuff like datetimelike.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, better to de-duplicate in a dedicated pass anyway

na_val = 0

N, K = (<object>values).shape
seen_na = np.zeros((<object>accum).shape[0], dtype=np.uint8)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we avoid allocating in some cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have changed to only allocate if a missing value is possible. Could avoid more aggressively perhaps (eg only as soon as a missing value is seen), see comment for the masked case

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine, conflict and can you rebase and ping on green just to be sure.

@@ -2784,10 +2784,11 @@ def cummin(self, axis=0, **kwargs):
-------
Series or DataFrame
"""
skipna = kwargs.get("skipna", True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i agree we want this in the signature, can be a followup is ok.

@jreback jreback merged commit d0eba0a into pandas-dev:master Jul 31, 2021
@jreback
Copy link
Contributor

jreback commented Jul 31, 2021

thanks @mzeitlin11 very nice!

@mzeitlin11 mzeitlin11 deleted the group_cummin_max_skipna branch July 31, 2021 01:25
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants