ENH: Add support for min_count keyword for Resample and Groupby functions #37870

phofl · 2020-11-15T19:50:09Z

closes BUG: downsampling with last doesn't accept "min_count" keyword #37768
closes BUG: Groupby.last accepts min_count but implementation raises #37821
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This would add support for the documented keyword `min_count`. If we decide to move forward with this, we have to decide what to do in the following case:

df = DataFrame({"a": [1] * 3, "b": [np.nan] * 3})
result = df.groupby("a").func(min_count=0)

first and last return 0.0, which I think is fine and consistent with sum, but min returns inf while max returns -inf.
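One plausible reading of the inf / -inf results is that a running minimum or maximum starts from a sentinel and, when a group has no valid observations, the sentinel is returned unchanged. The sketch below illustrates that idea in plain Python; `running_min` and `running_max` are hypothetical helpers and the sentinel choice is an assumption, not a copy of the Cython code in `pandas/_libs/groupby.pyx`.

```python
import math


def running_min(values):
    """Return (minimum of the non-NaN values, number of non-NaN values)."""
    current = math.inf  # assumed starting sentinel for a minimum
    nobs = 0
    for v in values:
        if not math.isnan(v):
            nobs += 1
            current = min(current, v)
    return current, nobs


def running_max(values):
    """Return (maximum of the non-NaN values, number of non-NaN values)."""
    current = -math.inf  # assumed starting sentinel for a maximum
    nobs = 0
    for v in values:
        if not math.isnan(v):
            nobs += 1
            current = max(current, v)
    return current, nobs


all_nan = [math.nan] * 3  # mirrors column "b" in the example above
min_result = running_min(all_nan)  # sentinel survives because nobs is 0
max_result = running_max(all_nan)
```

If the result is not masked when `nobs` is 0, the caller sees the sentinel rather than a value taken from the data.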

Default parameter is set to 1 to avoid backwards incompatible changes. We should keep this at least until 2.0

If we decide to not implement min_count here, we should adjust the docs and the function signatures for the groupby functions.

cc @rhshadrach

rhshadrach · 2020-11-18T03:30:30Z

Thanks for the PR @phofl. It seems to me that first, last, min, and max should all return nan even when min_count is 0. Sum behaves differently because the empty sum is 0 (likewise, the empty product is 1).
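The distinction drawn here can be checked with Python built-ins alone. This is only an illustration of the empty-reduction argument, not pandas behavior: an empty sum and an empty product have well-defined values, while Python's own `min` refuses an empty input.

```python
import math

values = []  # a group with no valid (non-NaN) entries left

empty_sum = sum(values)         # additive identity
empty_prod = math.prod(values)  # multiplicative identity (Python 3.8+)

try:
    empty_min = min(values)
except ValueError:
    # The built-in min raises on an empty sequence; nan is used here
    # as the "no result" marker for the sake of the illustration.
    empty_min = math.nan
```

The same reasoning applies to first and last: with no valid entries there is no element to pick.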

rhshadrach · 2020-11-18T03:34:40Z

> Default parameter is set to 1 to avoid backwards incompatible changes. We should keep this at least until 2.0

Do you perhaps mean -1? Otherwise, I don't follow.

DiSchi123 · 2020-11-18T12:52:04Z

My 2 cents: min_count=0 is the same as not specifying min_count. A min_count of x says you need at least x non-nan values in the bin to show a result; otherwise the result is nan. With a min_count of 0, isn't that like saying show me the sum, first, etc. regardless of how many nans there are? Maybe I am missing something?

But while it is important to decide what min_count=0 should show, it matters less for the implementation: imo you would not specify it, because as mentioned it is like not specifying it at all. I am ok with any decision.
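The rule described in this comment (at least `min_count` non-nan values in the bin, otherwise nan) can be written down as a small pure-Python sketch. `reduce_group` is a hypothetical helper for illustration, not a pandas function.

```python
import math


def reduce_group(values, reducer, min_count):
    """Apply reducer to the non-NaN values, or return nan when fewer
    than min_count of them are present."""
    valid = [v for v in values if not math.isnan(v)]
    if len(valid) < min_count:
        return math.nan
    return reducer(valid)


all_nan = [math.nan, math.nan, math.nan]
one_valid = [2.0, math.nan, math.nan]

sum_all_nan_mc0 = reduce_group(all_nan, sum, 0)      # empty sum
sum_all_nan_mc1 = reduce_group(all_nan, sum, 1)      # below threshold
sum_one_valid_mc1 = reduce_group(one_valid, sum, 1)  # threshold met
sum_one_valid_mc2 = reduce_group(one_valid, sum, 2)  # below threshold
```

Under this rule, min_count=0 and min_count=1 differ only for a bin with no valid values, which is exactly the case the PR description asks about.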

phofl · 2020-11-22T01:24:32Z

cc @rhshadrach If we want to return nan in this case, we can keep -1 as the default.

rhshadrach

Overall looks really good, just some minor comments on tests.

pandas/tests/groupby/aggregate/test_other.py

pandas/_libs/groupby.pyx

doc/source/whatsnew/v1.2.0.rst

rhshadrach

cc @jreback @jbrockmendel

jbrockmendel · 2020-11-23T03:20:48Z

pandas/_libs/groupby.pyx

@@ -1411,7 +1407,7 @@ def group_min(groupby_t[:, :] out,

        for i in range(ncounts):
            for j in range(K):
-                if nobs[i, j] == 0:
+                if nobs[i, j] < min_count or nobs[i, j] == 0:


Can you avoid the double lookup/comparison here by setting min_count = max(min_count, 1) outside the loop?
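The suggestion relies on the two conditions agreeing for every non-negative observation count. A brute-force check in plain Python, with hypothetical function names and an arbitrary small range of inputs, looks like this:

```python
def is_missing_original(nobs, min_count):
    # Condition from the diff: two comparisons per cell.
    return nobs < min_count or nobs == 0


def is_missing_hoisted(nobs, effective_min_count):
    # Single comparison, with the clamping done once outside the loop.
    return nobs < effective_min_count


mismatches = []
for min_count in range(-1, 6):  # includes the -1 default discussed above
    effective_min_count = max(min_count, 1)  # hoisted out of the inner loop
    for nobs in range(0, 8):  # observation counts are never negative
        before = is_missing_original(nobs, min_count)
        after = is_missing_hoisted(nobs, effective_min_count)
        if before != after:
            mismatches.append((min_count, nobs))
```

When `min_count` is at least 1, `nobs == 0` already implies `nobs < min_count`; when it is 0 or negative, `nobs < 1` is the same test as `nobs == 0`, so no mismatches are expected.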

jreback · 2020-11-25T23:38:57Z

lgtm. ping on green (resolved conflict)

phofl · 2020-11-26T12:55:10Z

@jreback greenish. Failure is unrelated

jreback · 2020-11-26T15:36:29Z

thanks @phofl

phofl added 3 commits November 14, 2020 19:54

Add min count to groupby functions

4bfe322

Add test

0ccda51

Add tests and whatsnew

1aff629

phofl added Groupby Resample resample method labels Nov 15, 2020

Roll tests back with new behavior

f5f333d

Fix default value

b826413

rhshadrach requested changes Nov 22, 2020

pandas/tests/groupby/aggregate/test_other.py (outdated, resolved)

pandas/_libs/groupby.pyx (outdated, resolved)

doc/source/whatsnew/v1.2.0.rst (resolved)

Move and improve test

9111fc9

rhshadrach approved these changes Nov 22, 2020

jbrockmendel reviewed Nov 23, 2020

Improve if condition

8654aaa

jreback added this to the 1.2 milestone Nov 25, 2020

Merge branch 'master' into groupby_min_count

0020e63

jreback merged commit 8b0af58 into pandas-dev:master Nov 26, 2020

phofl deleted the groupby_min_count branch November 26, 2020 16:03

rhshadrach mentioned this pull request Mar 31, 2021

Groupby count doesn't accept sort= keyword #28755

Closed
