-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
ENH: Add support for min_count keyword for Resample and Groupby functions #37870
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for the PR @phofl. It seems to me that first, last, min, and max should all return nan even when |
Do you perhaps mean -1? Otherwise, I don't follow. |
My 2 cents: min_count=0 is the same as not specifying a min_count. Min_count of x says you need to have at least x non-nan values in the bin to show a result, otherwise show nan. If you have a min_count of 0 isnt that like saying show me the sum, first etc regardless how many nans are there? Maybe I am missing something? But while it is important to decide what min_count=0 should show it is less important for the implementation, you would not specify that imo, because as mentioned it is like not specifying it. I am ok with any decision. |
cc @rhshadrach When we want to return nan in this case, we can keep -1 as default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks really good, just some minor comments on tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/_libs/groupby.pyx
Outdated
@@ -1411,7 +1407,7 @@ def group_min(groupby_t[:, :] out, | |||
|
|||
for i in range(ncounts): | |||
for j in range(K): | |||
if nobs[i, j] == 0: | |||
if nobs[i, j] < min_count or nobs[i, j] == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you avoid the double lookup/comparison here by setting min_count = max(min_count, 1)
outside the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
lgtm. ping on green (resolved conflict) |
@jreback greenish. Failure is unrelated |
thanks @phofl |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This would add support for the documented keyword ```min_count``. If we decide to move forward with this, we have to decide what to do in the following case:
first
andlast
return0.0
which is fine I think and consistent withsum
, butmin
returnsinf
whilemax
returns-inf
.Default parameter is set to
1
to avoid backwards incompatible changes. We should keep this at least until 2.0If we decide to not implement
min_count
here, we should adjust the docs and the function signatures for the groupby functions.cc @rhshadrach