-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: groupby rank computing incorrect percentiles #40575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: groupby rank computing incorrect percentiles #40575
Conversation
pandas/_libs/algos.pyx
Outdated
@@ -983,6 +989,9 @@ def rank_1d( | |||
else: | |||
mask = np.zeros(shape=len(masked_vals), dtype=np.uint8) | |||
|
|||
# If ascending and na_option == 'bottom' or descending and | |||
# na_option == 'top' -> we want to rank NaN as the highest | |||
# so fill with the maximum value for the type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just want to highlight #40016 in case relevant here for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for linking that, I was confused by the documentation as well haha. Will adjust this comment in #40546
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good - one comment on type selection
pandas/_libs/algos.pyx
Outdated
@@ -1001,6 +1001,7 @@ def rank_1d( | |||
ndarray[uint8_t, ndim=1] mask | |||
bint keep_na, at_end, next_val_diff, check_labels, group_changed | |||
rank_t nan_fill_val | |||
float64_t grp_size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this not be an integer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did this because grp_sizes
was defined as an array of float64_t
(I'd guess to avoid worrying about casting to float when dividing by grp_sizes
at the end). Can change both to int if you think that would make more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah makes sense. That feels like a mistake so yea if you can change the array and nothing breaks lets do that. If it causes an issue can do in a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have changed in latest commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm
pandas/_libs/algos.pyx
Outdated
@@ -998,12 +998,13 @@ def rank_1d( | |||
TiebreakEnumType tiebreak | |||
Py_ssize_t i, j, N, grp_start=0, dups=0, sum_ranks=0 | |||
Py_ssize_t grp_vals_seen=1, grp_na_count=0 | |||
ndarray[int64_t, ndim=1] lexsort_indexer | |||
ndarray[float64_t, ndim=1] grp_sizes, out | |||
ndarray[int64_t, ndim=1] lexsort_indexer, grp_sizes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lexsort_indexer[i] is being passed to ndarray.__getitem__
-> it shouldbe intp_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch thanks
lgtm. can you merge master and ping on greenish |
@jreback greenish |
thanks @mzeitlin11 |
Built on #40546, is a relatively minor change after that.