Skip to content

BUG: groupby rank computing incorrect percentiles #40575

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Apr 1, 2021

Conversation

mzeitlin11
Copy link
Member

Built on #40546, is a relatively minor change after that.

@mzeitlin11 mzeitlin11 added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby labels Mar 23, 2021
@@ -983,6 +989,9 @@ def rank_1d(
else:
mask = np.zeros(shape=len(masked_vals), dtype=np.uint8)

# If ascending and na_option == 'bottom' or descending and
# na_option == 'top' -> we want to rank NaN as the highest
# so fill with the maximum value for the type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to highlight #40016 in case relevant here for consistency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for linking that, I was confused by the documentation as well haha. Will adjust this comment in #40546

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good - one comment on type selection

@@ -1001,6 +1001,7 @@ def rank_1d(
ndarray[uint8_t, ndim=1] mask
bint keep_na, at_end, next_val_diff, check_labels, group_changed
rank_t nan_fill_val
float64_t grp_size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be an integer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this because grp_sizes was defined as an array of float64_t (I'd guess to avoid worrying about casting to float when dividing by grp_sizes at the end). Can change both to int if you think that would make more sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense. That feels like a mistake so yea if you can change the array and nothing breaks lets do that. If it causes an issue can do in a follow up

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have changed in latest commit

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@WillAyd WillAyd added this to the 1.3 milestone Mar 27, 2021
@@ -998,12 +998,13 @@ def rank_1d(
TiebreakEnumType tiebreak
Py_ssize_t i, j, N, grp_start=0, dups=0, sum_ranks=0
Py_ssize_t grp_vals_seen=1, grp_na_count=0
ndarray[int64_t, ndim=1] lexsort_indexer
ndarray[float64_t, ndim=1] grp_sizes, out
ndarray[int64_t, ndim=1] lexsort_indexer, grp_sizes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lexsort_indexer[i] is being passed to ndarray.__getitem__ -> it shouldbe intp_t

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch thanks

@jreback
Copy link
Contributor

jreback commented Mar 29, 2021

lgtm. can you merge master and ping on greenish

@mzeitlin11
Copy link
Member Author

lgtm. can you merge master and ping on greenish

@jreback greenish

@jreback jreback merged commit 67bc077 into pandas-dev:master Apr 1, 2021
@jreback
Copy link
Contributor

jreback commented Apr 1, 2021

thanks @mzeitlin11

@mzeitlin11 mzeitlin11 deleted the bug/rank_pct_nans_groupby branch April 1, 2021 16:16
vladu pushed a commit to vladu/pandas that referenced this pull request Apr 5, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Unexpected behaviour of groupby + rank(pct=True, method="dense")
5 participants