BUG: groupby rank computing incorrect percentiles #40575

mzeitlin11 · 2021-03-22T22:23:19Z

closes BUG: Unexpected behaviour of groupby + rank(pct=True, method="dense") #40518
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

Built on #40546; this is a relatively minor change on top of that.
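For readers who want a concrete picture of what a per-group dense percentile rank should look like, here is a minimal pure-Python sketch. It is not the Cython code touched by this PR. The helper name `grouped_dense_pct_rank` is hypothetical, and the definition used (dense rank divided by the number of distinct non-missing values in the group, with NaN left as NaN) is an assumption made for illustration.

```python
import math
from collections import defaultdict


def grouped_dense_pct_rank(keys, values):
    """Dense percentile rank of each value within its group (NaN stays NaN)."""
    # Collect the distinct non-missing values seen in each group.
    distinct = defaultdict(set)
    for key, val in zip(keys, values):
        if not math.isnan(val):
            distinct[key].add(val)

    # Map each distinct value to its 1-based dense rank inside its group.
    dense_rank = {
        key: {val: pos for pos, val in enumerate(sorted(vals), start=1)}
        for key, vals in distinct.items()
    }

    out = []
    for key, val in zip(keys, values):
        if math.isnan(val):
            out.append(math.nan)
        else:
            # Assumed definition: dense rank / distinct values in the group.
            out.append(dense_rank[key][val] / len(distinct[key]))
    return out


# Two groups that share the value 2.0, plus one missing value in group "b".
keys = ["a", "a", "a", "b", "b", "b"]
values = [1.0, 2.0, 2.0, 2.0, 3.0, math.nan]
result = grouped_dense_pct_rank(keys, values)
print(result)
```

Each group is ranked independently here, so the shared value 2.0 gets a different percentile in group "a" than in group "b".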

attack68 · 2021-03-23T06:14:17Z

pandas/_libs/algos.pyx

@@ -983,6 +989,9 @@ def rank_1d(
    else:
        mask = np.zeros(shape=len(masked_vals), dtype=np.uint8)

+    # If ascending and na_option == 'bottom' or descending and
+    # na_option == 'top' -> we want to rank NaN as the highest
+    # so fill with the maximum value for the type


Just want to highlight #40016 in case it is relevant here for consistency.

mzeitlin11 · 2021-03-23

Thanks for linking that, I was confused by the documentation as well, haha. Will adjust this comment in #40546.
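The condition described in the diff comment above can be sketched in plain Python. This is an illustration only: `nan_fill_value` is a hypothetical helper, floats stand in for the generic `rank_t` type, and using the minimum value in the remaining cases is an assumption that the quoted comment does not spell out.

```python
import math


def nan_fill_value(ascending: bool, na_option: str) -> float:
    """Pick a stand-in for NaN so that it sorts where na_option asks."""
    if na_option not in ("top", "bottom"):
        raise ValueError("this sketch only covers na_option='top' or 'bottom'")
    # Per the diff comment: ascending + 'bottom' or descending + 'top'
    # means NaN should be ranked as the highest value.
    rank_nan_highest = (ascending and na_option == "bottom") or (
        not ascending and na_option == "top"
    )
    # For floats, +inf plays the role of "the maximum value for the type".
    # Falling back to -inf otherwise is an assumption of this sketch.
    return math.inf if rank_nan_highest else -math.inf


for ascending in (True, False):
    for na_option in ("top", "bottom"):
        print(ascending, na_option, nan_fill_value(ascending, na_option))
```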

WillAyd

Looks pretty good - one comment on type selection

WillAyd · 2021-03-26T20:31:41Z

pandas/_libs/algos.pyx

@@ -1001,6 +1001,7 @@ def rank_1d(
        ndarray[uint8_t, ndim=1] mask
        bint keep_na, at_end, next_val_diff, check_labels, group_changed
        rank_t nan_fill_val
+        float64_t grp_size


Should this not be an integer?

mzeitlin11 · 2021-03-26

Did this because grp_sizes was defined as an array of float64_t (I'd guess to avoid worrying about casting to float when dividing by grp_sizes at the end). Can change both to int if you think that would make more sense.

WillAyd · 2021-03-26

Ah, makes sense. That feels like a mistake, so yeah, if you can change the array and nothing breaks, let's do that. If it causes an issue, we can do it in a follow-up.

mzeitlin11 · 2021-03-27

Have changed in the latest commit.
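As a rough illustration of the trade-off in this thread: a group size is naturally an integer count, and the conversion to float only matters at the moment of division. The sketch below is plain Python rather than the Cython in the diff, and `pct_from_counts` is a hypothetical name. Python's `//` is used to mimic truncating integer division.

```python
def pct_from_counts(rank: int, grp_size: int) -> float:
    """Turn an integer rank and an integer group size into a percentile."""
    # Convert at the point of division; the stored count stays an integer.
    return rank / float(grp_size)


# Truncating integer division loses the fraction entirely.
truncated = 1 // 3
# Casting before dividing keeps it.
exact = pct_from_counts(1, 3)

print(truncated, exact)
```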

WillAyd

Lgtm

jbrockmendel · 2021-03-27T18:18:33Z

pandas/_libs/algos.pyx

@@ -998,12 +998,13 @@ def rank_1d(
        TiebreakEnumType tiebreak
        Py_ssize_t i, j, N, grp_start=0, dups=0, sum_ranks=0
        Py_ssize_t grp_vals_seen=1, grp_na_count=0
-        ndarray[int64_t, ndim=1] lexsort_indexer
-        ndarray[float64_t, ndim=1] grp_sizes, out
+        ndarray[int64_t, ndim=1] lexsort_indexer, grp_sizes


lexsort_indexer[i] is being passed to ndarray.__getitem__, so it should be intp_t

mzeitlin11 · 2021-03-27

Good catch, thanks.
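Some background on why the two types can differ, as far as I understand the suggestion: NumPy indexes arrays with a pointer-sized signed integer (`intp`), which has the same width as `int64` on 64-bit builds but not on 32-bit ones. The stdlib-only sketch below just compares the two widths on the current platform; it does not touch NumPy or the pandas code.

```python
import struct
import sys

# "n" is the native ssize_t format code: a pointer-sized signed integer.
index_bytes = struct.calcsize("n")
# "q" is a C long long, which is 8 bytes on mainstream platforms.
int64_bytes = struct.calcsize("q")

print(f"pointer-sized index: {index_bytes} bytes, int64: {int64_bytes} bytes")
print("same width on this platform:", index_bytes == int64_bytes)
```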

jreback · 2021-03-29T14:58:05Z

lgtm. can you merge master and ping on greenish

mzeitlin11 · 2021-03-29T18:46:02Z

@jreback greenish

jreback · 2021-04-01T15:54:09Z

thanks @mzeitlin11

mzeitlin11 added 20 commits March 20, 2021 18:22

CLN: rank_1d followup

2364330

WIP

999d880

WIP

d360871

WIP

0aaeee7

Add comments, whitespace

fe6495a

Simplify conditional

8fae616

Merge remote-tracking branch 'origin/master' into cln/rank_1d

86e736b

Remove unused var

f9479e3

Avoid compiler warning

a2bea3d

Merge remote-tracking branch 'origin/master' into cln/rank_1d

ca45958

BUG: groupby rank percentile

a6fd0a5

Add tests

8a7b535

WIP

0d55593

Merge remote-tracking branch 'origin/master' into cln/rank_1d

7108e05

Simplify changes

ba5dc7c

Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_groupby

680a888

Merge branch 'cln/rank_1d' into bug/rank_pct_nans_groupby

49127f3

precommit fixup

f6a04b7

Merge branch 'cln/rank_1d' into bug/rank_pct_nans_groupby

e486ca7

Add whatsnew

268abd0

mzeitlin11 added the Algos, Bug, and Groupby labels Mar 23, 2021

attack68 reviewed Mar 23, 2021

Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_groupby

aecab5f

WillAyd requested changes Mar 26, 2021

mzeitlin11 added 2 commits March 26, 2021 20:35

Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_groupby

09860e3

Use float instead of int for grp_size

8eb56cf

WillAyd approved these changes Mar 27, 2021

WillAyd added this to the 1.3 milestone Mar 27, 2021

jbrockmendel reviewed Mar 27, 2021

Use intp for lexsort indexer

57e6697

Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_groupby

f925124

Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_groupby

8c59323

jreback merged commit 67bc077 into pandas-dev:master Apr 1, 2021

mzeitlin11 deleted the bug/rank_pct_nans_groupby branch April 1, 2021 16:16

vladu pushed a commit to vladu/pandas that referenced this pull request Apr 5, 2021

BUG: groupby rank computing incorrect percentiles (pandas-dev#40575)

8587695

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

BUG: groupby rank computing incorrect percentiles (pandas-dev#40575)

3ac1f3b
