Skip to content

BUG: groupby rank computing incorrect percentiles #40575

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Apr 1, 2021
Merged
Show file tree
Hide file tree
Changes from 20 commits
Commits
Show all changes
26 commits
Select commit Hold shift + click to select a range
2364330
CLN: rank_1d followup
mzeitlin11 Mar 20, 2021
999d880
WIP
mzeitlin11 Mar 20, 2021
d360871
WIP
mzeitlin11 Mar 20, 2021
0aaeee7
WIP
mzeitlin11 Mar 20, 2021
fe6495a
Add comments, whitespace
mzeitlin11 Mar 20, 2021
8fae616
Simplify conditional
mzeitlin11 Mar 21, 2021
86e736b
Merge remote-tracking branch 'origin/master' into cln/rank_1d
mzeitlin11 Mar 21, 2021
f9479e3
Remove unused var
mzeitlin11 Mar 21, 2021
a2bea3d
Avoid compiler warning
mzeitlin11 Mar 21, 2021
ca45958
Merge remote-tracking branch 'origin/master' into cln/rank_1d
mzeitlin11 Mar 21, 2021
a6fd0a5
BUG: groupby rank percentile
mzeitlin11 Mar 22, 2021
8a7b535
Add tests
mzeitlin11 Mar 22, 2021
0d55593
WIP
mzeitlin11 Mar 22, 2021
7108e05
Merge remote-tracking branch 'origin/master' into cln/rank_1d
mzeitlin11 Mar 22, 2021
ba5dc7c
Simplify changes
mzeitlin11 Mar 22, 2021
680a888
Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_g…
mzeitlin11 Mar 22, 2021
49127f3
Merge branch 'cln/rank_1d' into bug/rank_pct_nans_groupby
mzeitlin11 Mar 22, 2021
f6a04b7
precommit fixup
mzeitlin11 Mar 22, 2021
e486ca7
Merge branch 'cln/rank_1d' into bug/rank_pct_nans_groupby
mzeitlin11 Mar 22, 2021
268abd0
Add whatsnew
mzeitlin11 Mar 22, 2021
aecab5f
Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_g…
mzeitlin11 Mar 23, 2021
09860e3
Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_g…
mzeitlin11 Mar 27, 2021
8eb56cf
Use float instead of int for grp_size
mzeitlin11 Mar 27, 2021
57e6697
Use intp for lexsort indexer
mzeitlin11 Mar 27, 2021
f925124
Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_g…
mzeitlin11 Mar 29, 2021
8c59323
Merge remote-tracking branch 'origin/master' into bug/rank_pct_nans_g…
mzeitlin11 Mar 31, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.3.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -502,6 +502,7 @@ Numeric
- Bug in :func:`select_dtypes` different behavior between Windows and Linux with ``include="int"`` (:issue:`36569`)
- Bug in :meth:`DataFrame.apply` and :meth:`DataFrame.agg` when passed argument ``func="size"`` would operate on the entire ``DataFrame`` instead of rows or columns (:issue:`39934`)
- Bug in :meth:`DataFrame.transform` would raise ``SpecificationError`` when passed a dictionary and columns were missing; will now raise a ``KeyError`` instead (:issue:`40004`)
- Bug in :meth:`DataFrameGroupBy.rank` giving incorrect results with ``pct=True`` and equal values between consecutive groups (:issue:`40518`)
-

Conversion
Expand Down
188 changes: 104 additions & 84 deletions pandas/_libs/algos.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -946,22 +946,28 @@ def rank_1d(
cdef:
TiebreakEnumType tiebreak
Py_ssize_t i, j, N, grp_start=0, dups=0, sum_ranks=0
Py_ssize_t grp_vals_seen=1, grp_na_count=0, grp_tie_count=0
Py_ssize_t grp_vals_seen=1, grp_na_count=0
ndarray[int64_t, ndim=1] lexsort_indexer
ndarray[float64_t, ndim=1] grp_sizes, out
ndarray[rank_t, ndim=1] masked_vals
ndarray[uint8_t, ndim=1] mask
bint keep_na, at_end, next_val_diff, check_labels
bint keep_na, at_end, next_val_diff, check_labels, group_changed
rank_t nan_fill_val
float64_t grp_size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not be an integer?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did this because grp_sizes was defined as an array of float64_t (I'd guess to avoid worrying about casting to float when dividing by grp_sizes at the end). Can change both to int if you think that would make more sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah makes sense. That feels like a mistake so yea if you can change the array and nothing breaks lets do that. If it causes an issue can do in a follow up

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have changed in latest commit


tiebreak = tiebreakers[ties_method]
if tiebreak == TIEBREAK_FIRST:
if not ascending:
tiebreak = TIEBREAK_FIRST_DESCENDING

keep_na = na_option == 'keep'

N = len(values)
# TODO Cython 3.0: cast won't be necessary (#2992)
assert <Py_ssize_t>len(labels) == N
out = np.empty(N)
grp_sizes = np.ones(N)

# If all 0 labels, can short-circuit later label
# comparisons
check_labels = np.any(labels)
Expand All @@ -983,6 +989,9 @@ def rank_1d(
else:
mask = np.zeros(shape=len(masked_vals), dtype=np.uint8)

# If ascending and na_option == 'bottom' or descending and
# na_option == 'top' -> we want to rank NaN as the highest
# so fill with the maximum value for the type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just want to highlight #40016 in case relevant here for consistency

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for linking that, I was confused by the documentation as well haha. Will adjust this comment in #40546

if ascending ^ (na_option == 'top'):
if rank_t is object:
nan_fill_val = Infinity()
Expand All @@ -993,6 +1002,8 @@ def rank_1d(
else:
nan_fill_val = np.inf
order = (masked_vals, mask, labels)

# Otherwise, fill with the lowest value of the type
else:
if rank_t is object:
nan_fill_val = NegInfinity()
Expand Down Expand Up @@ -1025,36 +1036,36 @@ def rank_1d(
if rank_t is object:
for i in range(N):
at_end = i == N - 1

# dups and sum_ranks will be incremented each loop where
# the value / group remains the same, and should be reset
# when either of those change
# Used to calculate tiebreakers
# when either of those change. Used to calculate tiebreakers
dups += 1
sum_ranks += i - grp_start + 1

next_val_diff = at_end or (masked_vals[lexsort_indexer[i]] !=
masked_vals[lexsort_indexer[i+1]])

# We'll need this check later anyway to determine group size, so just
# compute it here since shortcircuiting won't help
group_changed = at_end or (check_labels and
(labels[lexsort_indexer[i]]
!= labels[lexsort_indexer[i+1]]))

# Update out only when there is a transition of values or labels.
# When a new value or group is encountered, go back #dups steps(
# the number of occurrence of current value) and assign the ranks
# based on the starting index of the current group (grp_start)
# and the current index
if not at_end:
next_val_diff = are_diff(masked_vals[lexsort_indexer[i]],
masked_vals[lexsort_indexer[i+1]])
else:
next_val_diff = True

if (next_val_diff
or (mask[lexsort_indexer[i]] ^ mask[lexsort_indexer[i+1]])
or (check_labels
and (labels[lexsort_indexer[i]]
!= labels[lexsort_indexer[i+1]]))
):
# if keep_na, check for missing values and assign back
if (next_val_diff or group_changed
or (mask[lexsort_indexer[i]] ^ mask[lexsort_indexer[i+1]])):

# If keep_na, check for missing values and assign back
# to the result where appropriate
if keep_na and mask[lexsort_indexer[i]]:
grp_na_count = dups
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = NaN
grp_na_count = dups
elif tiebreak == TIEBREAK_AVERAGE:
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = sum_ranks / <float64_t>dups
Expand All @@ -1066,82 +1077,86 @@ def rank_1d(
out[lexsort_indexer[j]] = i - grp_start + 1
elif tiebreak == TIEBREAK_FIRST:
for j in range(i - dups + 1, i + 1):
if ascending:
out[lexsort_indexer[j]] = j + 1 - grp_start
else:
out[lexsort_indexer[j]] = 2 * i - j - dups + 2 - grp_start
out[lexsort_indexer[j]] = j + 1 - grp_start
elif tiebreak == TIEBREAK_FIRST_DESCENDING:
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = 2 * i - j - dups + 2 - grp_start
elif tiebreak == TIEBREAK_DENSE:
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = grp_vals_seen

# look forward to the next value (using the sorting in _as)
# if the value does not equal the current value then we need to
# reset the dups and sum_ranks, knowing that a new value is
# coming up. the conditional also needs to handle nan equality
# and the end of iteration
if next_val_diff or (mask[lexsort_indexer[i]]
^ mask[lexsort_indexer[i+1]]):
# Look forward to the next value (using the sorting in
# lexsort_indexer). If the value does not equal the current
# value then we need to reset the dups and sum_ranks, knowing
# that a new value is coming up. The conditional also needs
# to handle nan equality and the end of iteration. If group
# changes we do not record seeing a new value in the group
if not group_changed and (next_val_diff or
(mask[lexsort_indexer[i]]
^ mask[lexsort_indexer[i+1]])):
dups = sum_ranks = 0
grp_vals_seen += 1
grp_tie_count += 1

# Similar to the previous conditional, check now if we are
# moving to a new group. If so, keep track of the index where
# the new group occurs, so the tiebreaker calculations can
# decrement that from their position. fill in the size of each
# group encountered (used by pct calculations later). also be
# decrement that from their position. Fill in the size of each
# group encountered (used by pct calculations later). Also be
# sure to reset any of the items helping to calculate dups
if (at_end or
(check_labels
and (labels[lexsort_indexer[i]]
!= labels[lexsort_indexer[i+1]]))):
if group_changed:

# If not dense tiebreak, group size used to compute
# percentile will be # of non-null elements in group
if tiebreak != TIEBREAK_DENSE:
for j in range(grp_start, i + 1):
grp_sizes[lexsort_indexer[j]] = \
(i - grp_start + 1 - grp_na_count)
grp_size = i - grp_start + 1 - grp_na_count

# Otherwise, it will be the number of distinct values
# in the group, subtracting 1 if NaNs are present
# since that is a distinct value we shouldn't count
else:
for j in range(grp_start, i + 1):
grp_sizes[lexsort_indexer[j]] = \
(grp_tie_count - (grp_na_count > 0))
grp_size = grp_vals_seen - (grp_na_count > 0)

for j in range(grp_start, i + 1):
grp_sizes[lexsort_indexer[j]] = grp_size

dups = sum_ranks = 0
grp_na_count = 0
grp_tie_count = 0
grp_start = i + 1
grp_vals_seen = 1
else:
with nogil:
for i in range(N):
at_end = i == N - 1

# dups and sum_ranks will be incremented each loop where
# the value / group remains the same, and should be reset
# when either of those change
# Used to calculate tiebreakers
# when either of those change. Used to calculate tiebreakers
dups += 1
sum_ranks += i - grp_start + 1

next_val_diff = at_end or (masked_vals[lexsort_indexer[i]] !=
masked_vals[lexsort_indexer[i+1]])

# We'll need this check later anyway to determine group size, so just
# compute it here since shortcircuiting won't help
group_changed = at_end or (check_labels and
(labels[lexsort_indexer[i]]
!= labels[lexsort_indexer[i+1]]))

# Update out only when there is a transition of values or labels.
# When a new value or group is encountered, go back #dups steps(
# the number of occurrence of current value) and assign the ranks
# based on the starting index of the current group (grp_start)
# and the current index
if not at_end:
next_val_diff = (masked_vals[lexsort_indexer[i]]
!= masked_vals[lexsort_indexer[i+1]])
else:
next_val_diff = True

if (next_val_diff
or (mask[lexsort_indexer[i]] ^ mask[lexsort_indexer[i+1]])
or (check_labels
and (labels[lexsort_indexer[i]]
!= labels[lexsort_indexer[i+1]]))
):
# if keep_na, check for missing values and assign back
if (next_val_diff or group_changed
or (mask[lexsort_indexer[i]] ^ mask[lexsort_indexer[i+1]])):

# If keep_na, check for missing values and assign back
# to the result where appropriate
if keep_na and mask[lexsort_indexer[i]]:
grp_na_count = dups
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = NaN
grp_na_count = dups
elif tiebreak == TIEBREAK_AVERAGE:
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = sum_ranks / <float64_t>dups
Expand All @@ -1153,46 +1168,51 @@ def rank_1d(
out[lexsort_indexer[j]] = i - grp_start + 1
elif tiebreak == TIEBREAK_FIRST:
for j in range(i - dups + 1, i + 1):
if ascending:
out[lexsort_indexer[j]] = j + 1 - grp_start
else:
out[lexsort_indexer[j]] = \
(2 * i - j - dups + 2 - grp_start)
out[lexsort_indexer[j]] = j + 1 - grp_start
elif tiebreak == TIEBREAK_FIRST_DESCENDING:
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = 2 * i - j - dups + 2 - grp_start
elif tiebreak == TIEBREAK_DENSE:
for j in range(i - dups + 1, i + 1):
out[lexsort_indexer[j]] = grp_vals_seen

# look forward to the next value (using the sorting in
# lexsort_indexer) if the value does not equal the current
# value then we need to reset the dups and sum_ranks,
# knowing that a new value is coming up. the conditional
# also needs to handle nan equality and the end of iteration
if next_val_diff or (mask[lexsort_indexer[i]]
^ mask[lexsort_indexer[i+1]]):
# Look forward to the next value (using the sorting in
# lexsort_indexer). If the value does not equal the current
# value then we need to reset the dups and sum_ranks, knowing
# that a new value is coming up. The conditional also needs
# to handle nan equality and the end of iteration. If group
# changes we do not record seeing a new value in the group
if not group_changed and (next_val_diff or
(mask[lexsort_indexer[i]]
^ mask[lexsort_indexer[i+1]])):

dups = sum_ranks = 0
grp_vals_seen += 1
grp_tie_count += 1

# Similar to the previous conditional, check now if we are
# moving to a new group. If so, keep track of the index where
# the new group occurs, so the tiebreaker calculations can
# decrement that from their position. fill in the size of each
# group encountered (used by pct calculations later). also be
# decrement that from their position. Fill in the size of each
# group encountered (used by pct calculations later). Also be
# sure to reset any of the items helping to calculate dups
if at_end or (check_labels and
(labels[lexsort_indexer[i]]
!= labels[lexsort_indexer[i+1]])):
if group_changed:

# If not dense tiebreak, group size used to compute
# percentile will be # of non-null elements in group
if tiebreak != TIEBREAK_DENSE:
for j in range(grp_start, i + 1):
grp_sizes[lexsort_indexer[j]] = \
(i - grp_start + 1 - grp_na_count)
grp_size = i - grp_start + 1 - grp_na_count

# Otherwise, it will be the number of distinct values
# in the group, subtracting 1 if NaNs are present
# since that is a distinct value we shouldn't count
else:
for j in range(grp_start, i + 1):
grp_sizes[lexsort_indexer[j]] = \
(grp_tie_count - (grp_na_count > 0))
grp_size = grp_vals_seen - (grp_na_count > 0)

for j in range(grp_start, i + 1):
grp_sizes[lexsort_indexer[j]] = grp_size

dups = sum_ranks = 0
grp_na_count = 0
grp_tie_count = 0
grp_start = i + 1
grp_vals_seen = 1

Expand Down
25 changes: 25 additions & 0 deletions pandas/tests/groupby/test_rank.py
Original file line number Diff line number Diff line change
Expand Up @@ -517,3 +517,28 @@ def test_rank_zero_div(input_key, input_value, output_value):
result = df.groupby("A").rank(method="dense", pct=True)
expected = DataFrame({"B": output_value})
tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize("use_nan", [True, False])
def test_rank_pct_equal_values_on_group_transition(use_nan):
# GH#40518
fill_value = np.nan if use_nan else 3
df = DataFrame(
[
[-1, 1],
[-1, 2],
[1, fill_value],
[-1, fill_value],
],
columns=["group", "val"],
)
result = df.groupby(["group"])["val"].rank(
method="dense",
pct=True,
)
if use_nan:
expected = Series([0.5, 1, np.nan, np.nan], name="val")
else:
expected = Series([1 / 3, 2 / 3, 1, 1], name="val")

tm.assert_series_equal(result, expected)