BUG: DataFrame.rank with np.inf and np.nan #38681


Merged (14 commits, Dec 30, 2020)
3 changes: 3 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -201,6 +201,9 @@ Numeric
^^^^^^^
- Bug in :meth:`DataFrame.quantile`, :meth:`DataFrame.sort_values` causing incorrect subsequent indexing behavior (:issue:`38351`)
- Bug in :meth:`DataFrame.select_dtypes` with ``include=np.number`` now retains numeric ``ExtensionDtype`` columns (:issue:`35340`)
+- Bug in :meth:`DataFrame.rank` with ``np.inf`` and a mixture of ``np.nan`` and ``np.inf`` (:issue:`32593`)
-
-

Conversion
^^^^^^^^^^
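For context, the underlying issue (GH#32593) concerns ranking when ``np.inf`` appears alongside ``np.nan``. A minimal sketch of the intended behavior after this fix (the example values are illustrative, not from the PR):

```python
import numpy as np
import pandas as pd

# A column mixing infinities, a finite value, and a missing value.
df = pd.DataFrame({"a": [np.inf, 1.0, np.nan, -np.inf]})

# With the fix, -inf ranks lowest, inf ranks highest, and NaN stays
# NaN under the default na_option="keep".
print(df.rank())
#      a
# 0  3.0
# 1  2.0
# 2  NaN
# 3  1.0
```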
44 changes: 19 additions & 25 deletions pandas/_libs/algos.pyx
@@ -1067,6 +1067,8 @@ def rank_2d(
        mask = values == NPY_NAT

        np.putmask(values, mask, nan_value)
+    else:
+        mask = np.zeros_like(values, dtype=bool)

n, k = (<object>values).shape
ranks = np.empty((n, k), dtype='f8')
@@ -1099,43 +1101,35 @@
argsorted = _as.astype('i8')

    for i in range(n):
-        if rank_t is object:
-            dups = sum_ranks = infs = 0
-        else:
-            dups = sum_ranks = 0
+        dups = sum_ranks = infs = 0

        total_tie_count = 0
        count = 0.0
        for j in range(k):
-            if rank_t is not object:
-                sum_ranks += j + 1
-                dups += 1

            val = values[i, j]

-            if rank_t is not uint64_t:
Member:
i expect we'll see a perf hit on non-float dtypes because of mask[...] lookups. is that effect small?

Member Author (@mzeitlin11, Dec 30, 2020):
Thanks for bringing this up - a good lesson for me to always check perf. It turns out mask wasn't typed, so perf got much worse. With mask typed, it looks like a small effect (not negligible, though hard to tell given the variation between successive runs). Using the added benchmark from #38744:

       before           after         ratio
     [e85d0782]       [36884e67]
     <master>         <bug/2d_rank_inf>
       6.66±0.2ms       6.61±0.2ms     0.99  frame_methods.Rank.time_rank('float')
       2.24±0.1ms      2.58±0.09ms    ~1.15  frame_methods.Rank.time_rank('int')
       46.6±0.5ms         46.6±1ms     1.00  frame_methods.Rank.time_rank('object')
      2.20±0.08ms      2.20±0.06ms     1.00  frame_methods.Rank.time_rank('uint')

On a larger (probably more stable) example:

             before           after         ratio
     [e85d0782]       [36884e67]
     <master>         <bug/2d_rank_inf>
         975±20ms         972±30ms     1.00  frame_methods.Rank.time_rank('float')
+        329±10ms         378±10ms     1.15  frame_methods.Rank.time_rank('int')
       12.8±0.01s        13.4±0.3s     1.05  frame_methods.Rank.time_rank('object')
          318±6ms         343±30ms     1.08  frame_methods.Rank.time_rank('uint')

(with a second run looking better):

         990±20ms         984±10ms     0.99  frame_methods.Rank.time_rank('float')
          345±8ms          368±4ms     1.07  frame_methods.Rank.time_rank('int')
        12.9±0.2s        13.0±0.2s     1.01  frame_methods.Rank.time_rank('object')
         356±10ms          339±4ms     0.95  frame_methods.Rank.time_rank('uint')

Member Author:
Also changed the if conditions to short-circuit, to avoid some mask lookups.
-                if rank_t is object:
-                    skip_condition = (val is nan_value) and keep_na
-                else:
-                    skip_condition = (val == nan_value) and keep_na
-                if skip_condition:
-                    ranks[i, argsorted[i, j]] = NaN
-                    if rank_t is object:
-                        infs += 1
-                    continue
+            if mask[i, argsorted[i, j]] and keep_na:
+                ranks[i, argsorted[i, j]] = NaN
Member:
could define idx = argsorted[i, j] on L1110 to avoid doing the lookup twice

Member Author:
Thanks

+                infs += 1
+                continue

            count += 1.0

-            if rank_t is object:
-                sum_ranks += (j - infs) + 1
-                dups += 1
+            sum_ranks += (j - infs) + 1
+            dups += 1

            if rank_t is object:
-                condition = j == k - 1 or are_diff(values[i, j + 1], val)
+                condition = (
+                    j == k - 1 or
+                    are_diff(values[i, j + 1], val) or
+                    (mask[i, argsorted[i, j + 1]] and keep_na)
+                )
            else:
-                condition = j == k - 1 or values[i, j + 1] != val
+                condition = (
+                    j == k - 1 or
+                    values[i, j + 1] != val or
+                    (mask[i, argsorted[i, j + 1]] and keep_na)
+                )

if condition:
if tiebreak == TIEBREAK_AVERAGE:
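The patched loop can be sketched in plain Python for a single row. The function below is a hypothetical simplification (name and signature are mine, not pandas'): it mirrors the mask/`infs` bookkeeping for average tie-breaking under `na_option="keep"`, assuming missing values sort to the end as they do after the `nan_value` substitution:

```python
import numpy as np

def rank_row_average(values, mask, keep_na=True):
    """Rank one row with average tie-breaking, mirroring the patched
    rank_2d walk: masked entries get NaN (na_option="keep") and the
    `infs` counter offsets the positional ranks of later values."""
    k = len(values)
    order = np.argsort(values, kind="mergesort")  # NaNs sort to the end
    srt = values[order]
    ranks = np.empty(k, dtype="f8")
    dups = sum_ranks = infs = 0
    for j in range(k):
        idx = order[j]
        if mask[idx] and keep_na:
            ranks[idx] = np.nan  # missing stays missing
            infs += 1            # skipped slots don't consume ranks
            continue
        sum_ranks += (j - infs) + 1
        dups += 1
        # Close the tie group at the end of the row, on a value change,
        # or when the next sorted entry is masked.
        if j == k - 1 or srt[j + 1] != srt[j] or (mask[order[j + 1]] and keep_na):
            for z in range(j - dups + 1, j + 1):
                ranks[order[z]] = sum_ranks / dups
            dups = sum_ranks = 0
    return ranks

vals = np.array([1.0, 3.0, 1.0, np.nan, 2.0])
# ranks: [1.5, 4.0, 1.5, nan, 3.0] -- matches pd.Series(vals).rank()
print(rank_row_average(vals, np.isnan(vals)))
```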
87 changes: 87 additions & 0 deletions pandas/tests/frame/methods/test_rank.py
@@ -3,6 +3,8 @@
import numpy as np
import pytest

+from pandas._libs import iNaT
+from pandas._libs.algos import Infinity, NegInfinity
import pandas.util._test_decorators as td

from pandas import DataFrame, Series
@@ -329,3 +331,88 @@ def test_pct_max_many_rows(self):
)
result = df.rank(pct=True).max()
assert (result == 1).all()

@pytest.mark.parametrize(
"contents,dtype",
[
(
[
-np.inf,
-50,
-1,
-1e-20,
-1e-25,
-1e-50,
0,
1e-40,
1e-20,
1e-10,
2,
40,
np.inf,
],
"float64",
),
(
[
-np.inf,
-50,
-1,
-1e-20,
-1e-25,
-1e-45,
0,
1e-40,
1e-20,
1e-10,
2,
40,
np.inf,
],
"float32",
),
([np.iinfo(np.uint8).min, 1, 2, 100, np.iinfo(np.uint8).max], "uint8"),
pytest.param(
[
np.iinfo(np.int64).min,
-100,
0,
1,
9999,
100000,
1e10,
np.iinfo(np.int64).max,
],
"int64",
marks=pytest.mark.xfail(
    reason="iNaT is equivalent to minimum value of dtype "
    "int64 pending issue GH#16674"
),
),

Member:
may need to pass a datetimelike keyword to rank; pretty sure we do this for other functions in that file

Member Author:
Yep, makes sense - saw your comment on that issue. Handling this consistently should be easier after #38744
([NegInfinity(), "1", "A", "BA", "Ba", "C", Infinity()], "object"),
],
)
def test_rank_inf_and_nan(self, contents, dtype):

Member Author:
Mostly duplicated from test_rank_inf(self, contents, dtype); could deduplicate some of this into a fixture in a followup if that makes sense.

Member:
I think that would make sense - besides, the black formatter doesn't make this look great.

dtype_na_map = {
"float64": np.nan,
"float32": np.nan,
"int64": iNaT,
"object": None,
}
# Insert nans at random positions if the underlying dtype has a
# missing value, then adjust the expected order by adding nans
# accordingly. This tests whether the rank calculation is affected
# when values are intertwined with nan values.
values = np.array(contents, dtype=dtype)
exp_order = np.array(range(len(values)), dtype="float64") + 1.0
if dtype in dtype_na_map:
na_value = dtype_na_map[dtype]
nan_indices = np.random.choice(range(len(values)), 5)
values = np.insert(values, nan_indices, na_value)
exp_order = np.insert(exp_order, nan_indices, np.nan)
# shuffle the testing array and expected results in the same way
random_order = np.random.permutation(len(values))
df = DataFrame({"a": values[random_order]})
expected = DataFrame({"a": exp_order[random_order]}, dtype="float64")
result = df.rank()
tm.assert_frame_equal(result, expected)
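As a usage note, the test above exercises the default na_option="keep"; how the fixed ranking treats NaN relative to inf also depends on na_option. A quick sketch (example values are mine, not from the test file):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [np.nan, np.inf, 2.0, np.nan, -np.inf]})

# keep: NaNs stay NaN; inf still ranks above every finite value.
print(df.rank(na_option="keep")["a"].tolist())    # [nan, 3.0, 2.0, nan, 1.0]

# top: NaNs take the smallest ranks, tied below -inf.
print(df.rank(na_option="top")["a"].tolist())     # [1.5, 5.0, 4.0, 1.5, 3.0]

# bottom: NaNs take the largest ranks, tied above inf.
print(df.rank(na_option="bottom")["a"].tolist())  # [4.5, 3.0, 2.0, 4.5, 1.0]
```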