BUG: DataFrame.rank with np.inf and np.nan #38681

mzeitlin11 · 2020-12-24T15:03:41Z

closes [BUG] Dataframe.rank() produces wrong results for float columns #32593
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

There seemed to be two distinct issues here. First, np.inf wasn't distinguished from np.nan (#32593 (comment)). The changes to the skip_condition logic handle that, but don't fix the issue from the OP. The rest of the diff handles the case where np.inf and np.nan are both present.
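
To make the intended behaviour concrete, here is a minimal pure-Python sketch of an average-method rank that masks only NaN and ranks +/-inf as ordinary extreme values. It is not the Cython code in pandas/_libs/algos.pyx; `average_rank` is a hypothetical helper, and the sample column is the float column used in the tests added by this PR.

```python
import math


def average_rank(values):
    """Ascending average rank; NaN stays NaN, +/-inf are ranked normally."""
    # The mask flags only NaN. Infinities are deliberately NOT masked.
    mask = [math.isnan(v) for v in values]
    valid = sorted((v, i) for i, v in enumerate(values) if not mask[i])

    ranks = [math.nan] * len(values)
    pos = 0
    while pos < len(valid):
        # Extend `end` over the run of values tied with valid[pos].
        end = pos
        while end + 1 < len(valid) and valid[end + 1][0] == valid[pos][0]:
            end += 1
        # Mean of the 1-based positions pos+1 .. end+1.
        avg = (pos + 1 + end + 1) / 2
        for _, i in valid[pos:end + 1]:
            ranks[i] = avg
        pos = end + 1
    return ranks


col2 = [5, 4, math.nan, 5, 8, 5, math.inf, math.nan, 6, -math.inf]
ranks = average_rank(col2)
```

With this column, -inf gets the lowest rank, inf the highest among the non-missing values, and the two NaN entries stay NaN rather than sharing a rank with inf.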

mzeitlin11 · 2020-12-24T15:07:23Z

pandas/tests/frame/methods/test_rank.py

+            ([NegInfinity(), "1", "A", "BA", "Ba", "C", Infinity()], "object"),
+        ],
+    )
+    def test_rank_inf_and_nan(self, contents, dtype):


Mostly duplicated from

pandas/pandas/tests/series/methods/test_rank.py

Line 269 in 0805043

def test_rank_inf(self, contents, dtype):

Could deduplicate some of this into a fixture in a followup if that makes sense

I think that would make sense - besides, the black formatter doesn't make this look great

jreback · 2020-12-24T20:49:53Z

can you add the 2 examples from the OP as tests just to have them.

mzeitlin11 · 2020-12-25T00:38:29Z

Added the tests from the issue

jreback

lgtm. cc @jbrockmendel if you can double check here. merge if ok.

jreback · 2020-12-29T17:02:09Z

cc @jbrockmendel if any comments.

jbrockmendel · 2020-12-30T03:01:56Z

pandas/tests/frame/methods/test_rank.py

+        col1 = [5, 4, 3, 5, 8, 5, 2, 1, 6, 6]
+        col2 = [5, 4, np.nan, 5, 8, 5, np.inf, np.nan, 6, -np.inf]
+        df = DataFrame(
+            index=index,


nitpick, can index go after data

jbrockmendel · 2020-12-30T03:02:50Z

pandas/tests/frame/methods/test_rank.py

+                ],
+                "int64",
+                marks=pytest.mark.xfail(
+                    reason="iNaT is equivalent to minimum value of dtype"


may need to pass a datetimelike keyword to rank; pretty sure we do this for other functions in that file

Yep makes sense, saw your comment on that issue. Handling this consistently should be easier after #38744
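
A rough sketch of why the xfail above exists, assuming a flag along the lines of the suggested datetimelike keyword: the iNaT sentinel is the minimum int64 value, so without extra information a plain int64 column that contains that value looks like it contains a missing entry. `missing_mask` and `is_datetimelike` are hypothetical names for illustration, not the signature pandas uses.

```python
INT64_MIN = -(2 ** 63)  # the sentinel value that iNaT is equivalent to


def missing_mask(values, is_datetimelike):
    """Flag missing entries in a list of int64-range integers.

    Only datetimelike data treats INT64_MIN as missing; for plain
    integers the same value is a legitimate minimum and must be ranked.
    """
    if is_datetimelike:
        return [v == INT64_MIN for v in values]
    return [False] * len(values)


values = [INT64_MIN, 0, 5]
as_datetimes = missing_mask(values, is_datetimelike=True)
as_integers = missing_mask(values, is_datetimelike=False)
```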

jbrockmendel · 2020-12-30T03:04:25Z

pandas/_libs/algos.pyx

            val = values[i, j]

-            if rank_t is not uint64_t:


i expect we'll see a perf hit on non-float dtypes because of mask[...] lookups. is that effect small?

Thanks for bringing this up - good lesson for me to always check perf. Turns out mask wasn't typed, so perf got way worse. With mask typed, looks like a small effect (but not negligible, though hard to tell with variation between successive runs). Using the added benchmark from #38744:

       before           after         ratio
     [e85d0782]       [36884e67]
     <master>         <bug/2d_rank_inf>
       6.66±0.2ms       6.61±0.2ms     0.99  frame_methods.Rank.time_rank('float')
       2.24±0.1ms      2.58±0.09ms    ~1.15  frame_methods.Rank.time_rank('int')
       46.6±0.5ms         46.6±1ms     1.00  frame_methods.Rank.time_rank('object')
      2.20±0.08ms      2.20±0.06ms     1.00  frame_methods.Rank.time_rank('uint')

On a larger (probably more stable) example:

       before           after         ratio
     [e85d0782]       [36884e67]
     <master>         <bug/2d_rank_inf>
         975±20ms         972±30ms     1.00  frame_methods.Rank.time_rank('float')
+        329±10ms         378±10ms     1.15  frame_methods.Rank.time_rank('int')
       12.8±0.01s        13.4±0.3s     1.05  frame_methods.Rank.time_rank('object')
          318±6ms         343±30ms     1.08  frame_methods.Rank.time_rank('uint')

(with a second run looking better):

         990±20ms         984±10ms     0.99  frame_methods.Rank.time_rank('float')
          345±8ms          368±4ms     1.07  frame_methods.Rank.time_rank('int')
        12.9±0.2s        13.0±0.2s     1.01  frame_methods.Rank.time_rank('object')
         356±10ms          339±4ms     0.95  frame_methods.Rank.time_rank('uint')

Also changed if conditions to short-circuit to avoid some mask lookups
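
One way to read the short-circuit change, as a pure-Python illustration rather than the actual Cython: testing the cheap boolean flag before indexing into the mask means the mask is never read when the flag is false. `CountingMask` and `count_kept_na` are hypothetical helpers that exist only to count lookups.

```python
class CountingMask:
    """List wrapper that counts element lookups (illustration only)."""

    def __init__(self, flags):
        self._flags = list(flags)
        self.lookups = 0

    def __getitem__(self, i):
        self.lookups += 1
        return self._flags[i]


def count_kept_na(mask, n, keep_na, flag_first):
    """Count entries that are both masked and kept as NA."""
    kept = 0
    for i in range(n):
        if flag_first:
            hit = keep_na and mask[i]  # mask[i] is skipped when keep_na is False
        else:
            hit = mask[i] and keep_na  # mask[i] is always read
        if hit:
            kept += 1
    return kept


flag_first_mask = CountingMask([True, False, True])
mask_first_mask = CountingMask([True, False, True])
count_kept_na(flag_first_mask, 3, keep_na=False, flag_first=True)
count_kept_na(mask_first_mask, 3, keep_na=False, flag_first=False)
```

Both orderings give the same result; only the number of mask reads differs.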

jbrockmendel · 2020-12-30T03:04:55Z

pandas/_libs/algos.pyx

-
-                    continue
+            if mask[i, argsorted[i, j]] and keep_na:
+                ranks[i, argsorted[i, j]] = NaN


could define idx = argsorted[i, j] on L1110 to avoid doing the lookup twice
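
The suggestion amounts to hoisting the repeated index lookup into a local variable. A minimal sketch on plain Python lists for a single row, assuming the same names as the diff (`mask`, `argsorted`, `ranks`, `keep_na`); `mark_missing` is a hypothetical function, not part of algos.pyx.

```python
import math


def mark_missing(ranks_row, mask_row, argsorted_row, keep_na=True):
    """Set the rank of masked entries to NaN, visiting them in sorted order."""
    for j in range(len(argsorted_row)):
        idx = argsorted_row[j]  # looked up once, reused for mask and ranks
        if keep_na and mask_row[idx]:
            ranks_row[idx] = math.nan
    return ranks_row


row_ranks = mark_missing([1.0, 2.0, 3.0], [False, True, False], [2, 0, 1])
```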

jreback · 2020-12-30T21:11:32Z

thanks @mzeitlin11

mzeitlin11 added 5 commits December 24, 2020 09:33

BUG: 2d rank infs and nans

975de84

Add whatsnew

fd358b1

Remove added blank line

256ff84

Add back other blank line

f7bc058

Clean diff

39df107

mzeitlin11 commented Dec 24, 2020

mzeitlin11 added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Dec 24, 2020

jreback added the Bug label Dec 24, 2020

Add tests from OP

7abd460

jreback added this to the 1.3 milestone Dec 27, 2020

jreback approved these changes Dec 27, 2020

Merge branch 'master' into bug/2d_rank_inf

ee8b95d

jbrockmendel reviewed Dec 30, 2020

mzeitlin11 added 7 commits December 30, 2020 10:51

Fix merge conflict

34bbcb9

Avoid repeated lookup

b722d19

Fix merge

acbc2bd

Move index after data

9d6ee7f

Type mask, shortcircuit

36884e6

Remove dup whatsnew

e60aaf0

Name check more clearly

732c482

jreback merged commit 387d485 into pandas-dev:master Dec 30, 2020

mzeitlin11 deleted the bug/2d_rank_inf branch December 30, 2020 21:57

mzeitlin11 mentioned this pull request Jan 2, 2021

TST/CLN: deduplicate troublesome rank values #38894

Merged

jreback mentioned this pull request Jan 7, 2021

BUG: DataFrame.rank() wrong result for inf #39015

Closed

3 tasks

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021

BUG: DataFrame.rank with np.inf and np.nan (pandas-dev#38681)

f1c78f7

MarcoGorelli Dec 25, 2020
