Skip to content

BUG: DataFrame.rank with np.inf and np.nan #38681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Dec 30, 2020

Conversation

mzeitlin11
Copy link
Member

Seemed like two distinct issues here - first np.inf wasn't distinguished from np.nan (#32593 (comment)). The changes to skip_condition logic handles that, but doesn't fix issue from OP. Rest of diff is to handle when np.inf and np.nan are both included.

([NegInfinity(), "1", "A", "BA", "Ba", "C", Infinity()], "object"),
],
)
def test_rank_inf_and_nan(self, contents, dtype):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly duplicated from

def test_rank_inf(self, contents, dtype):

Could deduplicate some of this into a fixture in a followup if that makes sense

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would make sense - besides, the black formatter doesn't make this look great

@mzeitlin11 mzeitlin11 added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Dec 24, 2020
@jreback
Copy link
Contributor

jreback commented Dec 24, 2020

can you add the 2 examples from the OP as tests just to have them.

@jreback jreback added the Bug label Dec 24, 2020
@mzeitlin11
Copy link
Member Author

Added the tests from the issue

@jreback jreback added this to the 1.3 milestone Dec 27, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. cc @jbrockmendel if you can double check here. merge if ok.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2020

cc @jbrockmendel if any comments.

col1 = [5, 4, 3, 5, 8, 5, 2, 1, 6, 6]
col2 = [5, 4, np.nan, 5, 8, 5, np.inf, np.nan, 6, -np.inf]
df = DataFrame(
index=index,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, can index go after data

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

],
"int64",
marks=pytest.mark.xfail(
reason="iNaT is equivalent to minimum value of dtype"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may need to pass a datetimelike keyword to rank; pretty sure we do this for other functions in that file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep makes sense, saw your comment on that issue. Handling this consistently should be easier after #38744

val = values[i, j]

if rank_t is not uint64_t:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i expect we'll see a perf hit on non-float dtypes because of mask[...] lookups. is that effect small?

Copy link
Member Author

@mzeitlin11 mzeitlin11 Dec 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up - good lesson for me to always check perf. Turns out mask wasn't typed, so perf got way worse. With mask typed, looks like a small effect (but not negligible, though hard to tell with variation between successive runs). Using the added benchmark from #38744:

       before           after         ratio
     [e85d0782]       [36884e67]
     <master>         <bug/2d_rank_inf>
       6.66±0.2ms       6.61±0.2ms     0.99  frame_methods.Rank.time_rank('float')
       2.24±0.1ms      2.58±0.09ms    ~1.15  frame_methods.Rank.time_rank('int')
       46.6±0.5ms         46.6±1ms     1.00  frame_methods.Rank.time_rank('object')
      2.20±0.08ms      2.20±0.06ms     1.00  frame_methods.Rank.time_rank('uint')

On a larger (probably more stable example):

             before           after         ratio
     [e85d0782]       [36884e67]
     <master>         <bug/2d_rank_inf>
         975±20ms         972±30ms     1.00  frame_methods.Rank.time_rank('float')
+        329±10ms         378±10ms     1.15  frame_methods.Rank.time_rank('int')
       12.8±0.01s        13.4±0.3s     1.05  frame_methods.Rank.time_rank('object')
          318±6ms         343±30ms     1.08  frame_methods.Rank.time_rank('uint')

(with a second run looking better):

         990±20ms         984±10ms     0.99  frame_methods.Rank.time_rank('float')
          345±8ms          368±4ms     1.07  frame_methods.Rank.time_rank('int')
        12.9±0.2s        13.0±0.2s     1.01  frame_methods.Rank.time_rank('object')
         356±10ms          339±4ms     0.95  frame_methods.Rank.time_rank('uint')

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also changed if conditions to short-circuit to avoid some mask lookups


continue
if mask[i, argsorted[i, j]] and keep_na:
ranks[i, argsorted[i, j]] = NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could define idx = argsoted[i, j] on L1110 to avoid doing the lookup twice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@jreback jreback merged commit 387d485 into pandas-dev:master Dec 30, 2020
@jreback
Copy link
Contributor

jreback commented Dec 30, 2020

thanks @mzeitlin11

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Dataframe.rank() produces wrong results for float columns
4 participants