-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: DataFrame.rank with np.inf and np.nan #38681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
([NegInfinity(), "1", "A", "BA", "Ba", "C", Infinity()], "object"), | ||
], | ||
) | ||
def test_rank_inf_and_nan(self, contents, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly duplicated from
def test_rank_inf(self, contents, dtype): |
Could deduplicate some of this into a fixture in a followup if that makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would make sense - besides, the black
formatter doesn't make this look great
can you add the 2 examples from the OP as tests just to have them. |
Added the tests from the issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. cc @jbrockmendel if you can double check here. merge if ok.
cc @jbrockmendel if any comments. |
col1 = [5, 4, 3, 5, 8, 5, 2, 1, 6, 6] | ||
col2 = [5, 4, np.nan, 5, 8, 5, np.inf, np.nan, 6, -np.inf] | ||
df = DataFrame( | ||
index=index, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick, can index go after data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
], | ||
"int64", | ||
marks=pytest.mark.xfail( | ||
reason="iNaT is equivalent to minimum value of dtype" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may need to pass a datetimelike keyword to rank; pretty sure we do this for other functions in that file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep makes sense, saw your comment on that issue. Handling this consistently should be easier after #38744
val = values[i, j] | ||
|
||
if rank_t is not uint64_t: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i expect we'll see a perf hit on non-float dtypes because of mask[...] lookups. is that effect small?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up - good lesson for me to always check perf. Turns out mask
wasn't typed, so perf got way worse. With mask typed, looks like a small effect (but not negligible, though hard to tell with variation between successive runs). Using the added benchmark from #38744:
before after ratio
[e85d0782] [36884e67]
<master> <bug/2d_rank_inf>
6.66±0.2ms 6.61±0.2ms 0.99 frame_methods.Rank.time_rank('float')
2.24±0.1ms 2.58±0.09ms ~1.15 frame_methods.Rank.time_rank('int')
46.6±0.5ms 46.6±1ms 1.00 frame_methods.Rank.time_rank('object')
2.20±0.08ms 2.20±0.06ms 1.00 frame_methods.Rank.time_rank('uint')
On a larger (probably more stable example):
before after ratio
[e85d0782] [36884e67]
<master> <bug/2d_rank_inf>
975±20ms 972±30ms 1.00 frame_methods.Rank.time_rank('float')
+ 329±10ms 378±10ms 1.15 frame_methods.Rank.time_rank('int')
12.8±0.01s 13.4±0.3s 1.05 frame_methods.Rank.time_rank('object')
318±6ms 343±30ms 1.08 frame_methods.Rank.time_rank('uint')
(with a second run looking better):
990±20ms 984±10ms 0.99 frame_methods.Rank.time_rank('float')
345±8ms 368±4ms 1.07 frame_methods.Rank.time_rank('int')
12.9±0.2s 13.0±0.2s 1.01 frame_methods.Rank.time_rank('object')
356±10ms 339±4ms 0.95 frame_methods.Rank.time_rank('uint')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also changed if conditions to short-circuit to avoid some mask lookups
pandas/_libs/algos.pyx
Outdated
|
||
continue | ||
if mask[i, argsorted[i, j]] and keep_na: | ||
ranks[i, argsorted[i, j]] = NaN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could define idx = argsoted[i, j]
on L1110 to avoid doing the lookup twice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks
thanks @mzeitlin11 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Seemed like two distinct issues here - first
np.inf
wasn't distinguished fromnp.nan
(#32593 (comment)). The changes toskip_condition
logic handles that, but doesn't fix issue from OP. Rest of diff is to handle whennp.inf
andnp.nan
are both included.