-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Series(pyarrow-backed).rank #50264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/arrays/arrow/array.py
Outdated
from pandas.core.algorithms import rank | ||
|
||
ranked = rank( | ||
self.to_numpy(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't ranked = super().rank(...)
because of the to_numpy()
call? It seems that prior it works if self
is passed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could. However, I observed this:
import pandas as pd
import pandas.core.algorithms as algos
import numpy as np
arr = pd.array(np.random.randn(10**6), dtype="float64[pyarrow]")
%timeit algos.rank(arr)
# 1.41 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit algos.rank(arr.to_numpy())
# 525 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Probably worth a look in algos.rank
to see whats going on. Ok to do as a separate PR or should I look at that first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow up PR is good. Ideally it would be good to have the axis != 1
case fall through to super and raise there
pandas/core/arrays/arrow/array.py
Outdated
null_placement=null_placement, | ||
tiebreaker=method, | ||
) | ||
if not pa.types.is_floating(result.type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this because the existing rank implementation only returns float64
? If so, I think it's okay to diverge and return whatever type pyarrow returns e.g. if rank(int64) -> int64
I think we should respect that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[updated] - my original comment was inaccurate. I just changed this to let the pyarrow behavior flow through which is uint64
. However, for method="average"
or pct=True
we'll need to convert to float64
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but if pc.rank
returns pa.int64
this would convert the result to pa.float64()
I'm suggesting we just keep the pa.int64
result here i.e. remove this check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if that's what you had in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think our comments got crossed and were now on the same page?
pandas/core/arrays/arrow/array.py
Outdated
result = pa.array(ranked, type=pa_type, from_pandas=True) | ||
return type(self)(result) | ||
|
||
if axis != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be combined with the above if pa_version_under9p0 or axis !=0
.
Added benefit that if the base implementation ever implements axis != 0
this arrays gets it for free
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to fall though for axis != 1
Thanks @lukemanley |
* ArrowExtensionArray._rank * gh ref * skip pyarrow tests if not installed * defer to pc.rank output types * fix test * more consistency * use pyarrow for method="average" * fix call to super * call super with axis != 0
doc/source/whatsnew/v2.0.0.rst
file if fixing a bug or adding a new feature.Perf improvement in
Series(pyarrow-backed).rank
by usingpyarrow.compute.rank
. All parameter combinations use pyarrow compute functions except formethod="average"
which falls back toalgos.rank
.