PERF: Series(pyarrow-backed).rank #50264

lukemanley · 2022-12-15T00:49:26Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.0.0.rst file if fixing a bug or adding a new feature.

Perf improvement in Series(pyarrow-backed).rank by using pyarrow.compute.rank. All parameter combinations use pyarrow compute functions except for method="average" which falls back to algos.rank.

import pandas as pd 
import numpy as np

ser = pd.Series(np.random.randn(10**6), dtype="float64[pyarrow]")

%timeit ser.rank(method="first")

# 1.41 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 148 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR

mroeschke · 2022-12-16T00:49:38Z

pandas/core/arrays/arrow/array.py

+            from pandas.core.algorithms import rank
+
+            ranked = rank(
+                self.to_numpy(),


Can't this be ranked = super().rank(...)? Is that because of the to_numpy() call? It seems that previously it worked when self was passed.

Yes, we could. However, I observed this:

import pandas as pd
import pandas.core.algorithms as algos
import numpy as np

arr = pd.array(np.random.randn(10**6), dtype="float64[pyarrow]")

%timeit algos.rank(arr)
# 1.41 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit algos.rank(arr.to_numpy())
# 525 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Probably worth a look in algos.rank to see what's going on. OK to do that as a separate PR, or should I look at it first?

A follow-up PR is good. Ideally, the axis != 0 case would fall through to super and raise there.

mroeschke · 2022-12-16T00:53:11Z

pandas/core/arrays/arrow/array.py

+            null_placement=null_placement,
+            tiebreaker=method,
+        )
+        if not pa.types.is_floating(result.type):


Is this because the existing rank implementation only returns float64? If so, I think it's okay to diverge and return whatever type pyarrow returns e.g. if rank(int64) -> int64 I think we should respect that

[updated] - my original comment was inaccurate. I just changed this to let the pyarrow behavior flow through, which returns uint64. However, for method="average" or pct=True we'll need to convert to float64.

Right, but if pc.rank returns pa.int64 this would convert the result to pa.float64()

I'm suggesting we just keep the pa.int64 result here i.e. remove this check

Let me know if that's what you had in mind.

I think our comments got crossed and we're now on the same page?

mroeschke · 2022-12-16T18:27:19Z

pandas/core/arrays/arrow/array.py

+            result = pa.array(ranked, type=pa_type, from_pandas=True)
+            return type(self)(result)
+
+        if axis != 0:


I think this can be combined with the check above, i.e. if pa_version_under9p0 or axis != 0.

Added benefit: if the base implementation ever implements axis != 0, this array gets it for free.

Updated to fall through for axis != 0
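
A rough sketch of the combined check suggested above, in plain Python. Every name here (PYARROW_TOO_OLD, BaseArray, ArrowLikeArray, _fast_rank) is a hypothetical stand-in and not a pandas internal; the point is only that unsupported cases are delegated to the base class, which decides whether to raise.

```python
# Hypothetical stand-ins; none of these names are pandas internals.
PYARROW_TOO_OLD = False  # plays the role of pa_version_under9p0


class BaseArray:
    def _rank(self, values, axis=0):
        if axis != 0:
            raise NotImplementedError("rank is only implemented for axis=0")
        # Generic path: 1-based rank by sorted position. sorted() is stable,
        # so ties are broken by order of appearance.
        order = sorted(range(len(values)), key=lambda i: values[i])
        ranks = [0] * len(values)
        for position, index in enumerate(order, start=1):
            ranks[index] = position
        return ranks


class ArrowLikeArray(BaseArray):
    used_fast_path = False

    def _rank(self, values, axis=0):
        # Single combined check: anything the fast path cannot handle falls
        # through to the base class.
        if PYARROW_TOO_OLD or axis != 0:
            return super()._rank(values, axis=axis)
        return self._fast_rank(values)

    def _fast_rank(self, values):
        # Placeholder for an accelerated implementation; it reuses the
        # generic logic here so the sketch stays self-contained.
        self.used_fast_path = True
        return BaseArray._rank(self, values, axis=0)


arr = ArrowLikeArray()
print(arr._rank([3.0, 1.0, 2.0, 1.0]))
```

With this shape, if the base class later gains support for axis != 0, the subclass picks it up without any change.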

mroeschke · 2022-12-17T21:04:06Z

Thanks @lukemanley

* ArrowExtensionArray._rank
* gh ref
* skip pyarrow tests if not installed
* defer to pc.rank output types
* fix test
* more consistency
* use pyarrow for method="average"
* fix call to super
* call super with axis != 0

ArrowExtensionArray._rank

f7a21c2

lukemanley added the Performance (memory or execution speed performance) and Arrow (pyarrow functionality) labels Dec 15, 2022

lukemanley added 2 commits December 14, 2022 19:56

gh ref

743d90a

skip pyarrow tests if not installed

708878e

mroeschke reviewed Dec 16, 2022

lukemanley added 5 commits December 15, 2022 20:55

defer to pc.rank output types

85226f5

fix test

5d1a533

more consistency

dee82ee

use pyarrow for method="average"

6ba998e

fix call to super

a648a90

mroeschke reviewed Dec 16, 2022

call super with axis != 0

bd90163

mroeschke approved these changes Dec 17, 2022

mroeschke added this to the 2.0 milestone Dec 17, 2022

mroeschke merged commit 8117a55 into pandas-dev:main Dec 17, 2022

lukemanley deleted the arrow-ea-rank branch December 20, 2022 00:46

jorisvandenbossche mentioned this pull request Sep 9, 2024

BUG/API (string dtype): return float dtype for series[str].rank() #59768

Merged
