
PERF: Series(pyarrow-backed).rank #50264

Merged (9 commits, Dec 17, 2022)
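In short, this change routes Series.rank for pyarrow-backed dtypes through pyarrow.compute.rank where pyarrow supports the requested method. A rough usage sketch (illustrative only; assumes a pandas build containing this change and pyarrow >= 9.0 installed):

    import numpy as np
    import pandas as pd

    ser = pd.Series(np.random.randn(10**6), dtype="float64[pyarrow]")

    # "min", "max", "first" and "dense" can be handled by pyarrow.compute.rank;
    # "average" (the default), and older pyarrow versions, fall back to the
    # existing numpy-based ranking.
    ranked = ser.rank(method="min")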
Changes from 3 commits
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.0.0.rst
@@ -734,6 +734,7 @@ Performance improvements
- Performance improvement in :meth:`MultiIndex.isin` when ``level=None`` (:issue:`48622`, :issue:`49577`)
- Performance improvement in :meth:`MultiIndex.putmask` (:issue:`49830`)
- Performance improvement in :meth:`Index.union` and :meth:`MultiIndex.union` when index contains duplicates (:issue:`48900`)
- Performance improvement in :meth:`Series.rank` for pyarrow-backed dtypes (:issue:`50264`)
- Performance improvement in :meth:`Series.fillna` for extension array dtypes (:issue:`49722`, :issue:`50078`)
- Performance improvement for :meth:`Series.value_counts` with nullable dtype (:issue:`48338`)
- Performance improvement for :class:`Series` constructor passing integer numpy array with nullable dtype (:issue:`48338`)
65 changes: 64 additions & 1 deletion pandas/core/arrays/arrow/array.py
@@ -11,6 +11,7 @@

from pandas._typing import (
    ArrayLike,
    AxisInt,
    Dtype,
    FillnaOptions,
    Iterator,
@@ -22,6 +23,7 @@
from pandas.compat import (
    pa_version_under6p0,
    pa_version_under7p0,
    pa_version_under9p0,
)
from pandas.util._decorators import doc
from pandas.util._validators import validate_fillna_kwargs
@@ -949,7 +951,68 @@ def _indexing_key_to_indices(
        indices = np.arange(n)[key]
        return indices

    # TODO: redefine _rank using pc.rank with pyarrow 9.0
    def _rank(
        self: ArrowExtensionArrayT,
        *,
        axis: AxisInt = 0,
        method: str = "average",
        na_option: str = "keep",
        ascending: bool = True,
        pct: bool = False,
    ) -> ArrowExtensionArrayT:
        """
        See Series.rank.__doc__.
        """
        if axis != 0:
            raise NotImplementedError

        if (
            pa_version_under9p0
            # as of version 10, pyarrow does not support an "average" method
            or method not in ("min", "max", "first", "dense")
        ):
            from pandas.core.algorithms import rank

            ranked = rank(
                self.to_numpy(),
Review comment thread (on the self.to_numpy() line):

Member:
Is ranked = super()._rank(...) not possible here because of the to_numpy() call? It seems that previously it worked when self was passed?

Member Author:
Yes, we could. However, I observed this:

    import pandas as pd
    import pandas.core.algorithms as algos
    import numpy as np

    arr = pd.array(np.random.randn(10**6), dtype="float64[pyarrow]")

    %timeit algos.rank(arr)
    # 1.41 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit algos.rank(arr.to_numpy())
    # 525 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Probably worth a look in algos.rank to see what's going on. OK to do as a separate PR, or should I look at that first?

Member:
Follow-up PR is good. Ideally it would be good to have the axis != 0 case fall through to super and raise there.
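A minimal sketch of that suggestion (illustrative only, not part of this diff; it relies on the base-class ExtensionArray._rank shown further below, which raises NotImplementedError for axis != 0):

    # hypothetical: delegate the unsupported-axis case to the base class,
    # which raises NotImplementedError there
    if axis != 0:
        return super()._rank(
            axis=axis,
            method=method,
            na_option=na_option,
            ascending=ascending,
            pct=pct,
        )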

                axis=axis,
                method=method,
                na_option=na_option,
                ascending=ascending,
                pct=pct,
            )
            result = pa.array(ranked, type=pa.float64(), from_pandas=True)
            return type(self)(result)

        sort_keys = "ascending" if ascending else "descending"

        if na_option == "top":
            null_placement = "at_start"
        else:
            null_placement = "at_end"

        result = pc.rank(
            self._data.combine_chunks(),
            sort_keys=sort_keys,
            null_placement=null_placement,
            tiebreaker=method,
        )
        if not pa.types.is_floating(result.type):
Review comment thread (on the is_floating check):

mroeschke (Member), Dec 16, 2022:
Is this because the existing rank implementation only returns float64? If so, I think it's okay to diverge and return whatever type pyarrow returns, e.g. if rank(int64) -> int64, I think we should respect that.

lukemanley (Member Author), Dec 16, 2022:
[updated] - my original comment was inaccurate. I just changed this to let the pyarrow behavior flow through, which is uint64. However, for method="average" or pct=True we'll need to convert to float64.

Member:
Right, but if pc.rank returns pa.int64, this would convert the result to pa.float64(). I'm suggesting we just keep the pa.int64 result here, i.e. remove this check.

Member Author:
Let me know if that's what you had in mind.

Member Author:
I think our comments got crossed and we're now on the same page?
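For reference, a small illustration of the type behavior discussed above (assumes pyarrow >= 9.0; not part of this diff):

    import pyarrow as pa
    import pyarrow.compute as pc

    arr = pa.array([3, 1, 2, None], type=pa.int64())
    ranks = pc.rank(arr, sort_keys="ascending", null_placement="at_end", tiebreaker="min")

    print(ranks.type)  # uint64 - pyarrow's rank kernel returns unsigned integer ranks
    print(ranks)       # [3, 1, 2, 4] - the null is ranked too, which is why the
                       # na_option="keep" branch below re-masks nulls with pc.if_else

    # converting to float64 remains necessary for pct=True (and for method="average",
    # which pyarrow does not provide)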

            result = result.cast(pa.float64())

        if na_option == "keep":
            mask = pc.is_null(self._data)
            null = pa.scalar(None, type=self._data.type)
            result = pc.if_else(mask, null, result)

        if pct:
            if method == "dense":
                divisor = pc.max(result)
            else:
                divisor = pc.count(result)
            result = pc.divide(result, divisor)

        return type(self)(result)

    def _quantile(
        self: ArrowExtensionArrayT, qs: npt.NDArray[np.float64], interpolation: str
2 changes: 0 additions & 2 deletions pandas/core/arrays/base.py
@@ -1576,8 +1576,6 @@ def _rank(
        if axis != 0:
            raise NotImplementedError

        # TODO: we only have tests that get here with dt64 and td64
        # TODO: all tests that get here use the defaults for all the kwds
        return rank(
            self,
            axis=axis,
32 changes: 23 additions & 9 deletions pandas/tests/series/methods/test_rank.py
@@ -11,6 +11,7 @@
import pandas.util._test_decorators as td

from pandas import (
    NA,
    NaT,
    Series,
    Timestamp,
@@ -38,6 +39,21 @@ def results(request):
    return request.param


@pytest.fixture(
    params=[
        "object",
        "float64",
        "int64",
        "Float64",
        "Int64",
        pytest.param("float64[pyarrow]", marks=td.skip_if_no("pyarrow")),
        pytest.param("int64[pyarrow]", marks=td.skip_if_no("pyarrow")),
    ]
)
def dtype(request):
    return request.param


class TestSeriesRank:
    @td.skip_if_no_scipy
    def test_rank(self, datetime_series):
@@ -238,13 +254,18 @@ def test_rank_tie_methods(self, ser, results, dtype):
        [
            ("object", None, Infinity(), NegInfinity()),
            ("float64", np.nan, np.inf, -np.inf),
            ("Float64", NA, np.inf, -np.inf),
            pytest.param(
                "float64[pyarrow]", NA, np.inf, -np.inf, marks=td.skip_if_no("pyarrow")
            ),
        ],
    )
    def test_rank_tie_methods_on_infs_nans(
        self, method, na_option, ascending, dtype, na_value, pos_inf, neg_inf
    ):
        chunk = 3
        exp_dtype = dtype if dtype == "float64[pyarrow]" else "float64"

        chunk = 3
        in_arr = [neg_inf] * chunk + [na_value] * chunk + [pos_inf] * chunk
        iseries = Series(in_arr, dtype=dtype)
        exp_ranks = {
@@ -264,7 +285,7 @@ def test_rank_tie_methods_on_infs_nans(
        expected = order if ascending else order[::-1]
        expected = list(chain.from_iterable(expected))
        result = iseries.rank(method=method, na_option=na_option, ascending=ascending)
        tm.assert_series_equal(result, Series(expected, dtype="float64"))
        tm.assert_series_equal(result, Series(expected, dtype=exp_dtype))

    def test_rank_desc_mix_nans_infs(self):
        # GH 19538
@@ -299,7 +320,6 @@ def test_rank_methods_series(self, method, op, value):
        expected = Series(sprank, index=index).astype("float64")
        tm.assert_series_equal(result, expected)

    @pytest.mark.parametrize("dtype", ["O", "f8", "i8"])
    @pytest.mark.parametrize(
        "ser, exp",
        [
@@ -319,7 +339,6 @@ def test_rank_dense_method(self, dtype, ser, exp):
        expected = Series(exp).astype(result.dtype)
        tm.assert_series_equal(result, expected)

    @pytest.mark.parametrize("dtype", ["O", "f8", "i8"])
    def test_rank_descending(self, ser, results, dtype):
        method, _ = results
        if "i" in dtype:
@@ -365,7 +384,6 @@ def test_rank_modify_inplace(self):
# GH15630, pct should be on 100% basis when method='dense'


@pytest.mark.parametrize("dtype", ["O", "f8", "i8"])
@pytest.mark.parametrize(
    "ser, exp",
    [
@@ -387,7 +405,6 @@ def test_rank_dense_pct(dtype, ser, exp):
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("dtype", ["O", "f8", "i8"])
@pytest.mark.parametrize(
    "ser, exp",
    [
@@ -409,7 +426,6 @@ def test_rank_min_pct(dtype, ser, exp):
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("dtype", ["O", "f8", "i8"])
@pytest.mark.parametrize(
    "ser, exp",
    [
@@ -431,7 +447,6 @@ def test_rank_max_pct(dtype, ser, exp):
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("dtype", ["O", "f8", "i8"])
@pytest.mark.parametrize(
    "ser, exp",
    [
@@ -453,7 +468,6 @@ def test_rank_average_pct(dtype, ser, exp):
    tm.assert_series_equal(result, expected)


@pytest.mark.parametrize("dtype", ["f8", "i8"])
@pytest.mark.parametrize(
    "ser, exp",
    [