Skip to content

PERF: Series(pyarrow-backed).rank #50264

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 17, 2022
Merged

Conversation

lukemanley
Copy link
Member

Perf improvement in Series(pyarrow-backed).rank by using pyarrow.compute.rank. All parameter combinations use pyarrow compute functions except for method="average" which falls back to algos.rank.

import pandas as pd 
import numpy as np

ser = pd.Series(np.random.randn(10**6), dtype="float64[pyarrow]")

%timeit ser.rank(method="first")

# 1.41 s ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)    <- main
# 148 ms ± 1.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <- PR

@lukemanley lukemanley added Performance Memory or execution speed performance Arrow pyarrow functionality labels Dec 15, 2022
from pandas.core.algorithms import rank

ranked = rank(
self.to_numpy(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't ranked = super().rank(...) because of the to_numpy() call? It seems that prior it works if self is passed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we could. However, I observed this:

import pandas as pd 
import pandas.core.algorithms as algos
import numpy as np

arr = pd.array(np.random.randn(10**6), dtype="float64[pyarrow]")

%timeit algos.rank(arr)
# 1.41 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit algos.rank(arr.to_numpy())
# 525 ms ± 9.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Probably worth a look in algos.rank to see whats going on. Ok to do as a separate PR or should I look at that first?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up PR is good. Ideally it would be good to have the axis != 1 case fall through to super and raise there

null_placement=null_placement,
tiebreaker=method,
)
if not pa.types.is_floating(result.type):
Copy link
Member

@mroeschke mroeschke Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this because the existing rank implementation only returns float64? If so, I think it's okay to diverge and return whatever type pyarrow returns e.g. if rank(int64) -> int64 I think we should respect that

Copy link
Member Author

@lukemanley lukemanley Dec 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[updated] - my original comment was inaccurate. I just changed this to let the pyarrow behavior flow through which is uint64. However, for method="average" or pct=True we'll need to convert to float64.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but if pc.rank returns pa.int64 this would convert the result to pa.float64()

I'm suggesting we just keep the pa.int64 result here i.e. remove this check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if that's what you had in mind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think our comments got crossed and were now on the same page?

result = pa.array(ranked, type=pa_type, from_pandas=True)
return type(self)(result)

if axis != 0:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be combined with the above if pa_version_under9p0 or axis !=0.

Added benefit that if the base implementation ever implements axis != 0 this arrays gets it for free

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to fall though for axis != 1

@mroeschke mroeschke added this to the 2.0 milestone Dec 17, 2022
@mroeschke mroeschke merged commit 8117a55 into pandas-dev:main Dec 17, 2022
@mroeschke
Copy link
Member

Thanks @lukemanley

phofl pushed a commit to phofl/pandas that referenced this pull request Dec 17, 2022
* ArrowExtensionArray._rank

* gh ref

* skip pyarrow tests if not installed

* defer to pc.rank output types

* fix test

* more consistency

* use pyarrow for method="average"

* fix call to super

* call super with axis != 0
@lukemanley lukemanley deleted the arrow-ea-rank branch December 20, 2022 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants