BUG(?): ArrowExtensionArray.median is an "approximate median" #52679


Closed
lukemanley opened this issue Apr 15, 2023 · 4 comments · Fixed by #52765
Labels: Arrow (pyarrow functionality), Bug, Reduction Operations (sum, mean, min, max, etc.)

Comments

@lukemanley
Member

lukemanley commented Apr 15, 2023

ArrowExtensionArray.median is implemented using pyarrow.compute.approximate_median, which might produce surprising results for users expecting a "true" median:

In [1]: import pandas as pd

In [2]: pd.Series([1, 2], dtype="float64").median()
Out[2]: 1.5

In [3]: pd.Series([1, 2], dtype="float64[pyarrow]").median()
Out[3]: 1.0

It does not look like pyarrow provides a compute function for a "true" median. Is this intentional? Should it be documented somehow?

cc @mroeschke

lukemanley added the Arrow (pyarrow functionality) label on Apr 15, 2023
lukemanley changed the title from BUG(?): ArrowExtensionArray.median is an "approximate_median" to BUG(?): ArrowExtensionArray.median is an "approximate median" on Apr 15, 2023
@lukemanley
Member Author

It seems like this might work to get a true median:

In [4]: pd.Series([1, 2], dtype="float64[pyarrow]").quantile(0.5)
Out[4]: 1.5
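
As a rough illustration of why quantile(0.5) agrees with the "true" median here: the sketch below assumes linear interpolation between order statistics, which is the default interpolation for pandas' quantile. The helper name linear_quantile is hypothetical and is not part of pandas or pyarrow; it only mirrors the arithmetic, not the actual implementation.

```python
import math
import statistics


def linear_quantile(values, q):
    """Hypothetical helper: quantile via linear interpolation between order statistics."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(values)
    # Fractional position of the requested quantile within the sorted data.
    pos = q * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    # Interpolate between the two neighbouring order statistics.
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


even = [1.0, 2.0]       # even length: the median is the midpoint of the two middle values
odd = [3.0, 1.0, 2.0]   # odd length: the median is the single middle value

print(linear_quantile(even, 0.5), statistics.median(even))
print(linear_quantile(odd, 0.5), statistics.median(odd))
```

With q = 0.5 the interpolated position falls exactly on the middle element for odd lengths and halfway between the two middle elements for even lengths, so under this assumption the result is an exact median rather than an estimate.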

@mroeschke
Member

Is approximate median more performant than quantile when there are an odd number of elements in the Series?

@lukemanley
Member Author

Even with an odd number of elements, it is still just an estimate, not a true median. It also looks like quantile is more performant:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: ser = pd.Series(np.random.randn(999_999), dtype="float64[pyarrow]")

In [4]: ser.median()
Out[4]: -0.00010149162825897479

In [5]: ser.quantile(0.5)
Out[5]: -0.00011543661536270946

In [6]: np.median(ser.to_numpy())
Out[6]: -0.00011543661536270946

In [7]: %timeit ser.median()
59.4 ms ± 91.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit ser.quantile(0.5)
13.2 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@mroeschke
Member

Okay, let's definitely use quantile then.
