ArrowExtensionArray.median is implemented using pyarrow.compute.approximate_median which might produce surprising results for users looking for a "true" median:
In [1]: import pandas as pd
In [2]: pd.Series([1, 2], dtype="float64").median()
Out[2]: 1.5
In [3]: pd.Series([1, 2], dtype="float64[pyarrow]").median()
Out[3]: 1.0
It does not look like pyarrow provides a compute function for a "true" median. Is this intentional? Should it be documented somehow?
lukemanley changed the title from 'BUG(?): ArrowExtensionArray.median is an "approximate_median"' to 'BUG(?): ArrowExtensionArray.median is an "approximate median"' on Apr 15, 2023
Even with an odd number of elements, the result is still just an estimate, not a true median. It also looks like quantile is faster than median here:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: ser = pd.Series(np.random.randn(999_999), dtype="float64[pyarrow]")
In [4]: ser.median()
Out[4]: -0.00010149162825897479
In [5]: ser.quantile(0.5)
Out[5]: -0.00011543661536270946
In [6]: np.median(ser.to_numpy())
Out[6]: -0.00011543661536270946
In [7]: %timeit ser.median()
59.4 ms ± 91.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: %timeit ser.quantile(0.5)
13.2 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
cc @mroeschke