CLN: replace pandas.compat.scipy.scoreatpercentile with numpy.percentile #6810

Merged
merged 2 commits into from
Apr 16, 2014
4 changes: 4 additions & 0 deletions doc/source/release.rst
@@ -172,6 +172,10 @@ API Changes
   (and numpy defaults)
 - add ``inplace`` keyword to ``Series.order/sort`` to make them inverses (:issue:`6859`)

+- Replace ``pandas.compat.scipy.scoreatpercentile`` with ``numpy.percentile`` (:issue:`6810`)
+- ``.quantile`` on a ``datetime[ns]`` series now returns ``Timestamp`` instead
+  of ``np.datetime64`` objects (:issue:`6810`)
+
 Deprecations
 ~~~~~~~~~~~~

82 changes: 0 additions & 82 deletions pandas/compat/scipy.py
@@ -6,88 +6,6 @@
 import numpy as np


-def scoreatpercentile(a, per, limit=(), interpolation_method='fraction'):
-    """Calculate the score at the given `per` percentile of the sequence `a`.
-
-    For example, the score at `per=50` is the median. If the desired quantile
-    lies between two data points, we interpolate between them, according to
-    the value of `interpolation`. If the parameter `limit` is provided, it
-    should be a tuple (lower, upper) of two values. Values of `a` outside
-    this (closed) interval will be ignored.
-
-    The `interpolation_method` parameter supports three values, namely
-    `fraction` (default), `lower` and `higher`. Interpolation is done only,
-    if the desired quantile lies between two data points `i` and `j`. For
-    `fraction`, the result is an interpolated value between `i` and `j`;
-    for `lower`, the result is `i`, for `higher` the result is `j`.
-
-    Parameters
-    ----------
-    a : ndarray
-        Values from which to extract score.
-    per : scalar
-        Percentile at which to extract score.
-    limit : tuple, optional
-        Tuple of two scalars, the lower and upper limits within which to
-        compute the percentile.
-    interpolation_method : {'fraction', 'lower', 'higher'}, optional
-        This optional parameter specifies the interpolation method to use,
-        when the desired quantile lies between two data points `i` and `j`:
-
-        - fraction: `i + (j - i)*fraction`, where `fraction` is the
-          fractional part of the index surrounded by `i` and `j`.
-        - lower: `i`.
-        - higher: `j`.
-
-    Returns
-    -------
-    score : float
-        Score at percentile.
-
-    See Also
-    --------
-    percentileofscore
-
-    Examples
-    --------
-    >>> from scipy import stats
-    >>> a = np.arange(100)
-    >>> stats.scoreatpercentile(a, 50)
-    49.5
-
-    """
-    # TODO: this should be a simple wrapper around a well-written quantile
-    # function.  GNU R provides 9 quantile algorithms (!), with differing
-    # behaviour at, for example, discontinuities.
-    values = np.sort(a, axis=0)
-    if limit:
-        values = values[(limit[0] <= values) & (values <= limit[1])]
-
-    idx = per / 100. * (values.shape[0] - 1)
-    if idx % 1 == 0:
-        score = values[idx]
-    else:
-        if interpolation_method == 'fraction':
-            score = _interpolate(values[int(idx)], values[int(idx) + 1],
-                                 idx % 1)
-        elif interpolation_method == 'lower':
-            score = values[np.floor(idx)]
-        elif interpolation_method == 'higher':
-            score = values[np.ceil(idx)]
-        else:
-            raise ValueError("interpolation_method can only be 'fraction', "
-                             "'lower' or 'higher'")
-
-    return score
-
-
-def _interpolate(a, b, fraction):
-    """Returns the point at the given fraction between a and b, where
-    'fraction' must be between 0 and 1.
-    """
-    return a + (b - a) * fraction
-
-
 def rankdata(a):
     """
     Ranks the data, dealing with ties appropriately.
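
For reference, the `fraction` interpolation performed by the removed function can be sketched in a few lines. This is a simplified, hypothetical rewrite, not code from this PR: the name `score_at_percentile` is made up here, and the `limit` argument and the `lower`/`higher` modes are left out.

```python
import numpy as np


def score_at_percentile(a, per):
    """Hypothetical, simplified take on the removed helper ('fraction' mode only)."""
    values = np.sort(np.asarray(a, dtype=float))
    # Fractional position of the requested percentile in the sorted data.
    idx = per / 100.0 * (values.shape[0] - 1)
    lower = int(np.floor(idx))
    upper = int(np.ceil(idx))
    # i + (j - i) * fraction, as described in the removed docstring.
    return values[lower] + (values[upper] - values[lower]) * (idx - lower)


a = np.arange(100)
print(score_at_percentile(a, 50))  # 49.5, the value from the docstring example
print(np.percentile(a, 50))
```

`numpy.percentile` also interpolates linearly by default, so the two should agree up to floating-point rounding, which is the difference discussed in the review thread further down.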
2 changes: 1 addition & 1 deletion pandas/core/frame.py
@@ -38,7 +38,7 @@
 import pandas.computation.expressions as expressions
 from pandas.computation.eval import eval as _eval
 from pandas.computation.scope import _ensure_scope
-from pandas.compat.scipy import scoreatpercentile as _quantile
+from numpy import percentile as _quantile
 from pandas.compat import(range, zip, lrange, lmap, lzip, StringIO, u,
                           OrderedDict, raise_with_traceback)
 from pandas import compat
11 changes: 6 additions & 5 deletions pandas/core/series.py
@@ -52,7 +52,7 @@
 import pandas.tslib as tslib
 import pandas.index as _index

-from pandas.compat.scipy import scoreatpercentile as _quantile
+from numpy import percentile as _quantile
 from pandas.core.config import get_option

 __all__ = ['Series']
@@ -1235,10 +1235,11 @@ def quantile(self, q=0.5):
         valid_values = self.dropna().values
         if len(valid_values) == 0:
             return pa.NA
-        result = _quantile(valid_values, q * 100)
-        if not np.isscalar and com.is_timedelta64_dtype(result):
-            from pandas.tseries.timedeltas import to_timedelta
-            return to_timedelta(result)
+        if com.is_datetime64_dtype(self):
+            values = _values_from_object(self).view('i8')
+            result = lib.Timestamp(_quantile(values, q * 100))
+        else:
+            result = _quantile(valid_values, q * 100)

         return result
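
The datetime branch above views the values as `int64` nanoseconds, takes the percentile, and wraps the result in a `Timestamp`. A minimal NumPy-only sketch of the same idea follows. The helper name `datetime_quantile` is hypothetical, and it returns `np.datetime64` rather than a pandas `Timestamp`, so treat it as an illustration under those assumptions rather than the pandas implementation.

```python
import numpy as np


def datetime_quantile(values, q):
    """Hypothetical helper: quantile of datetime64 data via its int64 view."""
    # View the nanosecond timestamps as int64 so percentile can interpolate.
    ints = np.asarray(values, dtype='datetime64[ns]').view('i8')
    score = np.percentile(ints, q * 100)
    # percentile returns a float; round back to whole nanoseconds.
    return np.datetime64(int(round(float(score))), 'ns')


days = np.array(['2000-01-01', '2000-01-02', '2000-01-03',
                 '2000-01-04', '2000-01-05'], dtype='datetime64[D]')
print(datetime_quantile(days, 0.5))
```

As in `Series.quantile`, `q` is a fraction between 0 and 1 and is scaled by 100 before being passed to `numpy.percentile`.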

6 changes: 3 additions & 3 deletions pandas/tests/test_frame.py
@@ -10915,13 +10915,13 @@ def wrapper(x):
                                  check_dtype=False, check_dates=True)

     def test_quantile(self):
-        from pandas.compat.scipy import scoreatpercentile
+        from numpy import percentile

         q = self.tsframe.quantile(0.1, axis=0)
-        self.assertEqual(q['A'], scoreatpercentile(self.tsframe['A'], 10))
+        self.assertEqual(q['A'], percentile(self.tsframe['A'], 10))
         q = self.tsframe.quantile(0.9, axis=1)
         q = self.intframe.quantile(0.1)
-        self.assertEqual(q['A'], scoreatpercentile(self.intframe['A'], 10))
+        self.assertEqual(q['A'], percentile(self.intframe['A'], 10))

         # test degenerate case
         q = DataFrame({'x': [], 'y': []}).quantile(0.1, axis=0)
16 changes: 8 additions & 8 deletions pandas/tests/test_groupby.py
@@ -1907,17 +1907,17 @@ def test_groupby_with_hier_columns(self):
         self.assert_(result.columns.equals(df.columns[:-1]))

     def test_pass_args_kwargs(self):
-        from pandas.compat.scipy import scoreatpercentile
+        from numpy import percentile

-        def f(x, q=None):
-            return scoreatpercentile(x, q)
-        g = lambda x: scoreatpercentile(x, 80)
+        def f(x, q=None, axis=0):
+            return percentile(x, q, axis=axis)
+        g = lambda x: percentile(x, 80, axis=0)

         # Series
         ts_grouped = self.ts.groupby(lambda x: x.month)
-        agg_result = ts_grouped.agg(scoreatpercentile, 80)
-        apply_result = ts_grouped.apply(scoreatpercentile, 80)
-        trans_result = ts_grouped.transform(scoreatpercentile, 80)
+        agg_result = ts_grouped.agg(percentile, 80, axis=0)
+        apply_result = ts_grouped.apply(percentile, 80, axis=0)
+        trans_result = ts_grouped.transform(percentile, 80, axis=0)

         agg_expected = ts_grouped.quantile(.8)
         trans_expected = ts_grouped.transform(g)
@@ -1935,7 +1935,7 @@ def f(x, q=None):

         # DataFrame
         df_grouped = self.tsframe.groupby(lambda x: x.month)
-        agg_result = df_grouped.agg(scoreatpercentile, 80)
+        agg_result = df_grouped.agg(percentile, 80, axis=0)
         apply_result = df_grouped.apply(DataFrame.quantile, .8)
         expected = df_grouped.quantile(.8)
         assert_frame_equal(apply_result, expected)
19 changes: 15 additions & 4 deletions pandas/tests/test_series.py
@@ -2137,17 +2137,28 @@ def test_prod_numpy16_bug(self):
         self.assertNotIsInstance(result, Series)

     def test_quantile(self):
-        from pandas.compat.scipy import scoreatpercentile
+        from numpy import percentile

         q = self.ts.quantile(0.1)
-        self.assertEqual(q, scoreatpercentile(self.ts.valid(), 10))
+        self.assertEqual(q, percentile(self.ts.valid(), 10))

         q = self.ts.quantile(0.9)
-        self.assertEqual(q, scoreatpercentile(self.ts.valid(), 90))
+        self.assertEqual(q, percentile(self.ts.valid(), 90))

         # object dtype
         q = Series(self.ts,dtype=object).quantile(0.9)
-        self.assertEqual(q, scoreatpercentile(self.ts.valid(), 90))
+        self.assertEqual(q, percentile(self.ts.valid(), 90))
+
+        # datetime64[ns] dtype
+        dts = self.ts.index.to_series()
+        q = dts.quantile(.2)
+        self.assertEqual(q, Timestamp('2000-01-10 19:12:00'))
+
+        if not _np_version_under1p7:
+            # timedelta64[ns] dtype
+            tds = dts.diff()
+            q = tds.quantile(.25)
+            self.assertEqual(q, pd.to_timedelta('24:00:00'))

     def test_describe(self):
         _ = self.series.describe()
2 changes: 1 addition & 1 deletion pandas/tseries/tests/test_timedeltas.py
@@ -240,7 +240,7 @@ def test_timedelta_ops(self):

         result = td.quantile(.1)
         # This properly returned a scalar.
-        expected = to_timedelta('00:00:02.6')
+        expected = np.timedelta64(2599999999,'ns')
Contributor

This is a rounding issue, yes?

Contributor Author

Yes. Julian looked into the difference in rounding methods between pandas.compat.scipy.scoreatpercentile and numpy in a comment on #5824, and also offered to update numpy. Do you think this hard-coded expected value should be removed, so that the test expects whatever numpy.percentile returns, in case they do change it?

Contributor

Hmm, I think it is OK to change it and go with the np.percentile results.

Can you do a fuzzy comparison instead of equality? (I guess that since these are integers, almost_equal does not work.)
I may still update numpy, as this method saves a few precious cycles for small percentiles.

Contributor

@juliantaylor this PR replaces our original method, so we now use np.percentile fully, and it is OK with all numpy versions (including numpy master).

If you do make a change in numpy master, then we can change the test (to more of an allclose one).
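
One way the fuzzy comparison suggested above could look for `timedelta64` values is sketched below. Everything here is an assumption for illustration: `timedelta_close` is a hypothetical helper, not an existing pandas or numpy function, and the 1000 ns default tolerance is arbitrary.

```python
import numpy as np


def timedelta_close(result, expected, tol_ns=1000):
    """Hypothetical check: do two timedelta64 values differ by at most tol_ns ns?"""
    # Dividing by a 1 ns timedelta gives the difference as a plain float.
    diff_ns = (result - expected) / np.timedelta64(1, 'ns')
    return bool(abs(diff_ns) <= tol_ns)


result = np.timedelta64(2599999999, 'ns')
expected = np.timedelta64(2600, 'ms')  # i.e. 00:00:02.6
print(timedelta_close(result, expected))  # True: the values differ by 1 ns
```

A check of this kind would accept both the old `00:00:02.6` expectation and the `2599999999` ns value that `np.percentile` produces.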

         tm.assert_almost_equal(result, expected)

         result = td.median()[0]