Rank categorical #15422

Closed
wants to merge 16 commits into from
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.20.0.txt
@@ -578,6 +578,7 @@ Bug Fixes



- Bug in ``.rank()`` which incorrectly ranks ordered categories (:issue:`15420`)



5 changes: 4 additions & 1 deletion pandas/core/algorithms.py
@@ -973,6 +973,10 @@ def _hashtable_algo(f, values, return_dtype=None):
def _get_data_algo(values, func_map):

    f = None

    if is_categorical_dtype(values):
        values = values._values_for_rank()

    if is_float_dtype(values):
        f = func_map['float64']
        values = _ensure_float64(values)
@@ -988,7 +992,6 @@ def _get_data_algo(values, func_map):
    elif is_unsigned_integer_dtype(values):
        f = func_map['uint64']
        values = _ensure_uint64(values)

    else:
        values = _ensure_object(values)

22 changes: 22 additions & 0 deletions pandas/core/categorical.py
@@ -1404,6 +1404,28 @@ def sort_values(self, inplace=False, ascending=True, na_position='last'):
        return self._constructor(values=codes, categories=self.categories,
                                 ordered=self.ordered, fastpath=True)

    def _values_for_rank(self):
        """
        For correctly ranking ordered categorical data. See GH#15420

        Ordered categorical data should be ranked on the basis of
        codes with -1 translated to NaN.

        Returns
        -------
        numpy array

        """
        if self.ordered:
            values = self.codes
            mask = values == -1
            if mask.any():
                values = values.astype('float64')
                values[mask] = np.nan
        else:
            values = np.array(self)
Member

As I said before, I think it could be faster to actually reorder the categories (so rank can use the integer/float codes) instead of passing an object array to rank, or at least to check whether the categories are sorted and, in that case, also pass the codes.
But given that this is also the current situation, it's not a blocker for this PR.
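
A minimal sketch of the second suggestion above (pass the codes whenever the categories are already sorted). This is not code from the PR: `values_for_rank_sketch` is a hypothetical standalone helper, and it assumes a pandas version where `Index.is_monotonic_increasing` is available.

```python
import numpy as np
import pandas as pd


def values_for_rank_sketch(cat):
    """Hypothetical helper: choose the array that would be handed to rank."""
    codes = np.asarray(cat.codes)
    if cat.ordered or cat.categories.is_monotonic_increasing:
        # Codes already follow the desired order, so rank them directly.
        # Missing values are stored as code -1 and become NaN.
        values = codes.astype("float64")
        values[codes == -1] = np.nan
        return values
    # Otherwise fall back to ranking the raw values, as the PR does.
    return np.asarray(cat)


# Unordered categorical whose inferred categories ['a', 'b', 'c'] are sorted.
cat = pd.Categorical(["b", "a", None, "c"])
keys = values_for_rank_sketch(cat)
ranks_from_codes = pd.Series(keys).rank()
ranks_from_values = pd.Series(cat).rank()
```

In this sketch the fast path is taken only when ranking the codes is guaranteed to give the same order as ranking the values; everything else keeps the existing behaviour.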

Member

If you're interested, that can be done in a follow-up PR (to get this one merged).

Contributor

@jreback, Feb 24, 2017

Sorry @jorisvandenbossche, I didn't see that comment. Yes, this could be an easy follow-up, something like:

In [16]: res = Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()

In [17]: res2 = s.rank()

In [18]: res.equals(res2)
Out[18]: True

In [19]: %timeit Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()
100 loops, best of 3: 4.39 ms per loop

In [20]: %timeit s.rank()
10 loops, best of 3: 132 ms per loop
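
The session above does not show how `s` was built. A self-contained version of the same idea, as a sketch only: the sample data is made up for illustration (it mirrors the labels used in the PR's test), and the category ranks are passed as a plain list because, as far as I know, later pandas versions treat a `Series` argument to `rename_categories` as a mapping rather than as a list of new names.

```python
import numpy as np
import pandas as pd

# Illustrative data: an unordered categorical with one missing value.
labels = ["first", "second", "third", "fourth", "fifth", "sixth"]
s = pd.Series(pd.Categorical(labels + [None], categories=labels))

# Rank the categories once, then relabel every category with its rank.
category_ranks = pd.Series(list(s.cat.categories), dtype=object).rank().tolist()
relabelled = s.cat.rename_categories(category_ranks)

# The relabelled data converts to a float array (NaN for missing values),
# so the final rank runs on numbers instead of Python objects.
as_floats = np.array(relabelled, dtype="float64")
res = pd.Series(as_floats).rank()

res2 = s.rank()
```

Only the number of categories is ranked as objects here; the per-row work happens on floats, which is where the speed-up reported in the timings would be expected to come from.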

Contributor Author

@jorisvandenbossche sure. Will do

Contributor

I added the issue #15498 for this

        return values

    def order(self, inplace=False, ascending=True, na_position='last'):
        """
        DEPRECATED: use :meth:`Categorical.sort_values`. That function
78 changes: 78 additions & 0 deletions pandas/tests/series/test_analytics.py
@@ -1057,6 +1057,84 @@ def test_rank(self):
        iranks = iseries.rank()
        assert_series_equal(iranks, exp)

    def test_rank_categorical(self):
        # GH issue #15420 rank incorrectly orders ordered categories

        # Test ascending/descending ranking for ordered categoricals
        exp = pd.Series([1., 2., 3., 4., 5., 6.])
        exp_desc = pd.Series([6., 5., 4., 3., 2., 1.])
        ordered = pd.Series(
            ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
        ).astype('category', ).cat.set_categories(
            ['first', 'second', 'third', 'fourth', 'fifth', 'sixth'],
            ordered=True
        )
        assert_series_equal(ordered.rank(), exp)
        assert_series_equal(ordered.rank(ascending=False), exp_desc)

        # Unordered categoricals should be ranked as objects
        unordered = pd.Series(
Contributor

write down the expected result here and use it in the comparison

            ['first', 'second', 'third', 'fourth', 'fifth', 'sixth'],
        ).astype('category').cat.set_categories(
            ['first', 'second', 'third', 'fourth', 'fifth', 'sixth'],
            ordered=False
        )
        exp_unordered = pd.Series([2., 4., 6., 3., 1., 5.])
        res = unordered.rank()
        assert_series_equal(res, exp_unordered)

        # Test na_option for rank data
        na_ser = pd.Series(
            ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', np.NaN]
        ).astype('category', ).cat.set_categories(
Member

you can pass categories=[..], ordered=True within the astype call
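
The suggestion above refers to keyword arguments that `astype('category', ...)` accepted when this PR was written. As a hedged sketch for readers on later pandas versions, where to my knowledge those keywords are no longer accepted, the same one-step construction can be written with `CategoricalDtype`; the data below mirrors the `na_option` case in the test.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = ["first", "second", "third", "fourth", "fifth", "sixth"]

# Categories and ordering are declared once, on the dtype.
dtype = CategoricalDtype(categories=labels + ["seventh"], ordered=True)
na_ser = pd.Series(labels + [np.nan]).astype(dtype)

ranks_keep = na_ser.rank(na_option="keep")
ranks_top = na_ser.rank(na_option="top")
```

This removes the separate `.cat.set_categories(...)` call while producing the same ordered categorical.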

            [
                'first', 'second', 'third', 'fourth',
                'fifth', 'sixth', 'seventh'
            ],
            ordered=True
        )

        exp_top = pd.Series([2., 3., 4., 5., 6., 7., 1.])
        exp_bot = pd.Series([1., 2., 3., 4., 5., 6., 7.])
        exp_keep = pd.Series([1., 2., 3., 4., 5., 6., np.NaN])

        assert_series_equal(na_ser.rank(na_option='top'), exp_top)
        assert_series_equal(na_ser.rank(na_option='bottom'), exp_bot)
        assert_series_equal(na_ser.rank(na_option='keep'), exp_keep)

        # Test na_option for rank data with ascending False
        exp_top = pd.Series([7., 6., 5., 4., 3., 2., 1.])
        exp_bot = pd.Series([6., 5., 4., 3., 2., 1., 7.])
        exp_keep = pd.Series([6., 5., 4., 3., 2., 1., np.NaN])

        assert_series_equal(
            na_ser.rank(na_option='top', ascending=False),
            exp_top
        )
        assert_series_equal(
            na_ser.rank(na_option='bottom', ascending=False),
            exp_bot
        )
        assert_series_equal(
            na_ser.rank(na_option='keep', ascending=False),
            exp_keep
        )

        # Test with pct=True
        na_ser = pd.Series(
            ['first', 'second', 'third', 'fourth', np.NaN],
        ).astype('category').cat.set_categories(
            ['first', 'second', 'third', 'fourth'],
            ordered=True
        )
        exp_top = pd.Series([0.4, 0.6, 0.8, 1., 0.2])
        exp_bot = pd.Series([0.2, 0.4, 0.6, 0.8, 1.])
        exp_keep = pd.Series([0.25, 0.5, 0.75, 1., np.NaN])

        assert_series_equal(na_ser.rank(na_option='top', pct=True), exp_top)
        assert_series_equal(na_ser.rank(na_option='bottom', pct=True), exp_bot)
        assert_series_equal(na_ser.rank(na_option='keep', pct=True), exp_keep)

    def test_rank_signature(self):
        s = Series([0, 1])
        s.rank(method='average')