Skip to content

Rank categorical #15422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from
Closed

Conversation

jeetjitsu
Copy link
Contributor

check for categorical, and then pass the underlying integer codes.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note, bug fix for 0.20.0

@@ -4549,6 +4549,13 @@ def test_concat_categorical(self):
'h': [None] * 6 + cat_values})
tm.assert_frame_equal(res, exp)

def test_rank_categorical(self):
exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number as a comment

def test_rank_categorical(self):
exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A')
dframe = pd.DataFrame(['first', 'second', 'third', 'fourth', 'fifth', 'sixth'], columns=['A'])
dframe['A'] = dframe['A'].astype('category', ).cat.set_categories(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use a Series directly, no need to use a frame

dframe['A'] = dframe['A'].astype('category', ).cat.set_categories(
['first', 'second', 'third', 'fourth', 'fifth', 'sixth'], ordered=True)
res = dframe['A'].rank()
tm.assert_series_equal(res, exp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add using an ordered=False as well. This should rank the same as s.astype(object).rank()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikilledthecat you will need to check for orderedness in the _get_data_algo. Because for non-ordered, we don't want to use the order of the categories, so ranking based on the codes will not be correct.

It may be faster to actually recompute the codes (something like values.set_categories(values.categories.sort_values()).codes) and still use the integer codes instead of taking the object path for this case.

Copy link
Contributor

@jreback jreback Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think it might be best to simply define .rank() on a Categorical; here you would call values.rank(....with the provided args). This should also be tested with a CategoricalIndex, we might need to define a passthru method on that as well.

Inside Categorical you can THEN construct an object you want ranked (a Series) with the appropriate value (e.g. the codes), but you will have to substitute NaN if any -1's. Then just pass thru the resulting rankings.

Copy link
Contributor Author

@jeetjitsu jeetjitsu Feb 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche @jreback actually just checking if the values are ordered with an and condition passes my tests

elif is_categorical_dtype(values) and values.ordered

so unordered categorical will default to the current implementation, which is the required result.

let me know if i'm missing something here

@@ -4549,6 +4549,13 @@ def test_concat_categorical(self):
'h': [None] * 6 + cat_values})
tm.assert_frame_equal(res, exp)

def test_rank_categorical(self):
exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this tests to pandas/tests/series/test_analytics.py (and put next to the other rank tests)

@jreback jreback added Bug Categorical Categorical Data Type labels Feb 16, 2017
@@ -4549,6 +4549,13 @@ def test_concat_categorical(self):
'h': [None] * 6 + cat_values})
tm.assert_frame_equal(res, exp)

def test_rank_categorical(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test with NaNs? IIRC these are stored as -1 for the codes, but should be NaN in the output of rank.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point @TomAugspurger

please also exericse the na_option and ascending options as well.

Use other tests as models in tests_anlaytics for this type of testing.

@codecov-io
Copy link

codecov-io commented Feb 16, 2017

Codecov Report

Merging #15422 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15422      +/-   ##
==========================================
+ Coverage   90.36%   90.36%   +<.01%     
==========================================
  Files         136      136              
  Lines       49532    49543      +11     
==========================================
+ Hits        44760    44771      +11     
  Misses       4772     4772
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.48% <100%> (+0.01%)
pandas/core/categorical.py 96.91% <100%> (+0.04%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81c57e2...a7e573b. Read the comment docs.

@jeetjitsu
Copy link
Contributor Author

@jreback I have added tests for testing na_option and descending arguments to rank function. All tests are passing.

elif is_categorical_dtype(values):
f = func_map['int64']
values = _ensure_int64(values.codes)
elif is_categorical_dtype(values) and values.ordered:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said I'd like this to be actually in Categorical. and don't use np.vectorize. use masking.

something like (inside Categorical)

codes = self.codes
mask = codes == -1
if mask.any():
    codes = codes.astype('float64')
    codes[mask] = np.nan

@@ -1057,6 +1057,59 @@ def test_rank(self):
iranks = iseries.rank()
assert_series_equal(iranks, exp)

# GH issue #15420 rank incorrectly orders ordered categories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a separate test

@@ -989,6 +989,11 @@ def _get_data_algo(values, func_map):
f = func_map['uint64']
values = _ensure_uint64(values)

elif is_categorical_dtype(values) and values.ordered:
nanMapper = np.vectorize(lambda t: np.NaN if t == -1 else t*1.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ikilledthecat You can do this vectorized, which will be much more performant than applying this function on each element (like codes[codes==-1] = np.nan, after converting it to float)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oeps, missed that @jreback also already commented this.

# Test na_option for rank data
na_ser = pd.Series(
['first', 'second', 'third', 'fourth', 'fifth', 'sixth', np.NaN]
).astype('category', ).cat.set_categories(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can pass categories=[..], ordered=True within the astype call

assert_series_equal(
na_ser.rank(na_option='keep'),
exp_keep
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add such NA tests for the unordered case as well?

And can you put those asserts on a single line like

    assert_series_equal(na_ser.rank(na_option='bottom'), exp_bot)

that should fit within the PEP8 line number of chars limit, but will be more condense

@jeetjitsu
Copy link
Contributor Author

Moved rank function as method inside categoricals. Added additional tests in tests/series/test_analytics though i feel these tests should reside in tests/test_categoricals

@@ -1364,6 +1365,54 @@ def sort_values(self, inplace=False, ascending=True, na_position='last'):
return self._constructor(values=codes, categories=self.categories,
ordered=self.ordered, fastpath=True)

def rank(self, method='average', na_option='keep',
ascending=True, pct=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of duplicating the doc-string. Move the one from pandas/core/generic.py to pandas/core/base.py (ONLY the doc-string) and put it in shared_docs. Then import to Categorical as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other proposal: we make this a private function (_rank), and than the full docstring is not needed (just some explanation of why this function is here and what it does).

IMO it is not needed to have a public Categorical.rank

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is fine as well (then can just accept *args, **kwargs

values, ties_method=method,
na_option=na_option, ascending=ascending, pct=pct
)
return Series(ranks)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, return an ndarray

(e.g. 1, 2, 3) or in percentile form (e.g. 0.333..., 0.666..., 1).
"""
from pandas.core.series import Series
if na_option not in ['keep', 'top', 'bottom']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this na validation actually needs to be done in pandas.core.algorithms.rank itself (so will cover all uses).

raise ValueError('invalid na_position: {!r}'.format(na_option))

codes = self._codes.copy()
codes = codes.astype(float)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

float -> 'float64'

if self._ordered:
na_mask = (codes == -1)
codes[na_mask] = np.nan
codes = _ensure_float64(codes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of all this

at the top rank to the imports from pandas.core.algorithms

if self.ordered:
   codes[codes==-1] = np.nan
return rank(codes, ......)

# Test ascending/descending ranking for ordered categoricals
exp = pd.Series([1., 2., 3., 4., 5., 6.])
exp_desc = pd.Series([6., 5., 4., 3., 2., 1.])
ordered = pd.Categorical(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the categorical tests in pandas/tests/test_categorical. leave the series tests here.

@jeetjitsu
Copy link
Contributor Author

created private method _rank instead of public rank method.
_rank returns ndarray and accepts *args and **kwargs.
_rank uses the rank function from pandas.core.algorithms

@@ -585,3 +585,4 @@ Bug Fixes
- Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
- Bug in ``.rank()`` rank incorrectly orders ordered categories
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't put this at the end, instead use a blank space that I have left above. Add the issue number at the end.

@@ -620,7 +620,10 @@ def rank(values, axis=0, method='average', na_option='keep',
Whether or not to the display the returned rankings in integer form
(e.g. 1, 2, 3) or in percentile form (e.g. 0.333..., 0.666..., 1).
"""
if values.ndim == 1:
if is_categorical(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, instead of here modify _get_data_algo, check if its is_categorical_dtype, then call values = values._values_for_rank() (we'll change the name below)

@@ -1364,6 +1364,29 @@ def sort_values(self, inplace=False, ascending=True, na_position='last'):
return self._constructor(values=codes, categories=self.categories,
ordered=self.ordered, fastpath=True)

def _rank(self, *args, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call this _values_for_rank()

"""
For correctly ranking ordered categorical data. See GH#15420

Ordered categorical data should be ranked on the basis of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with -1 translated to NaN

if self._ordered:
codes = self._codes.astype('float64')
na_mask = (codes == -1)
codes[na_mask] = np.nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are uncessarily converting to object and we are not longer calling rank directly here. (and you don't need to pass thru args, kwargs).

if self.ordered:
    values = self.codes
    mask = values == -1
    if mask.any():
        values = values.astype('float64')
        values[mask] = np.nan
else:
    values = np.array(values)

assert_series_equal(ordered.rank(ascending=False), exp_desc)

# Unordered categoricals should be ranked as objects
unordered = pd.Series(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write down the expected result here and use it in the comparison

@@ -988,7 +988,9 @@ def _get_data_algo(values, func_map):
elif is_unsigned_integer_dtype(values):
f = func_map['uint64']
values = _ensure_uint64(values)

elif is_categorical(values) and values._ordered:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, don't use a private values._ordered here. and use is_categorical_dtype

Copy link
Contributor Author

@jeetjitsu jeetjitsu Feb 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem here is that the ranking function will change based on whether the categorical is ordered. ordered categorical will be ranked based on the codes('float64') unordered categorical will be ranked as objects/strings. This ties in with your comment on pandas/core/categoricals as well. What you showed there is

if self.ordered:
    values = self.codes
    mask = values == -1
    if mask.any():
        values = values.astype('float64')
        values[mask] = np.nan
else:
    values = np.array(values) # values not defined inside np.array

values is not defined in the else clause (inside np.array) so i assume it one of self.codes or self.
In the case that you pass self.codes the ranking will not be consistent with the current functionality. In the case you pass self, a single ranking function will not be able to rank ordered/unordered categoricals. Please let me know if i'm missing something obvious here.

Edit:

A simple example to test out how unordered categoricals are handled currently

a = pd.Categorical(['first', 'second', 'third', 'fourth', np.NaN], ['first', 'second', 'third', 'fourth'], ordered=False)
b = pd.Series(a)
c = pd.Series(a.codes)
all(c.rank() == b.rank())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that else should be values = np.array(self), which will order by the actual objects whether integers, objects or whatever. Its NOT correct to astype(object, rather to use the natural object ordering which is dependent on the dtype (but that is already handled correctly). The key point is that if we are ordered we return a float array based on the codes, else return self.

if self._ordered:
na_mask = (values == -1)
values[na_mask] = np.nan
return values
Copy link
Contributor

@jreback jreback Feb 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls do it like I showed. this MUST be in here and not in external scope (e.g. algos)

@jeetjitsu
Copy link
Contributor Author

Sorry for the delay. Got caught up with work. _values_for_rank works as asked. Hopefully this solves the problem in an acceptable way.

@jorisvandenbossche
Copy link
Member

@ikilledthecat something went wrong with a rebase I think. Can you rebase again against latest master and push again?

git fetch upstream
git checkout rank_categorical
git rebase upstream/master
git push -f origin rank_categorical

@jeetjitsu
Copy link
Contributor Author

rebased against latest master

@@ -988,7 +988,12 @@ def _get_data_algo(values, func_map):
elif is_unsigned_integer_dtype(values):
f = func_map['uint64']
values = _ensure_uint64(values)

elif is_categorical_dtype(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you move this to before the first if, you can do:

if is_categorical_dtype(values):
    values = values._values_for_rank()

if is_float_dtype......

@@ -4549,7 +4549,6 @@ def test_concat_categorical(self):
'h': [None] * 6 + cat_values})
tm.assert_frame_equal(res, exp)


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

believe it not I bet this causes a flake error

@jreback jreback added this to the 0.20.0 milestone Feb 24, 2017
@jreback
Copy link
Contributor

jreback commented Feb 24, 2017

@ikilledthecat 2 small comments. ping on all green.

@jeetjitsu
Copy link
Contributor Author

@jreback changes done!! thanks for the help and patience!

values = values.astype('float64')
values[mask] = np.nan
else:
values = np.array(self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as I said before, I think it can be faster to actually reorder the categories (so rank can use the integer/float codes) instead of passing an object array to rank (or at least check whether the categories are sorted, and in such case also pass the codes).
But given that this is also the current situation, it's not a blocker for this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're interested, that can in a follow-up PR (to get this one merged)

Copy link
Contributor

@jreback jreback Feb 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, @jorisvandenbossche didn't see that comment. yes could be an easy followup, something like.

In [16]: res = Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()

In [17]: res2 = s.rank()

In [18]: res.equals(res2)
Out[18]: True

In [19]: %timeit Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()
100 loops, best of 3: 4.39 ms per loop

In [20]: %timeit s.rank()
10 loops, best of 3: 132 ms per loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche sure. Will do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the issue #15498 for this

@jreback jreback mentioned this pull request Feb 24, 2017
@jreback jreback closed this in 3fe85af Feb 24, 2017
@jreback
Copy link
Contributor

jreback commented Feb 24, 2017

thanks @ikilledthecat

note that for the perf issues, might have to slightly reorg the code (meaning that _values_for_rank would need to be passed the .rank() params

@jeetjitsu jeetjitsu deleted the rank_categorical branch February 25, 2017 10:55
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this pull request Mar 21, 2017
check for categorical, and then pass the underlying integer codes.
closes pandas-dev#15420

Author: Prasanjit Prakash <[email protected]>

Closes pandas-dev#15422 from ikilledthecat/rank_categorical and squashes the following commits:

a7e573b [Prasanjit Prakash] moved test for categorical, in rank, to top
3ba4e3a [Prasanjit Prakash] corrections after rebasing
c43a029 [Prasanjit Prakash] using if/else construct to pick sorting function for categoricals
f8ec019 [Prasanjit Prakash] ask Categorical for ranking function
40d88c1 [Prasanjit Prakash] return values for rank from categorical object
049c0fc [Prasanjit Prakash] GH#15420 added support for na_option when ranking categorical
5e5bbeb [Prasanjit Prakash] BUG: GH#15420 rank for categoricals
ef999c3 [Prasanjit Prakash] merged with upstream master
fbaba1b [Prasanjit Prakash] return values for rank from categorical object
fa0b4c2 [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical
9a6b5cd [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical
4220e56 [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical
6b70921 [Prasanjit Prakash] GH#15420 move rank inside categoricals
bf4e36c [Prasanjit Prakash] GH#15420 added support for na_option when ranking categorical
ce90207 [Prasanjit Prakash] BUG: GH#15420 rank for categoricals
85b267a [Prasanjit Prakash] Added support for categorical datatype in rank - issue#15420
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rank incorrectly orders ordered categories
6 participants