Rank categorical #15422

jeetjitsu · 2017-02-16T14:05:04Z

check for categorical, and then pass the underlying integer codes.

closes rank incorrectly orders ordered categories #15420
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

jreback

pls add a whatsnew note, bug fix for 0.20.0

jreback · 2017-02-16T14:07:08Z

pandas/tests/test_categorical.py

@@ -4549,6 +4549,13 @@ def test_concat_categorical(self):
                            'h': [None] * 6 + cat_values})
        tm.assert_frame_equal(res, exp)

+    def test_rank_categorical(self):
+        exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A')


add the issue number as a comment

jreback · 2017-02-16T14:08:28Z

pandas/tests/test_categorical.py

+    def test_rank_categorical(self):
+        exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A')
+        dframe = pd.DataFrame(['first', 'second', 'third', 'fourth', 'fifth', 'sixth'], columns=['A'])
+        dframe['A'] = dframe['A'].astype('category', ).cat.set_categories(


just use a Series directly, no need to use a frame

jreback · 2017-02-16T14:09:52Z

pandas/tests/test_categorical.py

+        dframe['A'] = dframe['A'].astype('category', ).cat.set_categories(
+                ['first', 'second', 'third', 'fourth', 'fifth', 'sixth'], ordered=True)
+        res = dframe['A'].rank()
+        tm.assert_series_equal(res, exp)


add a test using ordered=False as well. This should rank the same as s.astype(object).rank()

@ikilledthecat you will need to check for orderedness in the _get_data_algo. Because for non-ordered, we don't want to use the order of the categories, so ranking based on the codes will not be correct.

It may be faster to actually recompute the codes (something like values.set_categories(values.categories.sort_values()).codes) and still use the integer codes instead of taking the object path for this case.

I actually think it might be best to simply define .rank() on a Categorical; here you would call values.rank(....with the provided args). This should also be tested with a CategoricalIndex, we might need to define a passthru method on that as well.

Inside Categorical you can THEN construct an object you want ranked (a Series) with the appropriate value (e.g. the codes), but you will have to substitute NaN if any -1's. Then just pass thru the resulting rankings.

@jorisvandenbossche @jreback actually just checking if the values are ordered with an and condition passes my tests

elif is_categorical_dtype(values) and values.ordered

so unordered categorical will default to the current implementation, which is the required result.

let me know if i'm missing something here
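For reference, the behaviour being agreed on in this thread can be checked through the public API. This is a minimal sketch, assuming a recent pandas release that includes this fix; the variable names are illustrative and not taken from the PR.

```python
import pandas as pd

labels = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']

# Ordered: ranks should follow the declared category order.
ordered = pd.Series(pd.Categorical(labels, categories=labels, ordered=True))
ordered_ranks = ordered.rank()

# Unordered: the category order carries no meaning, so ranks should match
# a plain (lexical) ranking of the underlying objects.
unordered = pd.Series(pd.Categorical(labels, categories=labels, ordered=False))
unordered_ranks = unordered.rank()
object_ranks = pd.Series(labels, dtype=object).rank()

print(ordered_ranks.tolist())
print(unordered_ranks.tolist())
```

Lexically, 'fifth' sorts first and 'third' sorts last, which is why the unordered ranks differ from the ordered ones.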

jreback · 2017-02-16T14:10:45Z

pandas/tests/test_categorical.py

@@ -4549,6 +4549,13 @@ def test_concat_categorical(self):
                            'h': [None] * 6 + cat_values})
        tm.assert_frame_equal(res, exp)

+    def test_rank_categorical(self):
+        exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A')


move this test to pandas/tests/series/test_analytics.py (and put it next to the other rank tests)

TomAugspurger · 2017-02-16T14:12:17Z

pandas/tests/test_categorical.py

@@ -4549,6 +4549,13 @@ def test_concat_categorical(self):
                            'h': [None] * 6 + cat_values})
        tm.assert_frame_equal(res, exp)

+    def test_rank_categorical(self):


Can you add a test with NaNs? IIRC these are stored as -1 for the codes, but should be NaN in the output of rank.

good point @TomAugspurger

please also exercise the na_option and ascending options as well.

Use other tests in test_analytics as models for this type of testing.
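A small sketch of what such a test could exercise, assuming a pandas version with this fix applied. The data and names below are made up for illustration; they are not the tests that were added in the PR.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(['first', 'second', None, 'third'],
                     categories=['first', 'second', 'third'],
                     ordered=True)
ser = pd.Series(cat)

# Missing values are stored as -1 in the integer codes ...
codes = cat.codes

# ... but must come back as NaN (or be placed first/last) in the ranks.
keep = ser.rank(na_option='keep')
top = ser.rank(na_option='top')
bottom = ser.rank(na_option='bottom')
desc = ser.rank(ascending=False)

print(codes.tolist())
print(keep.tolist(), top.tolist(), bottom.tolist(), desc.tolist())
```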

codecov-io · 2017-02-16T15:36:12Z

Codecov Report

Merging #15422 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15422      +/-   ##
==========================================
+ Coverage   90.36%   90.36%   +<.01%     
==========================================
  Files         136      136              
  Lines       49532    49543      +11     
==========================================
+ Hits        44760    44771      +11     
  Misses       4772     4772

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.48% <100%> (+0.01%)`	✅
pandas/core/categorical.py	`96.91% <100%> (+0.04%)`	✅

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 81c57e2...a7e573b.

jeetjitsu · 2017-02-16T18:01:14Z

@jreback I have added tests for testing na_option and descending arguments to rank function. All tests are passing.

jreback · 2017-02-16T18:02:04Z

pandas/core/algorithms.py

-    elif is_categorical_dtype(values):
-        f = func_map['int64']
-        values = _ensure_int64(values.codes)
+    elif is_categorical_dtype(values) and values.ordered:


as I said I'd like this to be actually in Categorical. and don't use np.vectorize. use masking.

something like (inside Categorical)

    codes = self.codes
    mask = codes == -1
    if mask.any():
        codes = codes.astype('float64')
        codes[mask] = np.nan

jreback · 2017-02-16T18:02:20Z

pandas/tests/series/test_analytics.py

@@ -1057,6 +1057,59 @@ def test_rank(self):
        iranks = iseries.rank()
        assert_series_equal(iranks, exp)

+        # GH issue #15420 rank incorrectly orders ordered categories


make this a separate test

jorisvandenbossche · 2017-02-16T23:00:14Z

pandas/core/algorithms.py

@@ -989,6 +989,11 @@ def _get_data_algo(values, func_map):
        f = func_map['uint64']
        values = _ensure_uint64(values)

+    elif is_categorical_dtype(values) and values.ordered:
+        nanMapper = np.vectorize(lambda t: np.NaN if t == -1 else t*1.)


@ikilledthecat You can do this vectorized, which will be much more performant than applying this function on each element (like codes[codes==-1] = np.nan, after converting it to float)

Oops, missed that @jreback had already commented on this.
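To make the difference concrete, here is a minimal NumPy-only sketch comparing the element-wise np.vectorize approach from the diff with the boolean-mask approach both reviewers suggest. The sample codes array is made up for illustration.

```python
import numpy as np

# Integer codes as a Categorical would store them; -1 marks a missing value.
codes = np.array([0, 1, -1, 2, -1], dtype='int8')

# Element-wise approach from the diff: a Python-level call per element.
nan_mapper = np.vectorize(lambda t: np.nan if t == -1 else t * 1.)
elementwise = nan_mapper(codes)

# Masked approach: convert once to float, then assign NaN through a mask.
masked = codes.astype('float64')
masked[masked == -1] = np.nan

print(elementwise)
print(masked)
```

Both produce the same float array; the masked version keeps the work inside NumPy rather than calling a Python function for every element.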

jorisvandenbossche · 2017-02-16T23:00:18Z

pandas/tests/series/test_analytics.py

+        # Test na_option for rank data
+        na_ser = pd.Series(
+            ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', np.NaN]
+        ).astype('category', ).cat.set_categories(


you can pass categories=[..], ordered=True within the astype call
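Note for readers on current pandas: the keyword form astype('category', categories=..., ordered=True) discussed here was dropped in later releases. A hedged sketch of the present-day equivalent, using CategoricalDtype (names such as na_ser mirror the diff, the rest is illustrative):

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']

# Categories and orderedness are carried by the dtype object.
dtype = CategoricalDtype(categories=labels, ordered=True)
na_ser = pd.Series(labels + [np.nan]).astype(dtype)

print(na_ser.cat.ordered)
print(na_ser.rank(na_option='bottom').tolist())
```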

jorisvandenbossche · 2017-02-16T23:08:33Z

pandas/tests/series/test_analytics.py

+        assert_series_equal(
+            na_ser.rank(na_option='keep'),
+            exp_keep
+        )


Can you add such NA tests for the unordered case as well?

And can you put those asserts on a single line, like

assert_series_equal(na_ser.rank(na_option='bottom'), exp_bot)

That should fit within the PEP8 line-length limit, and is more condensed.

jeetjitsu · 2017-02-17T10:10:42Z

Moved rank function as method inside categoricals. Added additional tests in tests/series/test_analytics though i feel these tests should reside in tests/test_categoricals

jreback · 2017-02-17T14:59:35Z

pandas/core/categorical.py

@@ -1364,6 +1365,54 @@ def sort_values(self, inplace=False, ascending=True, na_position='last'):
            return self._constructor(values=codes, categories=self.categories,
                                     ordered=self.ordered, fastpath=True)

+    def rank(self, method='average', na_option='keep',
+             ascending=True, pct=False):


instead of duplicating the doc-string. Move the one from pandas/core/generic.py to pandas/core/base.py (ONLY the doc-string) and put it in shared_docs. Then import to Categorical as well.

Other proposal: we make this a private function (_rank), and then the full docstring is not needed (just some explanation of why this function is here and what it does).

IMO it is not needed to have a public Categorical.rank

that is fine as well (then it can just accept *args, **kwargs)

jreback · 2017-02-17T14:59:48Z

pandas/core/categorical.py

+                values, ties_method=method,
+                na_option=na_option, ascending=ascending, pct=pct
+            )
+        return Series(ranks)


no, return an ndarray

jreback · 2017-02-17T15:02:07Z

pandas/core/categorical.py

+            (e.g. 1, 2, 3) or in percentile form (e.g. 0.333..., 0.666..., 1).
+        """
+        from pandas.core.series import Series
+        if na_option not in ['keep', 'top', 'bottom']:


this na validation actually needs to be done in pandas.core.algorithms.rank itself (so will cover all uses).
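The idea is that a single check at the shared entry point covers every caller, so Categorical does not need its own copy. A minimal pure-Python sketch of such a check; the helper name and message are hypothetical and not the code that ended up in pandas.core.algorithms.

```python
# Hypothetical helper: validate once, at the common entry point.
VALID_NA_OPTIONS = ('keep', 'top', 'bottom')


def validate_na_option(na_option):
    """Return na_option unchanged, or raise ValueError if unsupported."""
    if na_option not in VALID_NA_OPTIONS:
        raise ValueError('invalid na_option: {!r}'.format(na_option))
    return na_option


print(validate_na_option('keep'))
```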

jreback · 2017-02-17T15:02:38Z

pandas/core/categorical.py

+            raise ValueError('invalid na_position: {!r}'.format(na_option))
+
+        codes = self._codes.copy()
+        codes = codes.astype(float)


float -> 'float64'

jreback · 2017-02-17T15:04:50Z

pandas/core/categorical.py

+        if self._ordered:
+            na_mask = (codes == -1)
+            codes[na_mask] = np.nan
+            codes = _ensure_float64(codes)


instead of all this

at the top, add rank to the imports from pandas.core.algorithms

    if self.ordered:
        codes[codes == -1] = np.nan
    return rank(codes, ......)

jreback · 2017-02-17T15:05:28Z

pandas/tests/series/test_analytics.py

+        # Test ascending/descending ranking for ordered categoricals
+        exp = pd.Series([1., 2., 3., 4., 5., 6.])
+        exp_desc = pd.Series([6., 5., 4., 3., 2., 1.])
+        ordered = pd.Categorical(


put the categorical tests in pandas/tests/test_categorical. leave the series tests here.

jeetjitsu · 2017-02-18T08:41:09Z

created private method _rank instead of public rank method.
_rank returns ndarray and accepts *args and **kwargs.
_rank uses the rank function from pandas.core.algorithms

jreback · 2017-02-18T15:05:14Z

doc/source/whatsnew/v0.20.0.txt

@@ -585,3 +585,4 @@ Bug Fixes
 - Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)
 - Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`)
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
+- Bug in ``.rank()`` rank incorrectly orders ordered categories


don't put this at the end, instead use a blank space that I have left above. Add the issue number at the end.

jreback · 2017-02-18T15:07:55Z

pandas/core/algorithms.py

@@ -620,7 +620,10 @@ def rank(values, axis=0, method='average', na_option='keep',
        Whether or not to the display the returned rankings in integer form
        (e.g. 1, 2, 3) or in percentile form (e.g. 0.333..., 0.666..., 1).
    """
-    if values.ndim == 1:
+    if is_categorical(values):


no, instead of here modify _get_data_algo, check if its is_categorical_dtype, then call values = values._values_for_rank() (we'll change the name below)

jreback · 2017-02-18T15:08:10Z

pandas/core/categorical.py

@@ -1364,6 +1364,29 @@ def sort_values(self, inplace=False, ascending=True, na_position='last'):
            return self._constructor(values=codes, categories=self.categories,
                                     ordered=self.ordered, fastpath=True)

+    def _rank(self, *args, **kwargs):


call this _values_for_rank()

jreback · 2017-02-18T15:08:38Z

pandas/core/categorical.py

+        """
+        For correctly ranking ordered categorical data. See GH#15420
+
+        Ordered categorical data should be ranked on the basis of


with -1 translated to NaN

jreback · 2017-02-18T15:10:47Z

pandas/core/categorical.py

+        if self._ordered:
+            codes = self._codes.astype('float64')
+            na_mask = (codes == -1)
+            codes[na_mask] = np.nan


you are unnecessarily converting to object and we are no longer calling rank directly here. (and you don't need to pass thru args, kwargs).

    if self.ordered:
        values = self.codes
        mask = values == -1
        if mask.any():
            values = values.astype('float64')
            values[mask] = np.nan
    else:
        values = np.array(values)

jreback · 2017-02-18T15:20:21Z

pandas/tests/series/test_analytics.py

+        assert_series_equal(ordered.rank(ascending=False), exp_desc)
+
+        # Unordered categoricals should be ranked as objects
+        unordered = pd.Series(


write down the expected result here and use it in the comparison

jreback · 2017-02-18T16:38:33Z

pandas/core/algorithms.py

@@ -988,7 +988,9 @@ def _get_data_algo(values, func_map):
    elif is_unsigned_integer_dtype(values):
        f = func_map['uint64']
        values = _ensure_uint64(values)
-
+    elif is_categorical(values) and values._ordered:


no, don't use a private values._ordered here. and use is_categorical_dtype

the problem here is that the ranking function will change based on whether the categorical is ordered: an ordered categorical will be ranked based on the codes ('float64'), while an unordered categorical will be ranked as objects/strings. This ties in with your comment on pandas/core/categorical.py as well. What you showed there is

    if self.ordered:
        values = self.codes
        mask = values == -1
        if mask.any():
            values = values.astype('float64')
            values[mask] = np.nan
    else:
        values = np.array(values)  # values not defined inside np.array

values is not defined in the else clause (inside np.array), so I assume it is one of self.codes or self.
In the case that you pass self.codes, the ranking will not be consistent with the current functionality. In the case that you pass self, a single ranking function will not be able to rank both ordered and unordered categoricals. Please let me know if I'm missing something obvious here.

Edit:

A simple example to test out how unordered categoricals are handled currently

    a = pd.Categorical(['first', 'second', 'third', 'fourth', np.NaN],
                       ['first', 'second', 'third', 'fourth'], ordered=False)
    b = pd.Series(a)
    c = pd.Series(a.codes)
    all(c.rank() == b.rank())

that else should be values = np.array(self), which will order by the actual objects, whether integers, objects or whatever. It's NOT correct to astype(object); rather, use the natural object ordering, which depends on the dtype (but that is already handled correctly). The key point is that if we are ordered we return a float array based on the codes, else we return self.
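Putting the pieces of this exchange together, the requested logic can be sketched as a stand-alone function over the public Categorical API. The function name is hypothetical; in the PR this lives as a private method on Categorical, and the sketch below is not a copy of the merged code.

```python
import numpy as np
import pandas as pd


def values_for_rank_sketch(cat):
    """Return an array that a generic ranking routine can consume."""
    if cat.ordered:
        # Ordered: the integer codes already encode the category order.
        values = cat.codes
        mask = values == -1
        if mask.any():
            # -1 means missing; switch to float so NaN can be stored.
            values = values.astype('float64')
            values[mask] = np.nan
    else:
        # Unordered: fall back to the natural ordering of the values.
        values = np.array(cat)
    return values


ordered = pd.Categorical(['lo', 'hi', None, 'mid'],
                         categories=['lo', 'mid', 'hi'], ordered=True)
unordered = pd.Categorical(['b', 'a', None])

print(values_for_rank_sketch(ordered))
print(values_for_rank_sketch(unordered))
```

When an ordered categorical has no missing values, the codes are returned as integers, so the float conversion is only paid for when it is needed.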

jreback · 2017-02-18T16:38:55Z

pandas/core/categorical.py

+        if self._ordered:
+            na_mask = (values == -1)
+            values[na_mask] = np.nan
+        return values


pls do it like I showed. this MUST be in here and not in external scope (e.g. algos)

jeetjitsu · 2017-02-24T07:56:51Z

Sorry for the delay. Got caught up with work. _values_for_rank works as asked. Hopefully this solves the problem in an acceptable way.

jorisvandenbossche · 2017-02-24T09:26:39Z

@ikilledthecat something went wrong with a rebase I think. Can you rebase again against latest master and push again?

git fetch upstream
git checkout rank_categorical
git rebase upstream/master
git push -f origin rank_categorical

jeetjitsu · 2017-02-24T12:23:30Z

rebased against latest master

jreback · 2017-02-24T12:44:16Z

pandas/core/algorithms.py

@@ -988,7 +988,12 @@ def _get_data_algo(values, func_map):
    elif is_unsigned_integer_dtype(values):
        f = func_map['uint64']
        values = _ensure_uint64(values)
-
+    elif is_categorical_dtype(values):


if you move this to before the first if, you can do:

    if is_categorical_dtype(values):
        values = values._values_for_rank()

    if is_float_dtype......
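The point of moving the check to the top is that the categorical is unwrapped first and then flows through the ordinary dtype dispatch. A rough sketch of that ordering, using public pandas helpers; pick_rank_kind and the returned labels are hypothetical and only stand in for the lookup that _get_data_algo performs.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype


def pick_rank_kind(values):
    """Unwrap categoricals first, then dispatch on the resulting dtype."""
    if isinstance(values, pd.Categorical):
        if values.ordered:
            codes = values.codes
            mask = codes == -1
            if mask.any():
                codes = codes.astype('float64')
                codes[mask] = np.nan
            values = codes
        else:
            values = np.array(values)

    # From here on, no categorical-specific branch is needed.
    if is_float_dtype(values):
        kind = 'float64'
    elif is_integer_dtype(values):
        kind = 'int64'
    else:
        kind = 'object'
    return kind, values


kind, _ = pick_rank_kind(pd.Categorical(['a', None], categories=['a'], ordered=True))
print(kind)
```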

jreback · 2017-02-24T12:45:19Z

pandas/tests/test_categorical.py

@@ -4549,7 +4549,6 @@ def test_concat_categorical(self):
                            'h': [None] * 6 + cat_values})
        tm.assert_frame_equal(res, exp)

-


believe it or not, I bet this causes a flake8 error

jreback · 2017-02-24T12:46:36Z

@ikilledthecat 2 small comments. ping on all green.

jeetjitsu · 2017-02-24T13:20:37Z

@jreback changes done!! thanks for the help and patience!

jorisvandenbossche · 2017-02-24T13:22:17Z

pandas/core/categorical.py

+                values = values.astype('float64')
+                values[mask] = np.nan
+        else:
+            values = np.array(self)


as I said before, I think it can be faster to actually reorder the categories (so rank can use the integer/float codes) instead of passing an object array to rank (or at least check whether the categories are sorted, and in such case also pass the codes).
But given that this is also the current situation, it's not a blocker for this PR.

If you're interested, that can go in a follow-up PR (to get this one merged)

sorry, @jorisvandenbossche didn't see that comment. yes could be an easy followup, something like.

In [16]: res = Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()

In [17]: res2 = s.rank()

In [18]: res.equals(res2)
Out[18]: True

In [19]: %timeit Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()
100 loops, best of 3: 4.39 ms per loop

In [20]: %timeit s.rank()
10 loops, best of 3: 132 ms per loop

@jorisvandenbossche sure. Will do

I added the issue #15498 for this
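The follow-up idea can be tried on a small example: rank the categories once, rename each category to its rank, and then rank the resulting numeric array instead of an object array. This is a sketch under the assumption of a recent pandas; the ranks are passed as a plain NumPy array because current pandas treats a Series given to rename_categories as a mapping rather than a list. The sample data is made up.

```python
import numpy as np
import pandas as pd

# Unordered categorical whose categories are deliberately not sorted.
s = pd.Series(pd.Categorical(['b', 'a', 'c', 'a', None],
                             categories=['c', 'a', 'b']))

# Rank the (few) categories once, then substitute each value by that rank.
cat_ranks = pd.Series(list(s.cat.categories), dtype=object).rank().to_numpy()
numeric = np.array(s.cat.rename_categories(cat_ranks))
fast = pd.Series(numeric).rank()

# Reference result: rank the underlying objects directly.
slow = s.astype(object).rank()

print(fast.tolist())
print(slow.tolist())
```

The expensive object comparison is then only done over the categories, not over every element, which is where the speed-up in the timing above comes from.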

jreback · 2017-02-24T19:58:05Z

thanks @ikilledthecat

note that for the perf issues, we might have to slightly reorg the code (meaning that _values_for_rank would need to be passed the .rank() params)

check for categorical, and then pass the underlying integer codes.

closes pandas-dev#15420

Author: Prasanjit Prakash <[email protected]>

Closes pandas-dev#15422 from ikilledthecat/rank_categorical and squashes the following commits:

a7e573b [Prasanjit Prakash] moved test for categorical, in rank, to top
3ba4e3a [Prasanjit Prakash] corrections after rebasing
c43a029 [Prasanjit Prakash] using if/else construct to pick sorting function for categoricals
f8ec019 [Prasanjit Prakash] ask Categorical for ranking function
40d88c1 [Prasanjit Prakash] return values for rank from categorical object
049c0fc [Prasanjit Prakash] GH#15420 added support for na_option when ranking categorical
5e5bbeb [Prasanjit Prakash] BUG: GH#15420 rank for categoricals
ef999c3 [Prasanjit Prakash] merged with upstream master
fbaba1b [Prasanjit Prakash] return values for rank from categorical object
fa0b4c2 [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical
9a6b5cd [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical
4220e56 [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical
6b70921 [Prasanjit Prakash] GH#15420 move rank inside categoricals
bf4e36c [Prasanjit Prakash] GH#15420 added support for na_option when ranking categorical
ce90207 [Prasanjit Prakash] BUG: GH#15420 rank for categoricals
85b267a [Prasanjit Prakash] Added support for categorical datatype in rank - issue#15420

jreback requested changes Feb 16, 2017

jreback added Bug Categorical Categorical Data Type labels Feb 16, 2017

TomAugspurger reviewed Feb 16, 2017

jreback reviewed Feb 16, 2017

jorisvandenbossche reviewed Feb 16, 2017

jreback reviewed Feb 17, 2017

jreback requested changes Feb 17, 2017

jreback requested changes Feb 18, 2017

jreback reviewed Feb 18, 2017

jeet63 added 11 commits February 24, 2017 17:05

Added support for categorical datatype in rank - issue#15420

85b267a

BUG: GH#15420 rank for categoricals

ce90207

GH#15420 added support for na_option when ranking categorical

bf4e36c

GH#15420 move rank inside categoricals

6b70921

BUG: GH15420 - _rank private method on Categorical

4220e56

BUG: GH15420 - _rank private method on Categorical

9a6b5cd

BUG: GH15420 - _rank private method on Categorical

fa0b4c2

return values for rank from categorical object

fbaba1b

merged with upstream master

ef999c3

BUG: GH#15420 rank for categoricals

5e5bbeb

GH#15420 added support for na_option when ranking categorical

049c0fc

jeet63 added 4 commits February 24, 2017 17:42

return values for rank from categorical object

40d88c1

ask Categorical for ranking function

f8ec019

using if/else construct to pick sorting function for categoricals

c43a029

corrections after rebasing

3ba4e3a

jeetjitsu force-pushed the rank_categorical branch from 9c8b2b8 to 3ba4e3a Compare February 24, 2017 12:22

jreback reviewed Feb 24, 2017

jreback added this to the 0.20.0 milestone Feb 24, 2017

jreback approved these changes Feb 24, 2017

moved test for categorical, in rank, to top

a7e573b

jorisvandenbossche reviewed Feb 24, 2017

jreback mentioned this pull request Feb 24, 2017

PERF: categorical rank #15498

Closed

jreback closed this in 3fe85af Feb 24, 2017

jeetjitsu deleted the rank_categorical branch February 25, 2017 10:55

jbrockmendel mentioned this pull request Apr 2, 2020

BUG: Categorical.values_for_(factorize|argsort) dont preserve order #33245

Open
