-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Rank categorical #15422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rank categorical #15422
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a whatsnew note, bug fix for 0.20.0
pandas/tests/test_categorical.py
Outdated
@@ -4549,6 +4549,13 @@ def test_concat_categorical(self): | |||
'h': [None] * 6 + cat_values}) | |||
tm.assert_frame_equal(res, exp) | |||
|
|||
def test_rank_categorical(self): | |||
exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment
pandas/tests/test_categorical.py
Outdated
def test_rank_categorical(self): | ||
exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A') | ||
dframe = pd.DataFrame(['first', 'second', 'third', 'fourth', 'fifth', 'sixth'], columns=['A']) | ||
dframe['A'] = dframe['A'].astype('category', ).cat.set_categories( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just use a Series directly, no need to use a frame
pandas/tests/test_categorical.py
Outdated
dframe['A'] = dframe['A'].astype('category', ).cat.set_categories( | ||
['first', 'second', 'third', 'fourth', 'fifth', 'sixth'], ordered=True) | ||
res = dframe['A'].rank() | ||
tm.assert_series_equal(res, exp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add using an ordered=False
as well. This should rank the same as s.astype(object).rank()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ikilledthecat you will need to check for orderedness in the _get_data_algo
. Because for non-ordered, we don't want to use the order of the categories, so ranking based on the codes will not be correct.
It may be faster to actually recompute the codes (something like values.set_categories(values.categories.sort_values()).codes
) and still use the integer codes instead of taking the object path for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think it might be best to simply define .rank()
on a Categorical
; here you would call values.rank(....with the provided args)
. This should also be tested with a CategoricalIndex
, we might need to define a passthru method on that as well.
Inside Categorical
you can THEN construct an object you want ranked (a Series) with the appropriate value (e.g. the codes), but you will have to substitute NaN if any -1's. Then just pass thru the resulting rankings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche @jreback actually just checking if the values are ordered with an and condition passes my tests
elif is_categorical_dtype(values) and values.ordered
so unordered categorical will default to the current implementation, which is the required result.
let me know if i'm missing something here
pandas/tests/test_categorical.py
Outdated
@@ -4549,6 +4549,13 @@ def test_concat_categorical(self): | |||
'h': [None] * 6 + cat_values}) | |||
tm.assert_frame_equal(res, exp) | |||
|
|||
def test_rank_categorical(self): | |||
exp = pd.Series([1., 2., 3., 4., 5., 6.], name='A') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this tests to pandas/tests/series/test_analytics.py
(and put next to the other rank tests)
pandas/tests/test_categorical.py
Outdated
@@ -4549,6 +4549,13 @@ def test_concat_categorical(self): | |||
'h': [None] * 6 + cat_values}) | |||
tm.assert_frame_equal(res, exp) | |||
|
|||
def test_rank_categorical(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test with NaN
s? IIRC these are stored as -1
for the codes, but should be NaN
in the output of rank.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point @TomAugspurger
please also exericse the na_option
and ascending
options as well.
Use other tests as models in tests_anlaytics for this type of testing.
Codecov Report
@@ Coverage Diff @@
## master #15422 +/- ##
==========================================
+ Coverage 90.36% 90.36% +<.01%
==========================================
Files 136 136
Lines 49532 49543 +11
==========================================
+ Hits 44760 44771 +11
Misses 4772 4772
Continue to review full report at Codecov.
|
@jreback I have added tests for testing na_option and descending arguments to rank function. All tests are passing. |
pandas/core/algorithms.py
Outdated
elif is_categorical_dtype(values): | ||
f = func_map['int64'] | ||
values = _ensure_int64(values.codes) | ||
elif is_categorical_dtype(values) and values.ordered: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as I said I'd like this to be actually in Categorical
. and don't use np.vectorize
. use masking.
something like (inside Categorical)
codes = self.codes
mask = codes == -1
if mask.any():
codes = codes.astype('float64')
codes[mask] = np.nan
@@ -1057,6 +1057,59 @@ def test_rank(self): | |||
iranks = iseries.rank() | |||
assert_series_equal(iranks, exp) | |||
|
|||
# GH issue #15420 rank incorrectly orders ordered categories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a separate test
pandas/core/algorithms.py
Outdated
@@ -989,6 +989,11 @@ def _get_data_algo(values, func_map): | |||
f = func_map['uint64'] | |||
values = _ensure_uint64(values) | |||
|
|||
elif is_categorical_dtype(values) and values.ordered: | |||
nanMapper = np.vectorize(lambda t: np.NaN if t == -1 else t*1.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ikilledthecat You can do this vectorized, which will be much more performant than applying this function on each element (like codes[codes==-1] = np.nan
, after converting it to float)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oeps, missed that @jreback also already commented this.
# Test na_option for rank data | ||
na_ser = pd.Series( | ||
['first', 'second', 'third', 'fourth', 'fifth', 'sixth', np.NaN] | ||
).astype('category', ).cat.set_categories( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can pass categories=[..], ordered=True
within the astype
call
assert_series_equal( | ||
na_ser.rank(na_option='keep'), | ||
exp_keep | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add such NA tests for the unordered case as well?
And can you put those asserts on a single line like
assert_series_equal(na_ser.rank(na_option='bottom'), exp_bot)
that should fit within the PEP8 line number of chars limit, but will be more condense
Moved rank function as method inside categoricals. Added additional tests in tests/series/test_analytics though i feel these tests should reside in tests/test_categoricals |
pandas/core/categorical.py
Outdated
@@ -1364,6 +1365,54 @@ def sort_values(self, inplace=False, ascending=True, na_position='last'): | |||
return self._constructor(values=codes, categories=self.categories, | |||
ordered=self.ordered, fastpath=True) | |||
|
|||
def rank(self, method='average', na_option='keep', | |||
ascending=True, pct=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of duplicating the doc-string. Move the one from pandas/core/generic.py
to pandas/core/base.py
(ONLY the doc-string) and put it in shared_docs. Then import to Categorical as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other proposal: we make this a private function (_rank
), and than the full docstring is not needed (just some explanation of why this function is here and what it does).
IMO it is not needed to have a public Categorical.rank
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is fine as well (then can just accept *args, **kwargs
pandas/core/categorical.py
Outdated
values, ties_method=method, | ||
na_option=na_option, ascending=ascending, pct=pct | ||
) | ||
return Series(ranks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, return an ndarray
pandas/core/categorical.py
Outdated
(e.g. 1, 2, 3) or in percentile form (e.g. 0.333..., 0.666..., 1). | ||
""" | ||
from pandas.core.series import Series | ||
if na_option not in ['keep', 'top', 'bottom']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this na validation actually needs to be done in pandas.core.algorithms.rank
itself (so will cover all uses).
pandas/core/categorical.py
Outdated
raise ValueError('invalid na_position: {!r}'.format(na_option)) | ||
|
||
codes = self._codes.copy() | ||
codes = codes.astype(float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float -> 'float64'
pandas/core/categorical.py
Outdated
if self._ordered: | ||
na_mask = (codes == -1) | ||
codes[na_mask] = np.nan | ||
codes = _ensure_float64(codes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of all this
at the top rank
to the imports from pandas.core.algorithms
if self.ordered:
codes[codes==-1] = np.nan
return rank(codes, ......)
# Test ascending/descending ranking for ordered categoricals | ||
exp = pd.Series([1., 2., 3., 4., 5., 6.]) | ||
exp_desc = pd.Series([6., 5., 4., 3., 2., 1.]) | ||
ordered = pd.Categorical( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the categorical tests in pandas/tests/test_categorical. leave the series tests here.
created private method _rank instead of public rank method. |
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -585,3 +585,4 @@ Bug Fixes | |||
- Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`) | |||
- Bug in ``pd.melt()`` where passing a tuple value for ``value_vars`` caused a ``TypeError`` (:issue:`15348`) | |||
- Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`) | |||
- Bug in ``.rank()`` rank incorrectly orders ordered categories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't put this at the end, instead use a blank space that I have left above. Add the issue number at the end.
pandas/core/algorithms.py
Outdated
@@ -620,7 +620,10 @@ def rank(values, axis=0, method='average', na_option='keep', | |||
Whether or not to the display the returned rankings in integer form | |||
(e.g. 1, 2, 3) or in percentile form (e.g. 0.333..., 0.666..., 1). | |||
""" | |||
if values.ndim == 1: | |||
if is_categorical(values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, instead of here modify _get_data_algo
, check if its is_categorical_dtype
, then call values = values._values_for_rank()
(we'll change the name below)
pandas/core/categorical.py
Outdated
@@ -1364,6 +1364,29 @@ def sort_values(self, inplace=False, ascending=True, na_position='last'): | |||
return self._constructor(values=codes, categories=self.categories, | |||
ordered=self.ordered, fastpath=True) | |||
|
|||
def _rank(self, *args, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this _values_for_rank()
pandas/core/categorical.py
Outdated
""" | ||
For correctly ranking ordered categorical data. See GH#15420 | ||
|
||
Ordered categorical data should be ranked on the basis of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with -1 translated to NaN
pandas/core/categorical.py
Outdated
if self._ordered: | ||
codes = self._codes.astype('float64') | ||
na_mask = (codes == -1) | ||
codes[na_mask] = np.nan |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are uncessarily converting to object and we are not longer calling rank directly here. (and you don't need to pass thru args, kwargs).
if self.ordered:
values = self.codes
mask = values == -1
if mask.any():
values = values.astype('float64')
values[mask] = np.nan
else:
values = np.array(values)
assert_series_equal(ordered.rank(ascending=False), exp_desc) | ||
|
||
# Unordered categoricals should be ranked as objects | ||
unordered = pd.Series( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write down the expected result here and use it in the comparison
pandas/core/algorithms.py
Outdated
@@ -988,7 +988,9 @@ def _get_data_algo(values, func_map): | |||
elif is_unsigned_integer_dtype(values): | |||
f = func_map['uint64'] | |||
values = _ensure_uint64(values) | |||
|
|||
elif is_categorical(values) and values._ordered: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, don't use a private values._ordered
here. and use is_categorical_dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the problem here is that the ranking function will change based on whether the categorical is ordered. ordered categorical will be ranked based on the codes('float64') unordered categorical will be ranked as objects/strings. This ties in with your comment on pandas/core/categoricals as well. What you showed there is
if self.ordered:
values = self.codes
mask = values == -1
if mask.any():
values = values.astype('float64')
values[mask] = np.nan
else:
values = np.array(values) # values not defined inside np.array
values is not defined in the else clause (inside np.array) so i assume it one of self.codes or self.
In the case that you pass self.codes the ranking will not be consistent with the current functionality. In the case you pass self, a single ranking function will not be able to rank ordered/unordered categoricals. Please let me know if i'm missing something obvious here.
Edit:
A simple example to test out how unordered categoricals are handled currently
a = pd.Categorical(['first', 'second', 'third', 'fourth', np.NaN], ['first', 'second', 'third', 'fourth'], ordered=False)
b = pd.Series(a)
c = pd.Series(a.codes)
all(c.rank() == b.rank())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that else should be values = np.array(self)
, which will order by the actual objects whether integers, objects or whatever. Its NOT correct to astype(object
, rather to use the natural object ordering which is dependent on the dtype (but that is already handled correctly). The key point is that if we are ordered we return a float array based on the codes, else return self.
pandas/core/categorical.py
Outdated
if self._ordered: | ||
na_mask = (values == -1) | ||
values[na_mask] = np.nan | ||
return values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls do it like I showed. this MUST be in here and not in external scope (e.g. algos)
Sorry for the delay. Got caught up with work. _values_for_rank works as asked. Hopefully this solves the problem in an acceptable way. |
@ikilledthecat something went wrong with a rebase I think. Can you rebase again against latest master and push again?
|
9c8b2b8
to
3ba4e3a
Compare
rebased against latest master |
pandas/core/algorithms.py
Outdated
@@ -988,7 +988,12 @@ def _get_data_algo(values, func_map): | |||
elif is_unsigned_integer_dtype(values): | |||
f = func_map['uint64'] | |||
values = _ensure_uint64(values) | |||
|
|||
elif is_categorical_dtype(values): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you move this to before the first if, you can do:
if is_categorical_dtype(values):
values = values._values_for_rank()
if is_float_dtype......
pandas/tests/test_categorical.py
Outdated
@@ -4549,7 +4549,6 @@ def test_concat_categorical(self): | |||
'h': [None] * 6 + cat_values}) | |||
tm.assert_frame_equal(res, exp) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
believe it not I bet this causes a flake error
@ikilledthecat 2 small comments. ping on all green. |
@jreback changes done!! thanks for the help and patience! |
values = values.astype('float64') | ||
values[mask] = np.nan | ||
else: | ||
values = np.array(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as I said before, I think it can be faster to actually reorder the categories (so rank can use the integer/float codes) instead of passing an object array to rank (or at least check whether the categories are sorted, and in such case also pass the codes).
But given that this is also the current situation, it's not a blocker for this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're interested, that can in a follow-up PR (to get this one merged)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, @jorisvandenbossche didn't see that comment. yes could be an easy followup, something like.
In [16]: res = Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()
In [17]: res2 = s.rank()
In [18]: res.equals(res2)
Out[18]: True
In [19]: %timeit Series(np.array(s.cat.rename_categories(Series(s.cat.categories).rank()))).rank()
100 loops, best of 3: 4.39 ms per loop
In [20]: %timeit s.rank()
10 loops, best of 3: 132 ms per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche sure. Will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the issue #15498 for this
thanks @ikilledthecat note that for the perf issues, might have to slightly reorg the code (meaning that |
check for categorical, and then pass the underlying integer codes. closes pandas-dev#15420 Author: Prasanjit Prakash <[email protected]> Closes pandas-dev#15422 from ikilledthecat/rank_categorical and squashes the following commits: a7e573b [Prasanjit Prakash] moved test for categorical, in rank, to top 3ba4e3a [Prasanjit Prakash] corrections after rebasing c43a029 [Prasanjit Prakash] using if/else construct to pick sorting function for categoricals f8ec019 [Prasanjit Prakash] ask Categorical for ranking function 40d88c1 [Prasanjit Prakash] return values for rank from categorical object 049c0fc [Prasanjit Prakash] GH#15420 added support for na_option when ranking categorical 5e5bbeb [Prasanjit Prakash] BUG: GH#15420 rank for categoricals ef999c3 [Prasanjit Prakash] merged with upstream master fbaba1b [Prasanjit Prakash] return values for rank from categorical object fa0b4c2 [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical 9a6b5cd [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical 4220e56 [Prasanjit Prakash] BUG: GH15420 - _rank private method on Categorical 6b70921 [Prasanjit Prakash] GH#15420 move rank inside categoricals bf4e36c [Prasanjit Prakash] GH#15420 added support for na_option when ranking categorical ce90207 [Prasanjit Prakash] BUG: GH#15420 rank for categoricals 85b267a [Prasanjit Prakash] Added support for categorical datatype in rank - issue#15420
check for categorical, and then pass the underlying integer codes.
git diff upstream/master | flake8 --diff