-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API: remove ordered requirement in Categorical.searchsorted #21686
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
05aabc2
to
97040ca
Compare
Codecov Report
@@ Coverage Diff @@
## master #21686 +/- ##
=========================================
Coverage ? 91.77%
=========================================
Files ? 174
Lines ? 50704
Branches ? 0
=========================================
Hits ? 46535
Misses ? 4169
Partials ? 0
Continue to review full report at Codecov.
|
@topper-123 needs rebase |
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -220,7 +220,8 @@ Bug Fixes | |||
Categorical | |||
^^^^^^^^^^^ | |||
|
|||
- | |||
- :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the standard practice is to keep this as one long line
pandas/core/base.py
Outdated
@@ -1163,6 +1163,16 @@ def factorize(self, sort=False, na_sentinel=-1): | |||
corresponding elements in `value` were inserted before the indices, | |||
the order of `self` would be preserved. | |||
|
|||
.. note:: | |||
|
|||
The %(klass)s *must* be monotonically sorted, else wrong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“Else wrong” is awkward wording
categories=['cheese', 'milk', 'apple', 'bread'], | ||
ordered=False) | ||
s2 = Series(c2) | ||
c = Categorical(['cheese', 'milk', 'apple', 'bread', 'bread'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind calling these cat and ser or something like that. 1-letter names make grepping tough
97040ca
to
e553898
Compare
this needs evaluation |
I discuss some of my points in favor of this in #21667: In >>> c = pd.Categorical(['a', 'a', 'b', 'b', 'c'], categories=pd.Index(['b', 'a', 'c']))
>>> c.sort_values()
[b, b, a, a, c]
Categories (3, object): [b, a, c] We can perhaps continue the discussion in #21667. |
I agree with @topper-123: it is indeed easy to get wrong results with searchsorted, but that is a general issue with that function, and not necessarily specific to |
|
Yes, and this sorting will mean that the codes are sorted, so I don't understand why the above is a reason to not do this? |
I think its pretty unituitive to sort by codes in an unordered categorical, we should actually disallow e.g. we should raise just like we do for min/max.
|
Being able to sort is very useful, we discussed this before, so I am -1 on removing that ability (but if you want further discuss this, I would open a new issue). Note that, in the majority of the use cases for unordered categoricals, sorting by the codes or sorting lexicographically will be identical, exactly because we do sort the categories lexicographically on creation. |
I agree with @jorisvandenbossche. IMO It is beneficial to be able find locations of groups of values unordered values, so To do the above, unordered categoricals need to have |
Looks like this stalled. In the meantime, this could use a rebase.. |
Closing as stale. Ping if you'd like to continue |
I would actually very much like to continue this, but was under the impression it wasn't going to get in. If you'll open it, I'll push a rebase. |
will have another look |
e553898
to
b71e252
Compare
b71e252
to
21e0056
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you explain why you think its a good idea to enable this. This looks like it is just cause for trouble.
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -0,0 +1,731 @@ | |||
.. _whatsnew_0240: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this file is included somehow, pls revert
doc/source/whatsnew/v0.25.0.rst
Outdated
@@ -127,7 +127,7 @@ Bug Fixes | |||
Categorical | |||
^^^^^^^^^^^ | |||
|
|||
- | |||
- :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` now work on unordered categoricals also (:issue:`21667`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also -> as well
pandas/core/arrays/categorical.py
Outdated
codes = codes[0] if is_scalar(value) else codes | ||
|
||
return self.codes.searchsorted(codes, side=side, sorter=sorter) | ||
return algorithms.searchsorted(self._codes, value_codes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't have a guarantee that these codes are sorted right? so this seems likely to return an incorrect result if that is not the case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added tests for this test_searchsorted_unordered
.
exp = np.array([3, 5], dtype=np.intp) | ||
tm.assert_numpy_array_equal(res_cat, exp) | ||
tm.assert_numpy_array_equal(res_ser, exp) | ||
|
||
# Searching for a single value that is not from the Categorical | ||
pytest.raises(KeyError, lambda: c1.searchsorted('cucumber')) | ||
pytest.raises(KeyError, lambda: s1.searchsorted('cucumber')) | ||
with pytest.raises(KeyError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add match= (if that is appropriate)
categories=['cheese', 'milk', 'apple', 'bread'], | ||
ordered=False) | ||
s2 = Series(c2) | ||
cat = Categorical(['cheese', 'milk', 'apple', 'bread', 'bread'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about a test of the failing case, meaning an unsorted categorical
if you can merge master and remove the 0.24.0 docs. pls ping. |
c0fce18
to
9aca098
Compare
@topper-123 can you merge master and we'll have another look see |
e4931c3
to
9ab6286
Compare
I've rebased. |
@topper-123 can you rebase |
@topper-123 still active? |
b143301
to
c7ebf9f
Compare
Yeah, I've rebased. The PR is quite old, so fixed some strange git issues by just merging the commits into one. |
Taking another look at this again. Read through history but I think I am also on the side of not doing this. Understood on the point about consistency with other aspects of the API but I personally find this confusing, and a doc note isn't a great guard against that. So I think @jreback and I are -1 here, @topper-123 and @jorisvandenbossche are +1. Still accurate? This is oldest PR in queue so would like to decide how to move forward so it isn't in limbo forever |
@WillAyd can you explain what exactly you find confusing? The only thing with Categoricals is that it can be a bit harder to know if your values are sorted or not. But, you can sort an unordered categorical with |
Agree with everything @jorisvandenbossche said. I would like to add that |
I find what is quoted in the docstring to be confusing: If the values are not monotonically sorted, wrong locations
may be returned:
>>> x = pd.Series([2, 1, 3])
>>> x.searchsorted(1)
0 # wrong result, correct would be 1 I understand that this affects non-Categorical right now, but IMO this should be raising for non-monotonic instead of returning surprising results. I don't think expanding that coverage to non-sorted Categorical moves us in the right direction |
Then maybe we could split those two issues? Let's open a separate issue to discuss raising for non-monotonic in general, and move forward with this PR making searchsorted available consistently across dtypes? |
Any more thoughts here? |
I guess this is fine, though IMHO this is scope creep for ordered categoricals. by-definition unordered ones are not ordered (well they have an order, but it is not exposed to the users per se). that being said, I guess for perf reasons this is desireable. |
c7ebf9f
to
a21dc4d
Compare
I've rebased. |
Thanks @topper-123 for the PR and @jorisvandenbossche for input |
git diff upstream/master -u -- "*.py" | flake8 --diff
searchsorted
is mainly a tool to find value locations in a sorted array. As such it doesn't make sense to restrict it to not work with unordered categoricals. For example, doing searchsorted on ordered, but unsorted arrays gives wrong result, so orderedness is a relatively irrelevant thing to check for in searchsorted.The current solution to first convert to a ordered Categorical is inelegant and requires setting up different code paths depending on the situation:
I've also added to the doc string of searchsorted to make clear that sortedness is required for a correct result (this is not an issue that is limited to categoricals, but all Arrays)