Skip to content

API: remove ordered requirement in Categorical.searchsorted #21686

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 2, 2019

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Jun 29, 2018

searchsorted is mainly a tool to find value locations in a sorted array. As such it doesn't make sense to restrict it to not work with unordered categoricals. For example, doing searchsorted on ordered, but unsorted arrays gives wrong result, so orderedness is a relatively irrelevant thing to check for in searchsorted.

The current solution to first convert to a ordered Categorical is inelegant and requires setting up different code paths depending on the situation:

if categorical.ordered:
    val = c.searchsorted(key)
else:
    val = c.as_ordered().searchsorted(key)

I've also added to the doc string of searchsorted to make clear that sortedness is required for a correct result (this is not an issue that is limited to categoricals, but all Arrays)

@topper-123 topper-123 force-pushed the Categorical.searchsorted branch 2 times, most recently from 05aabc2 to 97040ca Compare June 29, 2018 19:32
@codecov
Copy link

codecov bot commented Jun 29, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@bbfd4ba). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #21686   +/-   ##
=========================================
  Coverage          ?   91.77%           
=========================================
  Files             ?      174           
  Lines             ?    50704           
  Branches          ?        0           
=========================================
  Hits              ?    46535           
  Misses            ?     4169           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.37% <100%> (?)
#single 41.81% <0%> (?)
Impacted Files Coverage Δ
pandas/core/base.py 98.19% <ø> (ø)
pandas/core/arrays/categorical.py 95.91% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bbfd4ba...a21dc4d. Read the comment docs.

@jreback jreback added Categorical Categorical Data Type API Design labels Jul 2, 2018
@jbrockmendel
Copy link
Member

@topper-123 needs rebase

@@ -220,7 +220,8 @@ Bug Fixes
Categorical
^^^^^^^^^^^

-
- :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the standard practice is to keep this as one long line

@@ -1163,6 +1163,16 @@ def factorize(self, sort=False, na_sentinel=-1):
corresponding elements in `value` were inserted before the indices,
the order of `self` would be preserved.

.. note::

The %(klass)s *must* be monotonically sorted, else wrong
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“Else wrong” is awkward wording

categories=['cheese', 'milk', 'apple', 'bread'],
ordered=False)
s2 = Series(c2)
c = Categorical(['cheese', 'milk', 'apple', 'bread', 'bread'],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind calling these cat and ser or something like that. 1-letter names make grepping tough

@topper-123 topper-123 force-pushed the Categorical.searchsorted branch from 97040ca to e553898 Compare August 18, 2018 23:17
@jreback
Copy link
Contributor

jreback commented Aug 19, 2018

this needs evaluation
I don’t think we want to allow this
you can easily get incorrect results

@topper-123
Copy link
Contributor Author

I discuss some of my points in favor of this in #21667:

In searchsorted its already not possible to guard against incorrect result. Also letting this into the code base (.searchsorted on .codes) would just mimic existing sort_values behaviour (sort on .code):

>>> c = pd.Categorical(['a', 'a', 'b', 'b', 'c'], categories=pd.Index(['b', 'a', 'c']))
>>> c.sort_values()
[b, b, a, a, c]
Categories (3, object): [b, a, c]

We can perhaps continue the discussion in #21667.

@jorisvandenbossche
Copy link
Member

I agree with @topper-123: it is indeed easy to get wrong results with searchsorted, but that is a general issue with that function, and not necessarily specific to Categorical.searchsorted.
searchsorted requires the values to be sorted, but ordered=True is not necessary to sort a Categorical, so IMO should also not be needed for searchsorted

@jreback
Copy link
Contributor

jreback commented Aug 20, 2018

.sort_values() is a red-herring here, we are simply lexographically sorting which is expected. Its too easy to make mistakes with .searchsorted here. A doc warning is not great. I would leave this like it is.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 20, 2018

.sort_values() is a red-herring here, we are simply lexographically sorting which is expected.

Yes, and this sorting will mean that the codes are sorted, so I don't understand why the above is a reason to not do this?

@jreback
Copy link
Contributor

jreback commented Aug 22, 2018

Yes, and this sorting will mean that the codes are sorted, so I don't understand why the above is a reason to not do this?

I think its pretty unituitive to sort by codes in an unordered categorical, we should actually disallow .sort_values() as well (or maybe allow it but sort by the lexographic categories themselves).

e.g. we should raise just like we do for min/max.

In [1]: c = pd.Categorical(list('AABCD'))

In [2]: c.min()
TypeError: Categorical is not ordered for operation min
you can use .as_ordered() to change the Categorical to an ordered one


In [3]: c.sort_values()
Out[3]: 
[A, A, B, C, D]
Categories (4, object): [A, B, C, D]

@jorisvandenbossche
Copy link
Member

Being able to sort is very useful, we discussed this before, so I am -1 on removing that ability (but if you want further discuss this, I would open a new issue).

Note that, in the majority of the use cases for unordered categoricals, sorting by the codes or sorting lexicographically will be identical, exactly because we do sort the categories lexicographically on creation.
So only when passing the categories manually and not setting ordered=True, this might be different.

@topper-123
Copy link
Contributor Author

I agree with @jorisvandenbossche.

IMO It is beneficial to be able find locations of groups of values unordered values, so lhs, rhs = c.searchssorted(val1, side='left'), c.searchsorted(val1, side='right') should be possible, also for unordered categoricals, because this the fastest way to look up locations (much faster than c[c==val1].

To do the above, unordered categoricals need to have sort_values, is_monotonic* and searchsorted...

@mroeschke
Copy link
Member

Looks like this stalled. In the meantime, this could use a rebase..

@WillAyd
Copy link
Member

WillAyd commented Feb 27, 2019

Closing as stale. Ping if you'd like to continue

@WillAyd WillAyd closed this Feb 27, 2019
@topper-123
Copy link
Contributor Author

I would actually very much like to continue this, but was under the impression it wasn't going to get in. If you'll open it, I'll push a rebase.

@jreback
Copy link
Contributor

jreback commented Feb 28, 2019

will have another look

@jreback jreback reopened this Feb 28, 2019
@topper-123 topper-123 force-pushed the Categorical.searchsorted branch from e553898 to b71e252 Compare March 2, 2019 09:24
@topper-123 topper-123 changed the title PERF/API: remove ordered requirement in Categorical.searchsorted API: remove ordered requirement in Categorical.searchsorted Mar 2, 2019
@topper-123 topper-123 force-pushed the Categorical.searchsorted branch from b71e252 to 21e0056 Compare March 2, 2019 22:14
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why you think its a good idea to enable this. This looks like it is just cause for trouble.

@@ -0,0 +1,731 @@
.. _whatsnew_0240:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file is included somehow, pls revert

@@ -127,7 +127,7 @@ Bug Fixes
Categorical
^^^^^^^^^^^

-
- :meth:`Categorical.searchsorted` and :meth:`CategoricalIndex.searchsorted` now work on unordered categoricals also (:issue:`21667`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also -> as well

codes = codes[0] if is_scalar(value) else codes

return self.codes.searchsorted(codes, side=side, sorter=sorter)
return algorithms.searchsorted(self._codes, value_codes,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't have a guarantee that these codes are sorted right? so this seems likely to return an incorrect result if that is not the case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added tests for this test_searchsorted_unordered.

exp = np.array([3, 5], dtype=np.intp)
tm.assert_numpy_array_equal(res_cat, exp)
tm.assert_numpy_array_equal(res_ser, exp)

# Searching for a single value that is not from the Categorical
pytest.raises(KeyError, lambda: c1.searchsorted('cucumber'))
pytest.raises(KeyError, lambda: s1.searchsorted('cucumber'))
with pytest.raises(KeyError):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add match= (if that is appropriate)

categories=['cheese', 'milk', 'apple', 'bread'],
ordered=False)
s2 = Series(c2)
cat = Categorical(['cheese', 'milk', 'apple', 'bread', 'bread'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about a test of the failing case, meaning an unsorted categorical

@jreback
Copy link
Contributor

jreback commented Mar 26, 2019

if you can merge master and remove the 0.24.0 docs. pls ping.

@topper-123 topper-123 force-pushed the Categorical.searchsorted branch 2 times, most recently from c0fce18 to 9aca098 Compare April 1, 2019 17:39
@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

@topper-123 can you merge master and we'll have another look see

@topper-123 topper-123 force-pushed the Categorical.searchsorted branch 2 times, most recently from e4931c3 to 9ab6286 Compare June 8, 2019 21:53
@topper-123
Copy link
Contributor Author

I've rebased.

@jbrockmendel
Copy link
Member

@topper-123 can you rebase

@WillAyd
Copy link
Member

WillAyd commented Sep 13, 2019

@topper-123 still active?

@topper-123 topper-123 force-pushed the Categorical.searchsorted branch 4 times, most recently from b143301 to c7ebf9f Compare September 16, 2019 21:11
@topper-123
Copy link
Contributor Author

Yeah, I've rebased. The PR is quite old, so fixed some strange git issues by just merging the commits into one.

@WillAyd
Copy link
Member

WillAyd commented Sep 20, 2019

Taking another look at this again. Read through history but I think I am also on the side of not doing this. Understood on the point about consistency with other aspects of the API but I personally find this confusing, and a doc note isn't a great guard against that.

So I think @jreback and I are -1 here, @topper-123 and @jorisvandenbossche are +1. Still accurate?

This is oldest PR in queue so would like to decide how to move forward so it isn't in limbo forever

@jorisvandenbossche
Copy link
Member

@WillAyd can you explain what exactly you find confusing?
Note that the doc note is added to the generic searchsorted docstring, as this point is true for all dtypes (you always need to ensure yourself that the values are sorted when calling this method). There is nothing categorical specific about this gotcha with searchsorted.

The only thing with Categoricals is that it can be a bit harder to know if your values are sorted or not. But, you can sort an unordered categorical with sort_values, you can check if they sorted with is_monotonic.
And for searchsorted to work, the only requirement is sortedness, which an unordered categorical still has.

@topper-123
Copy link
Contributor Author

Agree with everything @jorisvandenbossche said.

I would like to add that searchsorted is used to get fast lookups, so for performance sensitive work, having searchsorted is necessary.

@WillAyd
Copy link
Member

WillAyd commented Sep 20, 2019

I find what is quoted in the docstring to be confusing:

If the values are not monotonically sorted, wrong locations
   may be returned:

>>> x = pd.Series([2, 1, 3])
>>> x.searchsorted(1)
0  # wrong result, correct would be 1

I understand that this affects non-Categorical right now, but IMO this should be raising for non-monotonic instead of returning surprising results. I don't think expanding that coverage to non-sorted Categorical moves us in the right direction

@jorisvandenbossche
Copy link
Member

I understand that this affects non-Categorical right now, but IMO this should be raising for non-monotonic instead of returning surprising results

Then maybe we could split those two issues? Let's open a separate issue to discuss raising for non-monotonic in general, and move forward with this PR making searchsorted available consistently across dtypes?

@jorisvandenbossche
Copy link
Member

Any more thoughts here?

@jreback
Copy link
Contributor

jreback commented Oct 1, 2019

I guess this is fine, though IMHO this is scope creep for ordered categoricals. by-definition unordered ones are not ordered (well they have an order, but it is not exposed to the users per se). that being said, I guess for perf reasons this is desireable.

@topper-123 topper-123 force-pushed the Categorical.searchsorted branch from c7ebf9f to a21dc4d Compare October 1, 2019 18:42
@topper-123
Copy link
Contributor Author

I've rebased.

@WillAyd WillAyd added this to the 1.0 milestone Oct 2, 2019
@WillAyd WillAyd merged commit df10e3f into pandas-dev:master Oct 2, 2019
@WillAyd
Copy link
Member

WillAyd commented Oct 2, 2019

Thanks @topper-123 for the PR and @jorisvandenbossche for input

@topper-123 topper-123 deleted the Categorical.searchsorted branch October 2, 2019 21:21
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: make Categorical.searchsorted not require ordered=True
6 participants