Skip to content

PERF: make Categorical.searchsorted not require ordered=True #21667

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
topper-123 opened this issue Jun 28, 2018 · 4 comments · Fixed by #21686
Closed

PERF: make Categorical.searchsorted not require ordered=True #21667

topper-123 opened this issue Jun 28, 2018 · 4 comments · Fixed by #21686
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance

Comments

@topper-123
Copy link
Contributor

searchsorted requires that the searched object is (monotonically) sorted to produce correct results. Orderedness is neither a necessary or sufficient condition to make searchsorted work correctly.

Categorical.searchsorted has a hard check for if the Categorical is ordered:

def searchsorted(self, value, side='left', sorter=None):
if not self.ordered:
raise ValueError("Categorical not ordered\nyou can use "
".as_ordered() to change the Categorical to an "
"ordered one")
from pandas.core.series import Series

This is too strict, as unordered but sorted Categoricals could also benefit from using searchsorted.

I propose removing this check and (like for non-categoricals) let the user have the responsibility to ensure that the Categorical is sorted correctly. This would allow very quick lookups in all sorted Categoricals, whether they're ordered or not.

@chris-b1
Copy link
Contributor

Since the underlying binary search is actually being done on the codes, I wonder if the thought may have have been preventing surprising behavior if code ordering isn't what you expected, e.g. you ended up with something like this that 'feels' sorted, but isn't. At least with as_ordered() it becomes more clear the data is not sorted?

c = pd.Categorical(['a', 'a', 'b', 'b', 'c'], categories=pd.Index(['b', 'a', 'c']))

c
Out[140]: 
[a, a, b, b, c]
Categories (3, object): [b, a, c]

c.as_ordered()
Out[142]: 
[a, a, b, b, c]
Categories (3, object): [b < a < c]

Not necessarily opposed to the change though.

@topper-123
Copy link
Contributor Author

topper-123 commented Jun 28, 2018

Yeah, but the same could be said about sort_values

>>> c.sort_values()
[b, b, a, a, c]
Categories (3, object): [b, a, c]

sort_values does it this way for performance reasons and searchsorted should too, as searchsorted and sort_values are cousins of sorts (a sorted array is required to use searchsorted, so logically they should follow the same rules wrt. orderedness - if you can sort an unordered array, it should also be possible to use searchsorted on it).

@gfyoung gfyoung added Categorical Categorical Data Type Performance Memory or execution speed performance labels Jun 28, 2018
@jorisvandenbossche
Copy link
Member

+1 to change this. searchsorted requires the values to be sorted, but ordered=True is not necessary to sort a Categorical, so IMO also not needed for searchsorted

@topper-123
Copy link
Contributor Author

>>> c = pd.Categorical(['a', 'a', 'b', 'b', 'c'], categories=pd.Index(['b', 'a', 'c']))
>>> s1 = pd.Series(c)
>>> s1.is_monotonic_increasing
False
>>> s2 = s1.sort_values()
>>> s2.is_monotonic_increasing
True

i.e. is_monotonic_increasing works on codes, like sort_values. Seems logical to standardize searchsorted to work on codes also.

@jreback jreback added this to the Contributions Welcome milestone Jun 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants