-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: make Categorical.searchsorted not require ordered=True #21667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Since the underlying binary search is actually being done on the codes, I wonder if the thought may have have been preventing surprising behavior if code ordering isn't what you expected, e.g. you ended up with something like this that 'feels' sorted, but isn't. At least with c = pd.Categorical(['a', 'a', 'b', 'b', 'c'], categories=pd.Index(['b', 'a', 'c']))
c
Out[140]:
[a, a, b, b, c]
Categories (3, object): [b, a, c]
c.as_ordered()
Out[142]:
[a, a, b, b, c]
Categories (3, object): [b < a < c] Not necessarily opposed to the change though. |
Yeah, but the same could be said about >>> c.sort_values()
[b, b, a, a, c]
Categories (3, object): [b, a, c] sort_values does it this way for performance reasons and searchsorted should too, as searchsorted and sort_values are cousins of sorts (a sorted array is required to use searchsorted, so logically they should follow the same rules wrt. orderedness - if you can sort an unordered array, it should also be possible to use searchsorted on it). |
+1 to change this. |
>>> c = pd.Categorical(['a', 'a', 'b', 'b', 'c'], categories=pd.Index(['b', 'a', 'c']))
>>> s1 = pd.Series(c)
>>> s1.is_monotonic_increasing
False
>>> s2 = s1.sort_values()
>>> s2.is_monotonic_increasing
True i.e. |
searchsorted
requires that the searched object is (monotonically) sorted to produce correct results. Orderedness is neither a necessary or sufficient condition to make searchsorted work correctly.Categorical.searchsorted
has a hard check for if the Categorical is ordered:pandas/pandas/core/arrays/categorical.py
Lines 1382 to 1388 in e0f978d
This is too strict, as unordered but sorted Categoricals could also benefit from using searchsorted.
I propose removing this check and (like for non-categoricals) let the user have the responsibility to ensure that the Categorical is sorted correctly. This would allow very quick lookups in all sorted Categoricals, whether they're ordered or not.
The text was updated successfully, but these errors were encountered: