Categorical.searchsorted() uses lexical order instead of the provided categorical order #14522

Closed
golobor opened this issue Oct 27, 2016 · 4 comments
Labels: Bug, Categorical, Indexing
Milestone

Comments

@golobor
golobor commented Oct 27, 2016

Hi,
it seems that the searchsorted() method of Categorical series does not take into account the specific order of the categories.

A slightly modified example from the API documentation:

import pandas as pd

x = pd.Categorical(
    ['cheese', 'apple', 'bread', 'bread',  'milk'],
    categories=['cheese', 'milk', 'apple', 'bread'],
    ordered=True)
print('Unsorted:', x)
x_sort = x.sort_values()
print('Sorted:', x_sort)
print('Searchsorted apple:', x_sort.searchsorted('apple'))
print('Searchsorted milk:', x_sort.searchsorted('milk'))

Output

Unsorted: [cheese, apple, bread, bread, milk]
Categories (4, object): [cheese < milk < apple < bread]
Sorted: [cheese, milk, apple, bread, bread]
Categories (4, object): [cheese < milk < apple < bread]
Searchsorted apple: [0]
Searchsorted milk: [5]

As you can see, "apple" is inserted at the beginning (because it starts with an 'a') and "milk" is inserted at the end, even though both categories are ordered between "cheese" and "bread".
Unfortunately, the API documentation does not specify whether this is the expected behavior, but it seems inconsistent with the fact that .sort_values() does use the categorical order.
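
To make the difference between the two orderings concrete, here is a minimal pure-Python sketch (it does not use pandas and makes no claim about how pandas implements searchsorted). The helper names `code_of`, `sorted_codes` and `searchsorted_by_category` are made up for this illustration. A binary search that compares the raw strings happens to give the same 0 and 5 as the report, while a binary search over each value's position in the declared category order gives the indices one would expect.

```python
from bisect import bisect_left

# Declared category order and the values already sorted by that order,
# copied from the example in the report.
categories = ['cheese', 'milk', 'apple', 'bread']
sorted_values = ['cheese', 'milk', 'apple', 'bread', 'bread']

# Binary search comparing the strings themselves (lexical order).
# The list is not sorted lexically, so the answers are not meaningful.
lexical_apple = bisect_left(sorted_values, 'apple')
lexical_milk = bisect_left(sorted_values, 'milk')

# Hypothetical helper data: map each category to its position ("code")
# in the declared order, then search over the codes instead.
code_of = {cat: i for i, cat in enumerate(categories)}
sorted_codes = [code_of[v] for v in sorted_values]  # [0, 1, 2, 3, 3]


def searchsorted_by_category(value):
    """Leftmost insertion index of `value` under the declared order."""
    return bisect_left(sorted_codes, code_of[value])


print('lexical apple:', lexical_apple)                          # 0
print('lexical milk:', lexical_milk)                            # 5
print('by category apple:', searchsorted_by_category('apple'))  # 2
print('by category milk:', searchsorted_by_category('milk'))    # 1
```

With the category-aware search, "apple" lands at index 2 and "milk" at index 1, both between "cheese" and "bread".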

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0
nose: None
pip: 8.1.2
setuptools: 25.1.6
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.0
statsmodels: 0.6.1
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.1
matplotlib: 1.5.1
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: None
lxml: None
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

@jreback
Contributor

jreback commented Oct 28, 2016

yeah this may not be respecting ordered. a PR to fix would be welcome.

@jreback added the Bug, Indexing, Categorical, and Difficulty Intermediate labels Oct 28, 2016
@jreback jreback added this to the Next Major Release milestone Oct 28, 2016
chiangqiqi added a commit to chiangqiqi/pandas that referenced this issue Oct 31, 2016
@chiangqiqi
chiangqiqi commented Nov 2, 2016

This searchsorted seems to have more problems than just lexical order.

In [35]: c1
Out[35]:
[apple, bread, bread, cheese, milk]
Categories (4, object): [apple < bread < cheese < milk]

In [36]: c1.searchsorted(["eggs"])
Out[36]: array([4])

In [37]: c1.searchsorted(["milk"])
Out[37]: array([4])

Is this result right?

@nathalier
Contributor

I'd like to work on it.
What behavior is expected if some of the values are not in the categories?
Should:
a) an exception be raised,
b) -1 be returned for such values, or
c) the leftmost or rightmost index be returned without any notification?

Option b) looks to be the most convenient. What do you think?

@jorisvandenbossche
Member

I would say that raising an Exception is the more logical thing to do, as you can never insert such a non-category in the categorical (this would also raise an error).
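
As a rough illustration of option a), the sketch below searches by category position and rejects any value that is not a declared category. It is pure Python rather than pandas code. The function name `searchsorted_strict` and the choice of `ValueError` are assumptions made for this example; the thread only says that an exception should be raised.

```python
from bisect import bisect_left


def searchsorted_strict(sorted_values, categories, value):
    """Leftmost insertion index of `value` under the declared order.

    Hypothetical helper: raises ValueError (an assumed exception type)
    when `value` is not one of the declared categories.
    """
    code_of = {cat: i for i, cat in enumerate(categories)}
    if value not in code_of:
        raise ValueError('%r is not one of the categories' % (value,))
    codes = [code_of[v] for v in sorted_values]
    return bisect_left(codes, code_of[value])


# Data from the earlier comment in this thread.
categories = ['apple', 'bread', 'cheese', 'milk']
c1 = ['apple', 'bread', 'bread', 'cheese', 'milk']

print(searchsorted_strict(c1, categories, 'milk'))  # 4

try:
    searchsorted_strict(c1, categories, 'eggs')
except ValueError as exc:
    print('rejected:', exc)
```

Under this behavior, 4 is a sensible leftmost index for "milk" in the earlier example, while "eggs" is rejected instead of silently being given an index.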

nathalier added a commit to nathalier/pandas that referenced this issue Nov 20, 2016
Previously, it used lexical order instead of the provided categorical
order.

Tests updated accordingly.

Closes pandas-dev#14522
nathalier added a commit to nathalier/pandas that referenced this issue Dec 18, 2016
@jreback jreback modified the milestones: 0.20.0, Next Major Release Dec 20, 2016
5 participants