BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019

fjetter · 2018-05-13T09:51:10Z

CategoricalIndex.searchsorted returns the wrong shape for scalar input. NumPy arrays and all other index types return a scalar if the input is a scalar, but CategoricalIndex returns an array.

For example:

>>> import numpy as np
>>> np.array([1, 2, 3]).searchsorted(1)
0
>>> np.array([1, 2, 3]).searchsorted([1])
array([0])
>>> import pandas as pd
>>> pd.Index([1, 2, 3]).searchsorted(1)
0
>>> pd.Index([1, 2, 3]).searchsorted([1])
array([0])
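For comparison, here is a minimal sketch of the same check on a CategoricalIndex. The index contents and variable names are made up for illustration. It assumes a pandas version that includes the fix (the timeline of this PR shows the change landing through #23466 under the 0.24.0 milestone). On such a version the scalar call is expected to come back zero-dimensional, like the NumPy and Index calls above.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered, sorted categorical index used only for illustration.
cat_index = pd.CategoricalIndex(list("aabbcc"), ordered=True)

# Scalar input: expected to give a scalar position on a fixed pandas version.
scalar_result = cat_index.searchsorted("b")

# List input: expected to give an array of positions.
array_result = cat_index.searchsorted(["b"])

# np.ndim is 0 for scalars and 1 for one-dimensional arrays.
print(np.ndim(scalar_result), np.ndim(array_result))
```

On a version without the fix, the first call would be expected to return a length-1 array instead, which is the behaviour this PR reports.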

This issue also affects slicing on sorted/ordered categoricals, which is why I've written another test for the slicing.

- tests added / passed
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry
- example in categoricals.rst

codecov · 2018-05-13T10:37:05Z

Codecov Report

Merging #21019 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21019      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49549    49551       +2     
==========================================
+ Hits        45512    45514       +2     
  Misses       4037     4037

Flag	Coverage Δ
#multiple	`90.25% <100%> (ø)`	⬆️
#single	`41.87% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/indexes/category.py	`97.03% <ø> (ø)`	⬆️
pandas/core/arrays/categorical.py	`95.7% <100%> (+0.01%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0c65c57...04ca52f.

TomAugspurger · 2018-05-13T11:22:34Z

doc/source/whatsnew/v0.23.0.txt

@@ -1259,6 +1259,9 @@ Indexing
 - Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
 - Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
 - Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
+- Bug in ``CategoricalIndex.searchsorted`` where the method didn't return a scalar when the input values was scalar


You can add this PR number (21019) as the referenced issue if there isn't already one.

jreback · 2018-05-13T13:20:49Z

pandas/core/indexes/category.py

-        codes = self.categories.get_loc(key)
-        if (codes == -1):
-            raise KeyError(key)
+        try:


this is redundant, KeyError is already raised by get_loc
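A minimal sketch of the point made in this comment, using a plain pd.Index as a stand-in for self.categories. The categories and the helper name lookup_code are hypothetical. Index.get_loc raises KeyError on a missing label by itself, so neither a -1 check nor a wrapping try/except is needed to produce that error.

```python
import pandas as pd

# Hypothetical stand-in for self.categories.
categories = pd.Index(["a", "b", "c"])


def lookup_code(key):
    # No explicit "codes == -1" check: get_loc raises KeyError for a missing label.
    return categories.get_loc(key)


print(lookup_code("b"))

try:
    lookup_code("z")
except KeyError as err:
    print("KeyError raised by get_loc:", err)
```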

jreback · 2018-05-13T13:25:35Z

pandas/core/arrays/categorical.py

@@ -1341,6 +1341,8 @@ def searchsorted(self, value, side='left', sorter=None):

        if -1 in values_as_codes:
            raise ValueError("Value(s) to be inserted must be in categories.")
+        if is_scalar(value):


would rather do this in pandas/core/base.py/searchsorted

use is_scalar rather than a numpy function

fjetter · May 13, 2018

The issue is rather with the helper function _get_codes_for_values, which always returns an array. I didn't want to change it there, since the way it is written right now only works for array-like objects. In base.py we're already calling searchsorted directly on the numpy array, i.e. it obeys the input/output shape.

I'm using is_scalar here; is this wrong? Are you referring to np.asscalar? I couldn't find a suitable pandas function for that (other than ~ values[0]).

jreback · May 17, 2018

ok I c, change here is ok
don't use np.asscalar, rather use .item()
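The shape handling discussed in this thread can be sketched as follows. This is an illustration under assumptions, not the code from the PR: searchsorted_sketch and its arguments are hypothetical, and the codes array is built by hand instead of coming from _get_codes_for_values. It only shows the pattern of looking up an always-array set of codes and collapsing the result with .item() when the caller passed a scalar.

```python
import numpy as np
from pandas.api.types import is_scalar


def searchsorted_sketch(sorted_codes, value_codes, value, side="left"):
    """Hypothetical helper: value_codes is always an array, value is the caller's input."""
    positions = sorted_codes.searchsorted(value_codes, side=side)
    if is_scalar(value):
        # Collapse the length-1 result back to a scalar with .item().
        positions = positions.item()
    return positions


# Hand-written codes for the values a, a, b, b, c, c.
codes = np.array([0, 0, 1, 1, 2, 2], dtype=np.int8)
code_for_b = np.array([1], dtype=np.int8)

print(searchsorted_sketch(codes, code_for_b, "b"))    # 2
print(searchsorted_sketch(codes, code_for_b, ["b"]))  # [2]
```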

topper-123 · Jul 23, 2018

As I found out in #21699, numpy.searchsorted doesn't like Python ints, but needs NumPy ints to achieve its speed.

>>> n = 1_000_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n), ordered=True)
>>> %timeit c.codes.searchsorted(1)  # python int
7 ms ± 24.7 µs per loop
>>> c.codes.dtype
int8
>>> %timeit c.codes.searchsorted(np.int8(1))
2.46 µs ± 82.4 ns per loop

So the scalar version should be values_as_codes = values_as_codes[0] to avoid speed loss.
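A small sketch of the difference between the two ways of unwrapping a length-1 codes array, with hand-written arrays standing in for the real ones. Indexing with [0] keeps the NumPy integer type, while .item() converts to a Python int. Both give the same position, so the choice only matters for the speed effect reported in the timings above, which this sketch does not measure.

```python
import numpy as np

# Hypothetical length-1 codes array, as produced for a scalar input.
values_as_codes = np.array([1], dtype=np.int8)

kept = values_as_codes[0]           # stays a NumPy int8 scalar
unwrapped = values_as_codes.item()  # becomes a plain Python int

print(type(kept).__name__, type(unwrapped).__name__)  # int8 int

# Both forms locate the same position in a sorted int8 codes array.
codes = np.array([0, 0, 1, 1, 2, 2], dtype=np.int8)
print(codes.searchsorted(kept), codes.searchsorted(unwrapped))  # 2 2
```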

jreback · 2018-05-13T13:28:02Z

pandas/tests/categorical/test_analytics.py

-        exp = np.array([2], dtype=np.intp)
-        tm.assert_numpy_array_equal(res_cat, exp)
-        tm.assert_numpy_array_equal(res_ser, exp)
+        exp = np.int64(2)


hmm, odd that this doesn't fail, this should be a platform indexer (intp)
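One possible explanation, offered as an assumption and not confirmed in this thread: on a 64-bit platform np.intp and np.int64 describe the same dtype, so an expected value of np.int64(2) would not be distinguishable from a platform indexer there. The sketch below only inspects dtypes and makes no claim about how the pandas test helpers compare scalars.

```python
import numpy as np

# A scalar searchsorted call on a plain NumPy array.
position = np.array([1, 2, 3]).searchsorted(3)

# The result dtype is the platform indexer type.
print(position.dtype == np.dtype(np.intp))

# Whether intp coincides with int64 depends on the platform.
print(np.dtype(np.intp) == np.dtype(np.int64))
```

Writing the expectation with np.intp keeps the test meaningful on platforms where the two differ.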

jreback · 2018-05-13T13:29:16Z

doc/source/whatsnew/v0.23.0.txt

@@ -1259,6 +1259,9 @@ Indexing
 - Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
 - Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
 - Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
+- Bug in ``CategoricalIndex.searchsorted`` where the method didn't return a scalar when the input values was scalar (:issue:`21019`)


use the :func: syntax

didn't -> did not

jreback · 2018-05-13T13:29:55Z

does this have an associated issue? (pls do a search if not)

topper-123 · 2018-05-13T15:38:23Z

Could this be a reason for the slowness seen in #20395? (i.e. searchsorted returning the wrong type, and then pandas taking a different, slower code path...)

fjetter · 2018-05-13T15:46:47Z

@jreback The only issue I could find which seems to be related is #9748 where an open TODO is the slicing on categoricals. From what I can see slicing is still not working for integers, though.

@topper-123 I don't think these are related. You should only hit the searchsorted code path during range slicing, e.g. df[1:2]

topper-123 · 2018-05-13T16:16:01Z

@fjetter, actually that operation is/should be slicing and use searchsorted, as the index is a monotonic index, and 'b' is not unique.

I'll look into that again with this angle. I won't hijack this thread anymore.

jreback · 2018-05-17T10:23:05Z

pandas/core/arrays/categorical.py

@@ -1341,6 +1341,8 @@ def searchsorted(self, value, side='left', sorter=None):

        if -1 in values_as_codes:
            raise ValueError("Value(s) to be inserted must be in categories.")
+        if is_scalar(value):


ok I c, change here is ok
don't use np.asscalar, rather use .item()

jreback · 2018-05-17T10:26:48Z

pandas/tests/indexing/test_categorical.py

+        result = ordered_df.loc["a":"e"]
+        assert_frame_equal(result, ordered_df)
+
+        df_slice = ordered_df.loc["a":"b"]


this result looks suspect. both a and b are in the categories and it's ordered?

also, don't use label based indexers to select the expected, rather use .iloc so there is no ambiguity (IOW you are making an expected value where it is not clear what is the answer)

fjetter · 2018-05-25T09:02:54Z

I refactored the tests and hope the intention is a bit clearer now. Slicing of the categorical should behave similarly to an ordinary index (at least if it is ordered).
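A minimal sketch of the test style requested in the review, where the expected frame is selected positionally with .iloc so the answer is unambiguous. The frame below is hypothetical and is not the fixture used in the PR. It assumes a pandas version where label slicing on a monotonic, ordered CategoricalIndex is supported.

```python
import pandas as pd

# Hypothetical ordered, sorted categorical index with repeated labels.
index = pd.CategoricalIndex(list("aabbcc"), ordered=True)
ordered_df = pd.DataFrame({"value": range(6)}, index=index)

# Label slice includes both endpoints, so "a":"b" covers the first four rows.
result = ordered_df.loc["a":"b"]
expected = ordered_df.iloc[0:4]  # positional selection, no label ambiguity
pd.testing.assert_frame_equal(result, expected)

# Slicing from the left edge to the right edge returns the whole frame.
full = ordered_df.loc["a":"c"]
pd.testing.assert_frame_equal(full, ordered_df)
```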

fjetter · 2018-05-25T09:03:25Z

Tests fail because of an import error of geopandas. I have no idea what might cause this, though.

jreback

lgtm. just some comment on tests. I think it might be worthwhile to update the categorical.rst with a small example of this as well.

jreback · 2018-05-25T11:34:12Z

doc/source/whatsnew/v0.23.0.txt

@@ -1289,6 +1289,9 @@ Indexing
 - Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
 - Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
 - Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
+- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`)


move to 0.23.1

jreback · 2018-05-25T11:34:21Z

doc/source/whatsnew/v0.23.0.txt

@@ -1289,6 +1289,9 @@ Indexing
 - Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
 - Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
 - Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
+- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`)
+- Bug in :class:`CategoricalIndex` where slicing beyond the range of the data raised a KeyError (:issue:`21019`)
+


use double backticks on KeyError

i see you added to 0.23.1 below, ok, make these changes there and revert this one

jreback · 2018-05-25T11:35:33Z

pandas/tests/indexing/test_categorical.py

-        # slicing
-        # not implemented ATM
-        # GH9748
+        # Raises KeyError since the left slice 'a' is not unique


can you add this issue reference here (gh-....)

jreback · 2018-05-25T11:36:32Z

pandas/tests/indexing/test_categorical.py

-        # GH9748
+        # Raises KeyError since the left slice 'a' is not unique
+        pytest.raises(KeyError, lambda: self.df.loc["a":"b"])
+        result = self.df.loc["b":"c"]


this test is the same as the 3rd case? if so do we need both? (or if not can you move them together and comment)

jreback · 2018-05-25T11:37:31Z

pandas/tests/indexing/test_categorical.py

+        # right/left edge we should get the original slice again.
+        result = ordered_df.loc["a": "d"]
+        assert_frame_equal(result, ordered_df)
+


can you also test the left edge as well

jreback · 2018-05-25T11:38:21Z

pls rebase on master (ci failures are fixed)

jreback

can you rebase

jreback · 2018-06-19T01:47:33Z

doc/source/whatsnew/v0.23.1.txt

@@ -88,7 +88,8 @@ Indexing
 - Bug in :meth:`MultiIndex.set_names` where error raised for a ``MultiIndex`` with ``nlevels == 1`` (:issue:`21149`)
 - Bug in :class:`IntervalIndex` constructors where creating an ``IntervalIndex`` from categorical data was not fully supported (:issue:`21243`, issue:`21253`)
 - Bug in :meth:`MultiIndex.sort_index` which was not guaranteed to sort correctly with ``level=1``; this was also causing data misalignment in particular :meth:`DataFrame.stack` operations (:issue:`20994`, :issue:`20945`, :issue:`21052`)
-
+- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`)


can you move to 0.23.2

topper-123 · 2018-06-29T17:53:50Z

Hey @fjetter , want to follow this to the door?

jreback · 2018-11-01T01:43:43Z

can you merge master so we can see where this is

TomAugspurger reviewed May 13, 2018

fjetter force-pushed the bugfix/categorical_slicing branch 2 times, most recently from ff210ef to 01ade5a Compare May 13, 2018 12:01

jreback requested changes May 13, 2018

jreback added the Indexing and Categorical labels May 13, 2018

jreback requested changes May 17, 2018

fjetter force-pushed the bugfix/categorical_slicing branch from e100e74 to ec3e07c Compare May 25, 2018 06:55

jreback requested changes May 25, 2018

fjetter added 6 commits June 6, 2018 20:01

categorical: searchsorted returns a scalar if input was scalar

bd3d440

Address review comments

c4249fa

Use item instead of np.asscalar to select value

1c25d65

refactor categorical loc slicing tests

d4e9879

Changelog entry

25b5fd7

Address PR comments

04ca52f

fjetter force-pushed the bugfix/categorical_slicing branch from ec3e07c to 04ca52f Compare June 6, 2018 18:02

jreback requested changes Jun 19, 2018

topper-123 mentioned this pull request Nov 2, 2018

API: Make Categorical.searchsorted returns a scalar when supplied a scalar #23466

Merged

4 tasks

jreback added this to the 0.24.0 milestone Nov 3, 2018

jreback closed this in #23466 Nov 18, 2018
