Skip to content

BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented May 13, 2018

CategoricalIndex.searchsorted returns the wrong shape for scalar input. Numpy arrays and all other index types return a scalar if the input is a scalar, but the CategoricalIndex does not

For example

>>> import numpy as np
>>> np.array([1, 2, 3]).searchsorted(1)
0
>>> np.array([1, 2, 3]).searchsorted([1])
array([0])
>>> import pandas as pd
>>> pd.Index([1, 2, 3]).searchsorted(1)
0
>>> pd.Index([1, 2, 3]).searchsorted([1])
array([0])

This issue also affects slicing on sorted/ordered categoricals, which is why I've written another test for the slicing.

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
  • example in categoricals.rst

@codecov
Copy link

codecov bot commented May 13, 2018

Codecov Report

Merging #21019 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21019      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49549    49551       +2     
==========================================
+ Hits        45512    45514       +2     
  Misses       4037     4037
Flag Coverage Δ
#multiple 90.25% <100%> (ø) ⬆️
#single 41.87% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/category.py 97.03% <ø> (ø) ⬆️
pandas/core/arrays/categorical.py 95.7% <100%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c65c57...04ca52f. Read the comment docs.

@@ -1259,6 +1259,9 @@ Indexing
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
- Bug in ``CategoricalIndex.searchsorted`` where the method didn't return a scalar when the input values was scalar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can add this PR number (21019) as the referenced issue if there isn't already one.

@fjetter fjetter force-pushed the bugfix/categorical_slicing branch 2 times, most recently from ff210ef to 01ade5a Compare May 13, 2018 12:01
codes = self.categories.get_loc(key)
if (codes == -1):
raise KeyError(key)
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is redundant, KeyError is already raised by get_loc

@@ -1341,6 +1341,8 @@ def searchsorted(self, value, side='left', sorter=None):

if -1 in values_as_codes:
raise ValueError("Value(s) to be inserted must be in categories.")
if is_scalar(value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would rather do this in pandas/core/base.py/searchsorted

use is_scalar rather than a numpy function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • the issue is rather with the helper function _get_codes_for_values which always returns an array. I didn't want to change it there since the way it is written right now only works for array like objects. In base.py we're already calling searchsorted directly on the numpy array, i.e. it obeys the in/output shape
  • I'm using is_scalar here, is this wrong? Are you referring to the np.asscalar? I couldn't find a suitable pandas function for that (other than ~ values[0])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I c, change here is ok
don't use np.asscalar, rather use .item()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I found out in #21699, numpy.searchsorted doesn't like python ints, but needs numpy ints to archieve its speed.

>>> n = 1_000_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n), ordered=True)
>>> %timeit c.codes.searchsorted(1)  # python int
7 ms ± 24.7 µs per loop
>>> c.codes.dtype
int8
>>> %timeit c.codes.searchsorted(np.int8(1))
2.46 µs ± 82.4 ns per loop

So the scalar version should be values_as_codes = values_as_codes[0] to avoid speed loss.

exp = np.array([2], dtype=np.intp)
tm.assert_numpy_array_equal(res_cat, exp)
tm.assert_numpy_array_equal(res_ser, exp)
exp = np.int64(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, odd that this doesn't fail, this should be a platform indexer (intp)

@@ -1259,6 +1259,9 @@ Indexing
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
- Bug in ``CategoricalIndex.searchsorted`` where the method didn't return a scalar when the input values was scalar (:issue:`21019`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use the :func: syntax

didn't -> did not

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Categorical Categorical Data Type labels May 13, 2018
@jreback
Copy link
Contributor

jreback commented May 13, 2018

does this have an associated issue? (pls do a search if not)

@topper-123
Copy link
Contributor

topper-123 commented May 13, 2018

Could this be a reason the slowness seen in #20395? (i.e. searchsorted returning wrong type, and then pandas taking a different, slower, code path...)

@fjetter
Copy link
Member Author

fjetter commented May 13, 2018

@jreback The only issue I could find which seems to be related is #9748 where an open TODO is the slicing on categoricals. From what I can see slicing is still not working for integers, though.

@topper-123 I don't think these are related. You should only hit the searchsorted code path during range slicing, e.g. df[1:2]

@topper-123
Copy link
Contributor

@fjetter, actually that operation is/should be slicing and use searchsorted, as the index is a monotonic index, and 'b' is not unique.

I'll look into that again with this angle. I won't high-jack this thread anymore.

@@ -1341,6 +1341,8 @@ def searchsorted(self, value, side='left', sorter=None):

if -1 in values_as_codes:
raise ValueError("Value(s) to be inserted must be in categories.")
if is_scalar(value):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I c, change here is ok
don't use np.asscalar, rather use .item()

result = ordered_df.loc["a":"e"]
assert_frame_equal(result, ordered_df)

df_slice = ordered_df.loc["a":"b"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this result looks suspect. both a and b are in the categories and its ordered?

also, don't use label based indexers to select the expected, rather use .iloc so there is no ambiguity (IOW you are making an expected value where it is not clear what is the answer)

@fjetter fjetter force-pushed the bugfix/categorical_slicing branch from e100e74 to ec3e07c Compare May 25, 2018 06:55
@fjetter
Copy link
Member Author

fjetter commented May 25, 2018

I refactored the tests and hope the intention is a bit clearer now. Slicing of the categorical should behave similar to a ordinary index (at least if it is ordered)

@fjetter
Copy link
Member Author

fjetter commented May 25, 2018

Tests fail because of an import error of geopandas. I have no idea what might cause this, though.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. just some comment on tests. I think it might be worthwhile to update the categorical.rst with a small example of this as well.

@@ -1289,6 +1289,9 @@ Indexing
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to 0.23.1

@@ -1289,6 +1289,9 @@ Indexing
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`)
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`)
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`)
- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`)
- Bug in :class:`CategoricalIndex` where slicing beyond the range of the data raised a KeyError (:issue:`21019`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use double backticks on KeyError

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see you added to 0.23.1 below, ok, make these changes there and revert this one

# slicing
# not implemented ATM
# GH9748
# Raises KeyError since the left slice 'a' is not unique
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this issue reference here (gh-....)

# GH9748
# Raises KeyError since the left slice 'a' is not unique
pytest.raises(KeyError, lambda: self.df.loc["a":"b"])
result = self.df.loc["b":"c"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this tests is the same as the 3rd case? if so do we need both? (or if not can you move them together and comment)

# right/left edge we should get the original slice again.
result = ordered_df.loc["a": "d"]
assert_frame_equal(result, ordered_df)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also test the left edge as well

@jreback
Copy link
Contributor

jreback commented May 25, 2018

pls rebase on master (ci failures are fixed)

@fjetter fjetter force-pushed the bugfix/categorical_slicing branch from ec3e07c to 04ca52f Compare June 6, 2018 18:02
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you rebase

@@ -88,7 +88,8 @@ Indexing
- Bug in :meth:`MultiIndex.set_names` where error raised for a ``MultiIndex`` with ``nlevels == 1`` (:issue:`21149`)
- Bug in :class:`IntervalIndex` constructors where creating an ``IntervalIndex`` from categorical data was not fully supported (:issue:`21243`, issue:`21253`)
- Bug in :meth:`MultiIndex.sort_index` which was not guaranteed to sort correctly with ``level=1``; this was also causing data misalignment in particular :meth:`DataFrame.stack` operations (:issue:`20994`, :issue:`20945`, :issue:`21052`)
-
- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to 0.23.2

@topper-123
Copy link
Contributor

Hey @fjetter , want to follow this to the door?

@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

can you merge master and can see where this is

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants