-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: CategoricalIndex.searchsorted doesn't return a scalar if input was scalar #21019
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #21019 +/- ##
==========================================
+ Coverage 91.85% 91.85% +<.01%
==========================================
Files 153 153
Lines 49549 49551 +2
==========================================
+ Hits 45512 45514 +2
Misses 4037 4037
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1259,6 +1259,9 @@ Indexing | |||
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`) | |||
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`) | |||
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`) | |||
- Bug in ``CategoricalIndex.searchsorted`` where the method didn't return a scalar when the input values was scalar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can add this PR number (21019) as the referenced issue if there isn't already one.
ff210ef
to
01ade5a
Compare
pandas/core/indexes/category.py
Outdated
codes = self.categories.get_loc(key) | ||
if (codes == -1): | ||
raise KeyError(key) | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is redundant, KeyError is already raised by get_loc
@@ -1341,6 +1341,8 @@ def searchsorted(self, value, side='left', sorter=None): | |||
|
|||
if -1 in values_as_codes: | |||
raise ValueError("Value(s) to be inserted must be in categories.") | |||
if is_scalar(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would rather do this in pandas/core/base.py/searchsorted
use is_scalar
rather than a numpy function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the issue is rather with the helper function
_get_codes_for_values
which always returns an array. I didn't want to change it there since the way it is written right now only works for array like objects. Inbase.py
we're already callingsearchsorted
directly on the numpy array, i.e. it obeys the in/output shape - I'm using
is_scalar
here, is this wrong? Are you referring to thenp.asscalar
? I couldn't find a suitable pandas function for that (other than ~values[0]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I c, change here is ok
don't use np.asscalar, rather use .item()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I found out in #21699, numpy.searchsorted doesn't like python ints, but needs numpy ints to archieve its speed.
>>> n = 1_000_000
>>> c = pd.Categorical(list('a' * n + 'b' * n + 'c' * n), ordered=True)
>>> %timeit c.codes.searchsorted(1) # python int
7 ms ± 24.7 µs per loop
>>> c.codes.dtype
int8
>>> %timeit c.codes.searchsorted(np.int8(1))
2.46 µs ± 82.4 ns per loop
So the scalar version should be values_as_codes = values_as_codes[0]
to avoid speed loss.
exp = np.array([2], dtype=np.intp) | ||
tm.assert_numpy_array_equal(res_cat, exp) | ||
tm.assert_numpy_array_equal(res_ser, exp) | ||
exp = np.int64(2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, odd that this doesn't fail, this should be a platform indexer (intp
)
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1259,6 +1259,9 @@ Indexing | |||
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`) | |||
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`) | |||
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`) | |||
- Bug in ``CategoricalIndex.searchsorted`` where the method didn't return a scalar when the input values was scalar (:issue:`21019`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the :func:
syntax
didn't -> did not
does this have an associated issue? (pls do a search if not) |
Could this be a reason the slowness seen in #20395? (i.e. searchsorted returning wrong type, and then pandas taking a different, slower, code path...) |
@jreback The only issue I could find which seems to be related is #9748 where an open TODO is the slicing on categoricals. From what I can see slicing is still not working for integers, though. @topper-123 I don't think these are related. You should only hit the |
@fjetter, actually that operation is/should be slicing and use searchsorted, as the index is a monotonic index, and I'll look into that again with this angle. I won't high-jack this thread anymore. |
@@ -1341,6 +1341,8 @@ def searchsorted(self, value, side='left', sorter=None): | |||
|
|||
if -1 in values_as_codes: | |||
raise ValueError("Value(s) to be inserted must be in categories.") | |||
if is_scalar(value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I c, change here is ok
don't use np.asscalar, rather use .item()
result = ordered_df.loc["a":"e"] | ||
assert_frame_equal(result, ordered_df) | ||
|
||
df_slice = ordered_df.loc["a":"b"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this result looks suspect. both a and b are in the categories and its ordered?
also, don't use label based indexers to select the expected, rather use .iloc
so there is no ambiguity (IOW you are making an expected value where it is not clear what is the answer)
e100e74
to
ec3e07c
Compare
I refactored the tests and hope the intention is a bit clearer now. Slicing of the categorical should behave similar to a ordinary index (at least if it is ordered) |
Tests fail because of an import error of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. just some comment on tests. I think it might be worthwhile to update the categorical.rst with a small example of this as well.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1289,6 +1289,9 @@ Indexing | |||
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`) | |||
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`) | |||
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`) | |||
- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to 0.23.1
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -1289,6 +1289,9 @@ Indexing | |||
- Bug in performing in-place operations on a ``DataFrame`` with a duplicate ``Index`` (:issue:`17105`) | |||
- Bug in :meth:`IntervalIndex.get_loc` and :meth:`IntervalIndex.get_indexer` when used with an :class:`IntervalIndex` containing a single interval (:issue:`17284`, :issue:`20921`) | |||
- Bug in ``.loc`` with a ``uint64`` indexer (:issue:`20722`) | |||
- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`) | |||
- Bug in :class:`CategoricalIndex` where slicing beyond the range of the data raised a KeyError (:issue:`21019`) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double backticks on KeyError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i see you added to 0.23.1 below, ok, make these changes there and revert this one
# slicing | ||
# not implemented ATM | ||
# GH9748 | ||
# Raises KeyError since the left slice 'a' is not unique |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this issue reference here (gh-....)
# GH9748 | ||
# Raises KeyError since the left slice 'a' is not unique | ||
pytest.raises(KeyError, lambda: self.df.loc["a":"b"]) | ||
result = self.df.loc["b":"c"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this tests is the same as the 3rd case? if so do we need both? (or if not can you move them together and comment)
# right/left edge we should get the original slice again. | ||
result = ordered_df.loc["a": "d"] | ||
assert_frame_equal(result, ordered_df) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also test the left edge as well
pls rebase on master (ci failures are fixed) |
ec3e07c
to
04ca52f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rebase
@@ -88,7 +88,8 @@ Indexing | |||
- Bug in :meth:`MultiIndex.set_names` where error raised for a ``MultiIndex`` with ``nlevels == 1`` (:issue:`21149`) | |||
- Bug in :class:`IntervalIndex` constructors where creating an ``IntervalIndex`` from categorical data was not fully supported (:issue:`21243`, issue:`21253`) | |||
- Bug in :meth:`MultiIndex.sort_index` which was not guaranteed to sort correctly with ``level=1``; this was also causing data misalignment in particular :meth:`DataFrame.stack` operations (:issue:`20994`, :issue:`20945`, :issue:`21052`) | |||
- | |||
- Bug in :func:`CategoricalIndex.searchsorted` where the method did not return a scalar when the input values was scalar (:issue:`21019`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move to 0.23.2
Hey @fjetter , want to follow this to the door? |
can you merge master and can see where this is |
CategoricalIndex.searchsorted
returns the wrong shape for scalar input. Numpy arrays and all other index types return a scalar if the input is a scalar, but theCategoricalIndex
does notFor example
This issue also affects slicing on sorted/ordered categoricals, which is why I've written another test for the slicing.
git diff upstream/master -u -- "*.py" | flake8 --diff