Skip to content

BUG: get_indexer rountripping through string dtype #56013

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Nov 16, 2024
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.4.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ Bug fixes
~~~~~~~~~
- Bug in :class:`Series` constructor raising DeprecationWarning when ``index`` is a list of :class:`Series` (:issue:`55228`)
- Fixed bug in :meth:`DataFrame.__setitem__` casting :class:`Index` with object-dtype to PyArrow backed strings when ``infer_string`` option is set (:issue:`55638`)
- Fixed bug in :meth:`Index.get_indexer` round-tripping through string dtype when ``infer_string`` is enabled (:issue:`55834`)
- Fixed bug in :meth:`Index.insert` casting object-dtype to PyArrow backed strings when ``infer_string`` option is set (:issue:`55638`)
-

Expand Down
3 changes: 3 additions & 0 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -6591,6 +6591,9 @@ def _maybe_cast_listlike_indexer(self, target) -> Index:
"""
Analogue to maybe_cast_indexer for get_indexer instead of get_loc.
"""
if not hasattr(target, "dtype") and self.dtype == object:
# Avoid inference for object since we are casting back later anyway
return Index(target, dtype=self.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this might prevent us from doing intentional inference in _maybe_downcast_for_indexing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel could you think of a concrete situation where this might happen?

(the tests are all passing, so don't seem to cover that?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at _maybe_downcast_for_indexing more closely, we are downcasting object dtype index to the other dtype, because this can give a performance improvement, right?

This might actually also be true for object/string combo, but I think correctness if more important, and if missing values are handled differently by downcasting, we should avoid that for now for this combo, until we can solve that better.

I will just make the check a bit more specific for object dtype then (even more ugly, but at least fixing this issue)

return ensure_index(target)

@final
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/indexes/object/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
import pytest

from pandas._libs.missing import is_matching_na
import pandas.util._test_decorators as td

import pandas as pd
from pandas import Index
Expand Down Expand Up @@ -55,6 +56,14 @@ def test_get_indexer_with_NA_values(
expected = np.array([0, 1, -1], dtype=np.intp)
tm.assert_numpy_array_equal(result, expected)

@td.skip_if_no("pyarrow")
def test_get_indexer_infer_string_missing_values(self):
# GH#55834
idx = Index(["a", "b", None], dtype="object")
result = idx.get_indexer([None, "x"])
expected = np.array([2, -1])
tm.assert_numpy_array_equal(result, expected, check_dtype=False)


class TestGetIndexerNonUnique:
def test_get_indexer_non_unique_nas(self, nulls_fixture):
Expand Down