Skip to content

PERF: tighten _should_compare for MultiIndex #42231

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jul 1, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 18 additions & 0 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -5345,6 +5345,16 @@ def _get_indexer_non_comparable(
"""
if method is not None:
other = unpack_nested_dtype(target)
if self._is_multi ^ other._is_multi:
kind = other.dtype.type if self._is_multi else self.dtype.type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do tests hit this? e.g. as you didn't change anything

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the affected cases was not tested; just added a test for that case.

raise TypeError(
f"'<' not supported between instances of {kind} and 'tuple'"
)
elif self._is_multi and other._is_multi:
assert self.nlevels != other.nlevels
# Python allows comparison between tuples of different lengths,
# but for our purposes such a comparison is not meaningful.
raise TypeError("'<' not supported between tuples of different lengths")
raise TypeError(f"Cannot compare dtypes {self.dtype} and {other.dtype}")

no_matches = -1 * np.ones(target.shape, dtype=np.intp)
Expand Down Expand Up @@ -5474,6 +5484,14 @@ def _should_compare(self, other: Index) -> bool:

other = unpack_nested_dtype(other)
dtype = other.dtype
if other._is_multi:
if not self._is_multi:
# other contains only tuples so unless we are object-dtype,
# there can never be any matches
return self._is_comparable_dtype(dtype)
return self.nlevels == other.nlevels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the change that is breaking MultiIndex broadcasting. If one has 3 levels and the other has 2, then this is False. Previously these were comparable and so would be compared and expanded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. do you know what the calling method is in the problematic case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Walking back, previous is

if not self._should_compare(target) and not self._should_partial_index(target):

then

indexer = self.get_indexer(

Here self is the Series with 2 levels and other is the DataFrame with 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big change is driven by the return difference of self._should_compare(target). Before this patch it returned True, so the if not ... block was skipped. It now returns False, and so it incorrectly shortcuts and fills with an NA value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, i think ive got a handle on whats going on here. The long-term fix will be in MultiIndex.get_indexer, but for now this should just be reverted.

# TODO: we can get more specific requiring levels are comparable?

return self._is_comparable_dtype(dtype) or is_object_dtype(dtype)

def _is_comparable_dtype(self, dtype: DtypeObj) -> bool:
Expand Down
9 changes: 9 additions & 0 deletions pandas/tests/indexes/multi/test_indexing.py
Original file line number Diff line number Diff line change
Expand Up @@ -457,6 +457,15 @@ def test_get_indexer_kwarg_validation(self):
with pytest.raises(ValueError, match=msg):
mi.get_indexer(mi[:-1], tolerance="piano")

def test_get_indexer_mismatched_nlevels(self):
mi = MultiIndex.from_product([range(3), ["A", "B"]])

other = MultiIndex.from_product([range(3), ["A", "B"], range(2)])

msg = "tuples of different lengths"
with pytest.raises(TypeError, match=msg):
mi.get_indexer(other, method="pad")


def test_getitem(idx):
# scalar
Expand Down