PERF: tighten _should_compare for MultiIndex #42231

Merged
jreback merged 3 commits into pandas-dev:master from the bug-mi-should_compare branch on Jul 1, 2021

Conversation

jbrockmendel
Member

  • closes #xxxx
  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

@@ -5289,6 +5289,16 @@ def _get_indexer_non_comparable(
"""
if method is not None:
other = unpack_nested_dtype(target)
if self._is_multi ^ other._is_multi:
kind = other.dtype.type if self._is_multi else self.dtype.type
Contributor

do tests hit this? e.g. as you didn't change anything

Member Author

One of the affected cases was not tested; just added a test for that case.

jreback added the Performance (memory or execution speed performance) label Jun 25, 2021
@jreback
Contributor

jreback commented Jun 25, 2021

is there a specific benchmark that this improves?

@jbrockmendel
Member Author

is there a specific benchmark that this improves?

No. The motivation is yak-shaving that traces back to getting rid of _convert_list_indexer

@jbrockmendel
Member Author

@jreback gentle ping (a whole mess of MultiIndex PRs yak-shaving inconsistencies)

jreback added this to the 1.4 milestone Jul 1, 2021
jreback merged commit 381dd06 into pandas-dev:master Jul 1, 2021
jbrockmendel deleted the bug-mi-should_compare branch July 1, 2021 23:08
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
@bashtage
Contributor

@jbrockmendel This PR seems to have introduced a bug. I verified this with the code below and a bisect.

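import pandas as pd

# Target frame indexed by a 3-level MultiIndex
mi = pd.MultiIndex.from_product([["a", "b", "c"], [1, 2, 3], ["z", "y", "x"]])
df = pd.DataFrame(index=mi, dtype=float)
# Series indexed by only the first two of those levels
mi2 = pd.MultiIndex.from_product([["a", "b", "c"], [1, 2, 3]])
s = pd.Series(index=mi2, dtype=float)
s.iloc[:] = 3.14
# Column assignment aligns s against df.index
df["new"] = s
print(df)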

now returns

        new
a 1 z  NaN
    y  NaN
    x  NaN
  2 z  NaN
    y  NaN
    x  NaN
  3 z  NaN
    y  NaN
    x  NaN

Before this patch, it did a broadcast assignment to the remaining MultiIndex levels, i.e.:

        new
a 1 z  3.14
    y  3.14
    x  3.14
  2 z  3.14
    y  3.14
    x  3.14
  3 z  3.14
    y  3.14
    x  3.14

@bashtage
Contributor

xref #40186

# other contains only tuples so unless we are object-dtype,
# there can never be any matches
return self._is_comparable_dtype(dtype)
return self.nlevels == other.nlevels
Contributor

This is the change that is breaking MultiIndex broadcasting. If one has 3 levels and the other has 2, then this is False. Previously these were comparable and so would be compared and expanded.
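
To make the point above concrete, here is a minimal pure-Python sketch of a level-count check. It models each MultiIndex as a plain list of label tuples; `nlevels` and `should_compare_tightened` are hypothetical stand-ins written for illustration, not the pandas internals.

```python
from itertools import product


def nlevels(labels):
    """Number of levels in a toy MultiIndex, modelled as a list of equal-length tuples."""
    return len(labels[0])


def should_compare_tightened(self_labels, other_labels):
    """Hypothetical stand-in for the added check: comparable only if level counts match."""
    return nlevels(self_labels) == nlevels(other_labels)


# Same label sets as the reproduction posted in this thread
three_level = list(product(["a", "b", "c"], [1, 2, 3], ["z", "y", "x"]))
two_level = list(product(["a", "b", "c"], [1, 2, 3]))

print(should_compare_tightened(two_level, three_level))  # False
print(should_compare_tightened(two_level, two_level))    # True
```

With 2 levels on one side and 3 on the other, the check rejects the pair before any label comparison takes place.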

Member Author

thanks. do you know what the calling method is in the problematic case?

Contributor

Walking back up the call stack, the previous frame is

if not self._should_compare(target) and not self._should_partial_index(target):

then

indexer = self.get_indexer(

Here self is the Series with 2 levels and other is the DataFrame with 3.

Contributor

The behavior change comes from the different return value of self._should_compare(target). Before this patch it returned True, so the if not ... block was skipped. It now returns False, so the code takes the shortcut and incorrectly fills with NA values.
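
The control flow described above can be sketched with a toy model. This is not how `MultiIndex.get_indexer` is implemented: `get_indexer_sketch` and `reindex_sketch` are hypothetical helpers, and matching on the leading levels is an assumption used only to mimic the broadcast seen before the patch.

```python
import math
from itertools import product


def get_indexer_sketch(index_labels, target_labels, should_compare):
    """Toy model: position of each target label in index_labels, or -1 if not found."""
    if not should_compare:
        # Short-circuit path: the indexes are treated as non-comparable,
        # so every target label is reported as missing.
        return [-1] * len(target_labels)
    positions = {label: i for i, label in enumerate(index_labels)}
    width = len(index_labels[0])
    # Assumed partial match: compare only the leading `width` levels.
    return [positions.get(t[:width], -1) for t in target_labels]


def reindex_sketch(values, indexer):
    """Take values by position, filling missing positions (-1) with NaN."""
    return [values[i] if i != -1 else math.nan for i in indexer]


series_index = list(product(["a", "b", "c"], [1, 2, 3]))
series_values = [3.14] * len(series_index)
frame_index = list(product(["a", "b", "c"], [1, 2, 3], ["z", "y", "x"]))

# should_compare=True models the old behavior, False the behavior after the patch
before = reindex_sketch(series_values, get_indexer_sketch(series_index, frame_index, True))
after = reindex_sketch(series_values, get_indexer_sketch(series_index, frame_index, False))

print(before[:3])                         # [3.14, 3.14, 3.14]
print(all(math.isnan(v) for v in after))  # True
```

In this model the only difference between the two results is the boolean passed in, which mirrors how a single changed return value flips the whole assignment from broadcast values to NaN.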

Member Author

OK, I think I've got a handle on what's going on here. The long-term fix will be in MultiIndex.get_indexer, but for now this should just be reverted.

jbrockmendel added a commit that referenced this pull request Jul 16, 2021
jreback pushed a commit that referenced this pull request Jul 25, 2021
CGe0516 pushed a commit to CGe0516/pandas that referenced this pull request Jul 29, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
Labels
MultiIndex, Performance (memory or execution speed performance)

3 participants