Skip to content

BUG: MultiIndex.get_indexer not matching nan values #49442

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Nov 7, 2022

Conversation

phofl
Copy link
Member

@phofl phofl commented Nov 1, 2022

This was bugging me for quite some time.

The previous solution did not provide a way to differentiate between matching missing values (e.g. NaN and NaN) and values from other that were missing in self. With shifting NaNs to 1 and values from other that are not in self to 0, we can calculate the correct result.

@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Nov 1, 2022
@lukemanley
Copy link
Member

Nice. Does this close all the issues linked in #48877? Are any of those tests worth adding?

@phofl
Copy link
Member Author

phofl commented Nov 1, 2022

It should, and yes. I'd like to merge the tests of your pr after this is through if you are ok with that

phofl and others added 2 commits November 2, 2022 19:41
# with positive integers (-1 for NaN becomes 1). This enables us to
# differentiate between values that are missing in other and matching
# NaNs. We will set values that are not found to 0 later:
codes = (np.array(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be split onto multiple lines

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -36,6 +36,8 @@ from pandas._libs.missing cimport (
is_matching_na,
)

multiindex_nulls_shift = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment about what this is

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@mroeschke mroeschke added this to the 2.0 milestone Nov 4, 2022
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM cc @jbrockmendel merge when ready

@mroeschke mroeschke merged commit 91e1fdc into pandas-dev:main Nov 7, 2022
@mroeschke
Copy link
Member

Thanks @phofl

phofl added a commit to phofl/pandas that referenced this pull request Nov 9, 2022
* BUG: MultiIndex.get_indexer not matching nan values

* Fix tests

* Use variable for shifts

* Fix mypy

* Add gh ref

* Update pandas/_libs/index.pyx

Co-authored-by: Matthew Roeschke <[email protected]>

* Adress review

Co-authored-by: Matthew Roeschke <[email protected]>
noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* BUG: MultiIndex.get_indexer not matching nan values

* Fix tests

* Use variable for shifts

* Fix mypy

* Add gh ref

* Update pandas/_libs/index.pyx

Co-authored-by: Matthew Roeschke <[email protected]>

* Adress review

Co-authored-by: Matthew Roeschke <[email protected]>
@phofl phofl deleted the multiindex_engine branch November 10, 2022 20:42
@lukemanley lukemanley mentioned this pull request Jan 4, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: get_indexer for MultiIndex with nans returns wrong indexer
4 participants