
ENH: pd.MultiIndex.get_loc(np.nan) #28919

Merged
merged 2 commits into from
Jan 9, 2020

Conversation

proost
Contributor

@proost proost commented Oct 11, 2019

MultiIndex.get_loc could not locate NaN when the index values include
missing values.

Background: in a MultiIndex, a missing value is denoted by -1 in the codes and does not appear in self.levels.

As a result, looking up an NA value in self.levels fails.

Previous PR: xref #28783
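
The background above can be checked directly. The following is a minimal sketch, assuming a pandas version in which `MultiIndex` exposes `codes` and `levels`; the example index is borrowed from later in this conversation, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Example index with a missing value in the first level.
idx = pd.MultiIndex.from_arrays([[np.nan, "a", "b"], ["c", "d", "e"]])

# The missing value is not stored as a level entry; only "a" and "b" are.
level0_values = list(idx.levels[0])

# Each row holds an integer code pointing into the level; -1 marks "missing".
level0_codes = idx.codes[0].tolist()

print(level0_values)  # ['a', 'b']
print(level0_codes)   # [-1, 0, 1]
```

Because NaN never appears in `levels`, any lookup that only searches `levels` cannot find it; the code -1 has to be handled explicitly.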

@proost proost changed the title ENH: pd.MultiIndex.get_loc(np.nan) (#19132) ENH: pd.MultiIndex.get_loc(np.nan) Oct 11, 2019
@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Oct 11, 2019
@proost proost requested a review from jreback November 1, 2019 05:30
@proost proost force-pushed the get_loc-nan branch 2 times, most recently from 5da4a8d to 9cf08c1 Compare November 5, 2019 02:51
assert idx.get_loc((np.nan, 1)) == expected


def test_get_indexer_with_missing_value():
Member

Testing get_loc and get_indexer is good, but we don't expect users to call those directly. Is there any user-facing behavior that this changes that should be tested?

Contributor

Can you respond to this? Is there any user-facing change aside from .get_loc itself? In other words, does user-facing indexing change?

Contributor Author

@jreback
.get_slice_bound, __contains__, slice_indexer, and slice_locs are affected by this change.
I have added test cases for them too.

@pep8speaks

pep8speaks commented Nov 21, 2019

Hello @proost! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-07 13:28:41 UTC

@proost proost force-pushed the get_loc-nan branch 7 times, most recently from 14bba33 to a17ef45 Compare November 22, 2019 14:27
@proost
Contributor Author

proost commented Dec 3, 2019

@jbrockmendel @jreback
get_slice_bound, __contains__, slice_indexer, and slice_locs change as a result of this PR. In short, these methods can now index NA values. I think this is better than before: because an NA value is treated like a non-NA value, indexing is more accurate when a MultiIndex contains NA values.

Given:

idx = MultiIndex.from_arrays([[np.nan, "a", "b"], ["c", "d", "e"]])

Case 1: get_slice_bound

idx.get_slice_bound(np.nan, side="left", kind="ix")
Before: raises an exception
After: 0

Case 2: slice_indexer

idx.slice_indexer(start=np.nan, end="a")
Before: raises an exception
After: slice(0, 2, None)

Case 3: slice_locs

idx.slice_locs(start=np.nan, end=("a", "d"))
Before: raises an exception
After: (0, 2)

Case 4: __contains__

assert np.nan in idx
Before: raises an exception
After: does not raise an exception

The problem is that once this PR is merged, __contains__ can return True while "isin" returns False, because "isin" is not affected by this change. So if this PR is merged, "isin" needs to change as well for consistency.
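
The cases listed above can be reproduced with a short script. This is a sketch only, assuming a pandas release that includes this PR (it is milestoned for 1.0). The `get_slice_bound` case is left out because its `kind` argument was later deprecated and removed, so a call written for one version may not run on another.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays([[np.nan, "a", "b"], ["c", "d", "e"]])

# __contains__: a bare NaN is looked up in the first level.
has_nan = np.nan in idx

# slice_indexer and slice_locs: NaN is accepted as the start bound.
indexer = idx.slice_indexer(start=np.nan, end="a")
locs = idx.slice_locs(start=np.nan, end=("a", "d"))

print(has_nan)
print(indexer)
print(locs)
```

Both slicing calls should cover positions 0 and 1, matching the "After" values reported in the comment.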

@jreback
Contributor

jreback commented Dec 10, 2019

The problem is that once this PR is merged, __contains__ can return True while "isin" returns False, because "isin" is not affected by this change. So if this PR is merged, "isin" needs to change as well for consistency.

can you show what you mean here?

@proost
Contributor Author

proost commented Dec 11, 2019

@jreback

idx = MultiIndex.from_arrays([[np.nan, "a", "b"], ["c", "d", "e"]])
assert np.nan in idx

does not raise an exception. But

idx.isin([np.nan], level=0)
array([False, False, False])

__contains__ says the NA value is in level 0 of the index, but isin says it is not.

So if this PR is merged, changing isin for consistency would be good.
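
For comparison, the consistent behavior asked for here can be sketched as follows, assuming a pandas release that includes both this PR and the `.isin` change that was later added to it. As a further assumption, the example uses numeric levels rather than the string levels from the thread, so the result does not depend on how a given pandas version stores strings.

```python
import numpy as np
import pandas as pd

# Assumption: numeric levels, chosen only to keep the example version-robust.
idx = pd.MultiIndex.from_arrays([[np.nan, 1.0, 2.0], [10, 20, 30]])

# Membership test through __contains__.
in_index = np.nan in idx

# Element-wise membership test on level 0.
mask = idx.isin([np.nan], level=0)

print(in_index)
print(mask)
```

With both changes in place, `in_index` is true and `mask` flags only the first row, so the two checks agree.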

@jreback
Contributor

jreback commented Dec 27, 2019

Can you merge master? I will look again.

proost added a commit to proost/pandas that referenced this pull request Dec 28, 2019
proost added a commit to proost/pandas that referenced this pull request Dec 28, 2019
proost added a commit to proost/pandas that referenced this pull request Dec 28, 2019
proost added a commit to proost/pandas that referenced this pull request Dec 30, 2019
@jreback jreback added this to the 1.0 milestone Jan 3, 2020
@jreback
Contributor

jreback commented Jan 3, 2020

So if this PR is merged, changing isin for consistency would be good.

ok this PR looks good. can you add in support for .isin() in this PR? I think it would make sense to merge these simultaneously (e.g. you can also build on this one if that is easier)

proost added a commit to proost/pandas that referenced this pull request Jan 4, 2020
@proost
Contributor Author

proost commented Jan 4, 2020

@jreback
Okay, I have changed .isin and added tests.

proost added a commit to proost/pandas that referenced this pull request Jan 4, 2020
Contributor

@jreback jreback left a comment

looks good. some comments, ping on green.

@toobaz @jbrockmendel @TomAugspurger if you'd care to have a look

else:
return np.lib.arraysetops.in1d(level_codes, sought_labels)
return np.zeros(len(levs), dtype=np.bool_)
return levs.isin(values)
Member

the edits from L3408 down to here look like they are just nice cleanups independent of the rest of this PR. is that accurate?

Contributor Author

@jbrockmendel
For NA values, this will be more accurate once #30677 is fixed. "Index.isin" still has a bug, but in terms of being able to check for NA values, this version is more accurate.

[
([("b", np.nan)], np.array([False, False, True]), None,),
([np.nan, "a"], np.array([True, True, False]), 0),
(["d", np.nan], np.array([False, True, True]), 1),
Member

is the issue specific to np.nan, or are there other NA values worth testing?

Contributor Author

@jbrockmendel
xref #30677. Yes, it is specific to np.nan. In "Index", np.nan, np.NaT, and None are distinguishable from one another rather than being treated as a single NA value. So if a MultiIndex mixes np.nan, np.NaT, and None together, the result of ".isin" differs from what we would expect.

@jreback jreback merged commit 0721841 into pandas-dev:master Jan 9, 2020
@jreback
Copy link
Contributor

jreback commented Jan 9, 2020

thanks @proost very nice!

@proost proost deleted the get_loc-nan branch January 9, 2020 10:18
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.MultiIndex.get_loc(np.nan)
4 participants