Bug in loc not raising KeyError with MultiIndex containing no longer used levels #41358

Merged
merged 9 commits into from
May 12, 2021

Conversation

@phofl
Member

phofl commented May 6, 2021

Stumbled across a bug in MultiIndex.reindex, which caused the ValueError from the OP.
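A minimal sketch of the situation named in the PR title, assuming a pandas version that includes this fix (1.3 or later). The frame and the `< 30` filter are illustrative choices, not taken from this page. Filtering rows leaves "b" in the first level even though no row uses it any more, and selecting that stale label is then expected to raise a KeyError like any other missing key.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 23, 34, 45]}, index=[list("aabb"), [0, 1, 2, 3]])

# Boolean filtering drops every "b" row, but the level itself is not pruned.
subset = df.loc[df["A"] < 30, :]
level0 = list(subset.index.levels[0])           # still lists the stale label
used0 = list(subset.index.get_level_values(0))  # labels actually in use

# Selecting the stale label should fail like any other missing key.
try:
    subset.loc[["b"], :]
    outcome = "no error"
except (KeyError, ValueError) as err:
    outcome = type(err).__name__

# remove_unused_levels() prunes the stale entry from the level.
pruned0 = list(subset.index.remove_unused_levels().levels[0])
```

Calling `remove_unused_levels()` is optional here; it only shows that the stale label lives in the level metadata rather than in the rows.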

phofl added the Indexing (related to indexing on series/frames, not to indexes themselves) and MultiIndex labels May 6, 2021
@attack68
Contributor

attack68 commented May 6, 2021

Not sure if this is relevant to this PR, but I wanted to raise the point that KeyErrors with MultiIndexes are, I think, dangerous, because a MultiIndex cannot easily be reindexed to make a missing key valid.

For example, if you wanted to reindex a MultiIndex and insert a single key, a, into one level of a multilevel index, it is ambiguous how that should be done. In the exhaustive case, pairing a with every possible combination of keys in the other levels, the number of combinations will quickly exhaust memory.
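To make the size concern concrete, here is a small sketch of the exhaustive case. It assumes a hypothetical policy of pairing a new key "c" with every existing second-level value; the frame is illustrative and the policy is only one of several possible readings of "insert a key".

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 23, 34, 45]}, index=[list("aabb"), [0, 1, 2, 3]])

# Hypothetical policy: build the full Cartesian product of level values,
# including the new first-level key "c".
full = pd.MultiIndex.from_product([["a", "b", "c"], [0, 1, 2, 3]])
expanded = df.reindex(index=full)

n_before = len(df)                            # rows in the original frame
n_after = len(expanded)                       # rows after the exhaustive reindex
n_missing = int(expanded["A"].isna().sum())   # rows that exist only as NaN
```

The row count here goes from 4 to 12, with 8 rows holding only NaN; the product grows with the size of every level involved.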

I previously raised an issue where I argued the opposite (basing my opinion solely on consistency with single indexes), but revised it later after I realised it could be unworkable.

@phofl
Member Author

phofl commented May 6, 2021

Yes, I think I remember this discussion, but this does not really apply here, since "b" is not in the Index at all, the same as

import pandas as pd

df = pd.DataFrame({"A": [12, 23, 34, 45]}, index=[list("aabb"), [0, 1, 2, 3]])
df.loc[["c"], :]  # "c" is not a label in the first level

I think the discussion was about something like:

df.loc[("a", 3), :]

where both elements are in the index, but the combination of both is not?
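The two cases can be told apart with a short sketch. It assumes a pandas version that includes this fix, and `loc_outcome` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 23, 34, 45]}, index=[list("aabb"), [0, 1, 2, 3]])


def loc_outcome(key):
    """Return "ok" if df.loc[key, :] succeeds, else "KeyError"."""
    try:
        df.loc[key, :]
        return "ok"
    except KeyError:
        return "KeyError"


# Case 1: the label is absent from the first level altogether.
missing_label = loc_outcome(["c"])

# Case 2: each element exists in its own level, but the pair is not a row.
in_levels = ("a" in df.index.levels[0]) and (3 in df.index.levels[1])
combo_in_index = ("a", 3) in df.index
missing_combo = loc_outcome(("a", 3))

# For comparison, a pair that is an actual row.
present_combo = loc_outcome(("a", 1))
```

Both lookups fail, but for different reasons: the first key is unknown to the index, while the second is a valid label in each level that simply never occurs as a combination.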

@@ -104,3 +104,14 @@ def test_reindex_non_unique():
    msg = "cannot handle a non-unique multi-index!"
    with pytest.raises(ValueError, match=msg):
        a.reindex(new_idx)


@pytest.mark.parametrize("values", [[["a"], ["x"]], [[], []]])
Contributor

This was user-facing in Series.reindex, right? Can you add the user-facing part of this test as well?

Member Author

Done, but it is essentially the same.

@attack68
Contributor

attack68 commented May 7, 2021

> Yes, I think I remember this discussion, but this does not really apply here, since "b" is not in the Index at all, the same as
>
> df = pd.DataFrame({"A": [12,23,34,45]}, index = [list("aabb"), [0,1,2,3]])
> df.loc[["c"], :]
>
> I think the discussion was about something like:
>
> df.loc[("a", 3), :]
>
> where both elements are in the index, but the combination of both is not?

But my point in this context is that if a user wants to slice over some value that is not in df.index, then for a single index they have two options:

  • the index can be reindexed (to include "c"), which is the suggested route given the introduction of a KeyError.
  • the indexer arg can be altered to explicitly exclude values missing from df.index (i.e. remove "c").

Since MultiIndexes cannot be reindexed without ambiguity, or without expanding the number of level combinations until memory fills (in this case adding "c" would produce (c,0), (c,1), (c,2), (c,3), which gets exponentially worse as levels are added), the user is restricted, in the case of MultiIndexes, to:

  • the slice must be altered.

This means that, for each level, the user must compare their indexer args against that level and exclude those that are not present.

I think this level of restriction will be more problematic than simply not returning values that don't exist?
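A sketch of the "alter the slice" option for a single level, under the assumption that only first-level keys are being requested. The `wanted` list is illustrative, and the boolean-mask variant is an alternative that is not mentioned in this thread.

```python
import pandas as pd

df = pd.DataFrame({"A": [12, 23, 34, 45]}, index=[list("aabb"), [0, 1, 2, 3]])

wanted = ["a", "c"]  # "c" is not present in the first level
level0_values = df.index.get_level_values(0)

# Option: drop the keys that do not occur before calling .loc.
present = [key for key in wanted if key in level0_values]
result = df.loc[present, :]

# Alternative: a boolean mask silently ignores keys that do not occur.
masked = df[level0_values.isin(wanted)]
```

With several levels, the same comparison would have to be repeated per level, which is the extra burden described above.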

@phofl
Member Author

phofl commented May 7, 2021

Ah ok, I did not remember this. I think this is a discussion we should have, but, as you mentioned above, this PR is probably not the right place, since it makes behavior more consistent and does not really change anything. Maybe ping on the other issue where the initial discussion was?

@attack68
Contributor

attack68 commented May 8, 2021

OK, take a look here: #39424 (sorry, not #41358).

jreback added this to the 1.3 milestone May 12, 2021
jreback merged commit c9f2ecc into pandas-dev:master May 12, 2021
@jreback
Contributor

jreback commented May 12, 2021

Thanks @phofl, very nice.

@attack68 thanks for all of the comments & notes on MI indexing. Let's have that discussion in the issue you pointed to: #39424

phofl deleted the 41170 branch May 12, 2021 09:19
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex
3 participants