Skip to content

BUG: loc raises inconsistent error on unsorted MultiIndex #12790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

BUG: loc raises inconsistent error on unsorted MultiIndex #12790

wants to merge 1 commit into from

Conversation

adamdivak
Copy link

.loc was fixed to always raise a KeyError with a helpful error message when called on an unsorted MultiIndex DataFrame

Tests ran fine the last time I checked, but if I run them with the latest upstream now I get a totally unrelated ImportError error - I assume it is not related to my changes. Btw this is my first real contribution to a large open source project, I tried to pay attention to everything but let me know if anything needs to be improved!

if isinstance(key, tuple):
required_lexsort_depth = max(required_lexsort_depth, len(key))
if self.lexsort_depth < required_lexsort_depth:
raise KeyError('MultiIndex Slicing requires the index to be '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this error message could be clearer, something like:

'MultiIndex slicing requires the index to be fully lexsorted up to level {required}, lexsort depth is currently {current}'

Just make it clearer which number is the required one and which one is the current.

@jreback jreback added Error Reporting Incorrect or improved errors from pandas MultiIndex labels Apr 4, 2016
@jreback
Copy link
Contributor

jreback commented Apr 4, 2016

I would like to change the actual error, ala, #11897 as well.

class NonLexsortedMultiIndexError(KeyError).

there are several cases which need to be changed.

@adamdivak
Copy link
Author

I agree with changing the message itself, the reason why I didn't do it is because I've seen people actually parsing that string to determine the cause of the error. I would also be happy to introduce the new exception class, let me know if you think it should be part of this PR or a separate thing.

I'll go back to check why other parts of the tests failed.

@@ -1595,6 +1595,8 @@ def get_loc_level(self, key, level=0, drop_level=True):
----------
key : label or tuple
level : int/level name or list thereof
drop_level : bool
drop a level from the index if only a single element is selected
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this for?

@adamdivak
Copy link
Author

I've started digging deeper into the failing tests and realised this is a bit more complicated than I expected, and my understanding of Pandas internals might be limiting here - any comments are welcome.

What I can see now:

My current implementation simply enforces the first, stricter rule, but this means that code that previously ran fine (with or without a PerformanceWarning) now raises a KeyError, which is not really nice. (This is why some of the other tests failed.)

A bit of a guidance would be much appreciated on which solution you prefer, especially if you want me to add the linear-search thing to the currently failing code paths as well.

As an additional note, is it too far-fetched to suggest that the index should be automatically sorted if lexsort_depth < len(key) instead of raising these errors? I think paying attention to all this is pretty annoying from an end-user perspective without any real benefits - if I want to select, I need to sort anyway, so why bother? I assume the reason for not doing this is to avoid implicitly changing the data structures without the user noticing it, right?

@jreback
Copy link
Contributor

jreback commented Apr 4, 2016

so we label your cases 1-3.

  1. ideally change this to NonLexsortedMultiIndexError rather than KeyError, should pass existing tests, though once you change it you need to change the tests to look specifically for this error
  2. PerformanceWarning. Leave this for now, having a partially lex-sorted index is not easy to create and normally won't be, so its a minor issue IMHO.
  3. yes this would be nice to catch this case with a NonLexsortedMultiIndexError.

The reason for the lexsortedness is that you might have to potentially re-order something which you may not want to do. Thus it is up to the user. The basic problem is if someone is doing something which removes the sortedness, and doing this in iteratively (of course this is bad), then forcing a sort EACH time is really bad.

So its kind of a: I will do the best I can showing a warning when needed, but if perf matters AND you are doing iterative indexing (which you shouldn't be anyhow), then it is up to the user to make it sorted.

If you have a good usecase that doesn't fall into this pattern lmk.

BTW, this is basically the same issue of "when to recompute the levels".

@jreback
Copy link
Contributor

jreback commented Apr 4, 2016

btw, try to do this w/o changing any tests (except for trivially as in 1) and see how far you can get. We want to avoid API changes if at all possible.

@adamdivak
Copy link
Author

Ok, I'll introduce the new exception NonLexsortedMultiIndexError. However I am afraid that fixing 3) means that the same tests that were failing in the last build will be still failing, as those are the cases where the PerfWarning was shown previously, so this would be a partially backwards-compatibility-breaking change.

As for making autosort the default: maybe it could be an option, which could even be used in an option context, with which you could disable the automatic sorting behaviour - if you are doing something iteratively where performance matters and sorting hurts, you can disable it as an optimisation. For everyone else, especially novices, understanding an unnecessary exception was saved. We for example work quite a lot with small datasets, where sorting would probably not be noticeable, and we frequently find ourselves getting these errors due to forgetting sorting (for example after merging different datasets).

(Btw I am amazed by your response speed..)

@jreback
Copy link
Contributor

jreback commented Apr 4, 2016

@yosuah

ok see what troubles you have on 3) yeah like to catch those cases if possible.

yeah pls create an issue for auto-sorting. Its not hard to do, just maybe not the default.

@shoyer
Copy link
Member

shoyer commented May 5, 2016

A similar issue can also come up on normal indexes if they aren't sorted:
#13090

So maybe the generic UnsortedIndexError is the way to go?

@jreback
Copy link
Contributor

jreback commented May 5, 2016

yep agree with @shoyer here a generic UnsortedIndex Exception is the way to go with an appropriate message (will be back-compat as it will inherit KeyError).

@jreback
Copy link
Contributor

jreback commented May 5, 2016

@yosuah want to tackle this? (the actual exception issue is #11897)

@jreback
Copy link
Contributor

jreback commented May 25, 2016

@yosuah want to rebase / update?

@jreback
Copy link
Contributor

jreback commented May 31, 2016

@yosuah want to rebase / update

@adamdivak
Copy link
Author

Yeah, sorry for the delay, will rebase and get back to it over the weekend.

@jorisvandenbossche
Copy link
Member

@yosuah DO you have time to rebase and update this?

@jreback
Copy link
Contributor

jreback commented Sep 9, 2016

can you rebase / update?

@jreback
Copy link
Contributor

jreback commented Nov 16, 2016

pls reopen/comment if you can rebase and continue with this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.loc sometimes raises KeyError without an error message when called on an unsorted MultiIndex DataFrame
5 participants