BUG: loc raises inconsistent error on unsorted MultiIndex #12790

adamdivak · 2016-04-03T22:54:33Z

closes .loc sometimes raises KeyError without an error message when called on an unsorted MultiIndex DataFrame #12660
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

.loc was fixed to always raise a KeyError with a helpful error message when called on an unsorted MultiIndex DataFrame

Tests ran fine the last time I checked, but when I run them against the latest upstream I now get a totally unrelated ImportError, which I assume has nothing to do with my changes. By the way, this is my first real contribution to a large open source project. I tried to pay attention to everything, but let me know if anything needs to be improved!

Closes GH12660

onesandzeroes · 2016-04-04T01:20:11Z

pandas/indexes/multi.py

+        if isinstance(key, tuple):
+            required_lexsort_depth = max(required_lexsort_depth, len(key))
+        if self.lexsort_depth < required_lexsort_depth:
+            raise KeyError('MultiIndex Slicing requires the index to be '


I think this error message could be clearer, something like:

'MultiIndex slicing requires the index to be fully lexsorted up to level {required}, lexsort depth is currently {current}'

Just make it clearer which number is the required one and which one is the current.
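A minimal sketch of how the suggested wording could be built. The helper name `lexsort_error_message` is hypothetical and not part of pandas; only the message template comes from the suggestion above.

```python
def lexsort_error_message(required, current):
    """Build the proposed message (hypothetical helper, not a pandas API)."""
    template = (
        "MultiIndex slicing requires the index to be fully lexsorted "
        "up to level {required}, lexsort depth is currently {current}"
    )
    return template.format(required=required, current=current)


# Naming the two placeholders makes it obvious which number is which.
print(lexsort_error_message(required=2, current=0))
```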

jreback · 2016-04-04T02:46:13Z

I would like to change the actual error as well, à la #11897.

class NonLexsortedMultiIndexError(KeyError).

there are several cases which need to be changed.
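A rough sketch of what such a class could look like. Only the class name and its `KeyError` base come from the comment above; the `required` / `current` attributes and the `check_lexsort_depth` helper are assumptions added for illustration.

```python
class NonLexsortedMultiIndexError(KeyError):
    """Sketch of the proposed exception; the attributes are an assumption."""

    def __init__(self, required, current):
        self.required = required
        self.current = current
        super().__init__(
            "MultiIndex slicing requires the index to be lexsorted up to "
            "level {}, lexsort depth is currently {}".format(required, current)
        )


def check_lexsort_depth(current, required):
    # Hypothetical stand-in for the depth check shown in the diff.
    if current < required:
        raise NonLexsortedMultiIndexError(required, current)


try:
    check_lexsort_depth(current=0, required=2)
except KeyError as err:
    # Handlers written for KeyError still catch the subclass.
    caught = err

print(type(caught).__name__, caught.required, caught.current)
```

Because the class derives from `KeyError`, callers can test the exception type or its attributes instead of parsing the message string.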

adamdivak · 2016-04-04T08:56:15Z

I agree with changing the message itself. The reason I didn't do it is that I've seen people actually parsing that string to determine the cause of the error. I would also be happy to introduce the new exception class; let me know if you think it should be part of this PR or a separate one.

I'll go back to check why other parts of the tests failed.

jreback · 2016-04-04T12:16:49Z

pandas/indexes/multi.py

@@ -1595,6 +1595,8 @@ def get_loc_level(self, key, level=0, drop_level=True):
        ----------
        key : label or tuple
        level : int/level name or list thereof
+        drop_level : bool
+            drop a level from the index if only a single element is selected


what is this for?

adamdivak · 2016-04-04T21:43:41Z

I've started digging deeper into the failing tests and realised this is a bit more complicated than I expected, and my understanding of Pandas internals might be limiting here - any comments are welcome.

What I can see now:

1. On some code paths, the MI is required to be lexsorted to at least the selection depth; otherwise a KeyError is raised (https://github.com/pydata/pandas/blob/master/pandas/indexes/multi.py#L1823)
2. On some code paths, selection from an unsorted MI is supported through a linear search, and a PerformanceWarning is raised to notify the user about possible consequences (https://github.com/pydata/pandas/blob/master/pandas/indexes/multi.py#L1575). However, I believe there is also a bug here: the warning says "PerformanceWarning: indexing past lexsort depth may impact performance.", but it is not raised if the MultiIndex is not lexsorted, it is raised if the MI is not unique.
3. On some code paths no check is performed, and the execution ends up in Cython land, where it fails (this is what I originally wanted to correct) (https://github.com/pydata/pandas/blob/master/pandas/indexes/multi.py#L1686, https://github.com/pydata/pandas/blob/master/pandas/index.pyx#L137)
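To make the first two behaviours concrete, here is a simplified pure-Python model. It is not the pandas implementation: the index is modelled as a list of tuples, and `lexsort_depth`, `locate`, the `strict` flag and the local `PerformanceWarning` class are all stand-ins invented for this sketch.

```python
import bisect
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for the pandas warning of the same name."""


def lexsort_depth(keys):
    """Number of leading levels on which `keys` is lexicographically sorted."""
    if not keys:
        return 0
    for depth in range(len(keys[0]), 0, -1):
        prefixes = [k[:depth] for k in keys]
        if all(a <= b for a, b in zip(prefixes, prefixes[1:])):
            return depth
    return 0


def locate(keys, key, strict=True):
    """Return the positions whose leading levels equal the tuple `key`."""
    n = len(key)
    depth = lexsort_depth(keys)
    if depth >= n:
        # Sorted deeply enough: binary search on the key prefixes.
        prefixes = [k[:n] for k in keys]
        lo = bisect.bisect_left(prefixes, key)
        hi = bisect.bisect_right(prefixes, key)
        return list(range(lo, hi))
    if strict:
        # Behaviour 1: refuse, with both depths in the message.
        raise KeyError(
            "key length ({}) exceeds lexsort depth ({})".format(n, depth)
        )
    # Behaviour 2: fall back to a linear scan and warn about the cost.
    warnings.warn("indexing past lexsort depth may impact performance.",
                  PerformanceWarning)
    return [i for i, k in enumerate(keys) if k[:n] == key]


sorted_keys = [("a", 1), ("a", 2), ("b", 1)]
unsorted_keys = [("b", 1), ("a", 2), ("a", 1)]

print(lexsort_depth(sorted_keys))    # 2
print(lexsort_depth(unsorted_keys))  # 0
print(locate(sorted_keys, ("a",)))   # [0, 1]
```

The third behaviour corresponds to skipping the depth check entirely and running the binary search on unsorted data, which is why the failure surfaces later and without a useful message.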

My current implementation simply enforces the first, stricter rule, but this means that code that previously ran fine (with or without a PerformanceWarning) now raises a KeyError, which is not really nice. (This is why some of the other tests failed.)

A bit of guidance would be much appreciated on which solution you prefer, especially if you want me to add the linear-search fallback to the currently failing code paths as well.

As an additional note, is it too far-fetched to suggest that the index should be automatically sorted if lexsort_depth < len(key) instead of raising these errors? I think paying attention to all this is pretty annoying from an end-user perspective without any real benefits - if I want to select, I need to sort anyway, so why bother? I assume the reason for not doing this is to avoid implicitly changing the data structures without the user noticing it, right?

jreback · 2016-04-04T21:49:47Z

so we label your cases 1-3.

1. ideally change this to NonLexsortedMultiIndexError rather than KeyError. It should pass existing tests, though once you change it you need to change the tests to look specifically for this error.
2. PerformanceWarning. Leave this for now; a partially lexsorted index is not easy to create and normally won't occur, so it's a minor issue IMHO.
3. yes, it would be nice to catch this case with a NonLexsortedMultiIndexError.

The reason for the lexsortedness requirement is that sorting might re-order something which you may not want re-ordered, so it is left up to the user. The basic problem: if someone is doing something which removes the sortedness, and doing it iteratively (of course this is bad), then forcing a sort EACH time is really bad.

So it's kind of a: I will do the best I can, showing a warning when needed, but if perf matters AND you are doing iterative indexing (which you shouldn't be anyhow), then it is up to the user to make it sorted.

If you have a good usecase that doesn't fall into this pattern lmk.

BTW, this is basically the same issue of "when to recompute the levels".

jreback · 2016-04-04T21:51:04Z

btw, try to do this w/o changing any tests (except for trivially as in 1) and see how far you can get. We want to avoid API changes if at all possible.

adamdivak · 2016-04-04T22:10:01Z

Ok, I'll introduce the new exception NonLexsortedMultiIndexError. However, I am afraid that fixing 3) means that the same tests that were failing in the last build will still be failing, as those are the cases where the PerformanceWarning was shown previously, so this would be a partially backwards-incompatible change.

As for making autosort the default: maybe it could be an option, which could even be used in an option context to disable the automatic sorting behaviour. If you are doing something iteratively where performance matters and sorting hurts, you can disable it as an optimisation. Everyone else, especially novices, is spared having to understand an unnecessary exception. We, for example, work quite a lot with small datasets, where sorting would probably not be noticeable, and we frequently find ourselves getting these errors because we forgot to sort (for example after merging different datasets).

(Btw I am amazed by your response speed..)

jreback · 2016-04-04T22:11:46Z

@yosuah

ok, see what troubles you have on 3); yeah, I'd like to catch those cases if possible.

yeah pls create an issue for auto-sorting. It's not hard to do, just maybe not the default.

shoyer · 2016-05-05T14:59:59Z

A similar issue can also come up on normal indexes if they aren't sorted:
#13090

So maybe the generic UnsortedIndexError is the way to go?

jreback · 2016-05-05T15:27:19Z

yep, agree with @shoyer here: a generic UnsortedIndexError exception is the way to go, with an appropriate message (it will be back-compat as it will inherit from KeyError).

jreback · 2016-05-05T15:28:34Z

@yosuah want to tackle this? (the actual exception issue is #11897)

jreback · 2016-05-25T15:58:49Z

@yosuah want to rebase / update?

jreback · 2016-05-31T15:22:39Z

@yosuah want to rebase / update?

adamdivak · 2016-06-07T06:27:25Z

Yeah, sorry for the delay, will rebase and get back to it over the weekend.

jorisvandenbossche · 2016-08-21T14:45:50Z

@yosuah Do you have time to rebase and update this?

jreback · 2016-09-09T22:33:36Z

can you rebase / update?

jreback · 2016-11-16T22:15:16Z

pls reopen/comment if you can rebase and continue with this

BUG: loc raises inconsistent error on unsorted MultiIndex

2cdfa64

Closes GH12660

onesandzeroes reviewed Apr 4, 2016

jreback added the Error Reporting (Incorrect or improved errors from pandas) and MultiIndex labels Apr 4, 2016

jreback reviewed Apr 4, 2016

jreback mentioned this pull request May 5, 2016

Slicing on DatetimeIndex throws KeyError: [int] not found #13090

Open

jreback closed this Nov 16, 2016

jorisvandenbossche added the Closed PR label Nov 17, 2016
