-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: loc raises inconsistent error on unsorted MultiIndex #12790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: loc raises inconsistent error on unsorted MultiIndex #12790
Conversation
Closes GH12660
if isinstance(key, tuple): | ||
required_lexsort_depth = max(required_lexsort_depth, len(key)) | ||
if self.lexsort_depth < required_lexsort_depth: | ||
raise KeyError('MultiIndex Slicing requires the index to be ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this error message could be clearer, something like:
'MultiIndex slicing requires the index to be fully lexsorted up to level {required}, lexsort depth is currently {current}'
Just make it clearer which number is the required one and which one is the current.
I would like to change the actual error, ala, #11897 as well.
there are several cases which need to be changed. |
I agree with changing the message itself, the reason why I didn't do it is because I've seen people actually parsing that string to determine the cause of the error. I would also be happy to introduce the new exception class, let me know if you think it should be part of this PR or a separate thing. I'll go back to check why other parts of the tests failed. |
@@ -1595,6 +1595,8 @@ def get_loc_level(self, key, level=0, drop_level=True): | |||
---------- | |||
key : label or tuple | |||
level : int/level name or list thereof | |||
drop_level : bool | |||
drop a level from the index if only a single element is selected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this for?
I've started digging deeper into the failing tests and realised this is a bit more complicated than I expected, and my understanding of Pandas internals might be limiting here - any comments are welcome. What I can see now:
My current implementation simply enforces the first, stricter rule, but this means that code that previously ran fine (with or without a PerformanceWarning) now raises a KeyError, which is not really nice. (This is why some of the other tests failed.) A bit of a guidance would be much appreciated on which solution you prefer, especially if you want me to add the linear-search thing to the currently failing code paths as well. As an additional note, is it too far-fetched to suggest that the index should be automatically sorted if lexsort_depth < len(key) instead of raising these errors? I think paying attention to all this is pretty annoying from an end-user perspective without any real benefits - if I want to select, I need to sort anyway, so why bother? I assume the reason for not doing this is to avoid implicitly changing the data structures without the user noticing it, right? |
so we label your cases 1-3.
The reason for the lexsortedness is that you might have to potentially re-order something which you may not want to do. Thus it is up to the user. The basic problem is if someone is doing something which removes the sortedness, and doing this in iteratively (of course this is bad), then forcing a sort EACH time is really bad. So its kind of a: I will do the best I can showing a warning when needed, but if perf matters AND you are doing iterative indexing (which you shouldn't be anyhow), then it is up to the user to make it sorted. If you have a good usecase that doesn't fall into this pattern lmk. BTW, this is basically the same issue of "when to recompute the levels". |
btw, try to do this w/o changing any tests (except for trivially as in 1) and see how far you can get. We want to avoid API changes if at all possible. |
Ok, I'll introduce the new exception As for making autosort the default: maybe it could be an option, which could even be used in an option context, with which you could disable the automatic sorting behaviour - if you are doing something iteratively where performance matters and sorting hurts, you can disable it as an optimisation. For everyone else, especially novices, understanding an unnecessary exception was saved. We for example work quite a lot with small datasets, where sorting would probably not be noticeable, and we frequently find ourselves getting these errors due to forgetting sorting (for example after merging different datasets). (Btw I am amazed by your response speed..) |
@yosuah ok see what troubles you have on 3) yeah like to catch those cases if possible. yeah pls create an issue for auto-sorting. Its not hard to do, just maybe not the default. |
A similar issue can also come up on normal indexes if they aren't sorted: So maybe the generic |
yep agree with @shoyer here a generic |
@yosuah want to tackle this? (the actual exception issue is #11897) |
@yosuah want to rebase / update? |
@yosuah want to rebase / update |
Yeah, sorry for the delay, will rebase and get back to it over the weekend. |
@yosuah DO you have time to rebase and update this? |
can you rebase / update? |
pls reopen/comment if you can rebase and continue with this |
git diff upstream/master | flake8 --diff
.loc was fixed to always raise a KeyError with a helpful error message when called on an unsorted MultiIndex DataFrame
Tests ran fine the last time I checked, but if I run them with the latest upstream now I get a totally unrelated ImportError error - I assume it is not related to my changes. Btw this is my first real contribution to a large open source project, I tried to pay attention to everything but let me know if anything needs to be improved!