-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Preserve key order when using loc on MultiIndex DataFrame #28933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Preserve key order when using loc on MultiIndex DataFrame #28933
Conversation
pandas/core/indexes/multi.py
Outdated
@@ -3095,6 +3105,32 @@ def _update_indexer(idxr, indexer=indexer): | |||
# empty indexer | |||
if indexer is None: | |||
return Int64Index([])._ndarray_values | |||
|
|||
# Generate tuples of keys by which to order the results |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really complex and adding quite a bit of code. Please take another look to simplify greatly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing, will do. I did have a simpler solution, but the performance hit was really high.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we just always sort if the the level is already lexsorted, bascially putting your additional codes into that you added into _udpate_indexer?
This is kind of what I did in another PR #29190. To resume the two solution I made a PR for:
|
pandas/core/indexes/multi.py
Outdated
@@ -3095,6 +3103,31 @@ def _update_indexer(idxr, indexer=indexer): | |||
# empty indexer | |||
if indexer is None: | |||
return Int64Index([])._ndarray_values | |||
|
|||
# Generate tuples of keys by which to order the results | |||
if need_sort: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you just check is_lexsorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not the same thing. The index may or may not be lexsorted, but what I want to know here is if the keys given to .loc are in the same order has the index (see line 3058), and if not, I reorder the result in indexer to have them in a order reflecting the given keys order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a whatsnew in 1.0.0.rst?
pandas/core/indexes/multi.py
Outdated
if not need_sort: | ||
k_codes = self.levels[i].get_indexer(k) | ||
k_codes = k_codes[k_codes >= 0] # Filter absent keys | ||
need_sort = not (k_codes[:-1] < k_codes[1:]).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall: does NumPy short-circut any
? If so it may be faster to change the comparision to >=
, the all
to an any
and remove the not
(haven't tested).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to this discussion https://stackoverflow.com/questions/45771554/why-numpy-any-has-no-short-circuit-mechanism#45773662, neither all
nor any
short-circuit anymore.
I will changed it however, as it improve readability imo.
Also, it point out an error I made, it should have been <=
here, as two consecutive equals elements are still sorted. Nice catch.
pandas/core/indexes/multi.py
Outdated
from typing import Tuple | ||
|
||
n = len(self) | ||
keys = tuple() # type: Tuple[np.ndarray, ...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use the py36 form here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean no type hint?
I add this when mypy complained with the following
pandas/core/indexes/multi.py:3130: error: Need type annotation for 'keys'
When adding test cases, I come upon the following question.
|
this got a bit lost, can you merge master and ping on green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you merge master.
i think it might be worthile to remove need_sort entirely (from the main code) and just always call _reorder_indexer where you can determine the need_sort (and just bail early if you don't need it).
right?
then
|
||
|
||
def test_multiindex_loc_order(): | ||
# GH 22797 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you think its easy to parameterize this would be great as well.
This make the code more readable, thanks for the advice. I merged master too. I also added testing for columns selection. |
pandas/tests/test_multilevel.py
Outdated
exp_index = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [1, 2, 1, 2]]) | ||
tm.assert_index_equal(res.index, exp_index) | ||
|
||
res = df.loc[(["a", "b"], [1, 2]), :] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you parameterize this, create additional tests for things that don't fit nicely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but I fear i may have over parametrize…
pandas/core/indexes/multi.py
Outdated
# True if the given codes are not ordered | ||
need_sort = (k_codes[:-1] > k_codes[1:]).any() | ||
# Bail out if no need to sort | ||
# This is only true for a lexsorted index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we just test is_lexsorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't that sufficient here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The property we need to test is "do the index and the selection share the same order".
If we have
df = pd.DataFrame(np.arange(2), index=[["b", "a"]])
df
0
b 0
a 1
Then the index is not lexsorted, and we do not need to sort when doing
df.loc[['b', 'a']]
, but we need to sort for getting df.loc[['a', 'b']]
in the right order.
But they here room for improvement in this place, and I will propose something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I made a minor modification, but is the index is lexsorted, we still need to check if the requested order is also sorted before bailing out.
A another possibility is to sort in every case, if the performance hit is acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good. if you can update the whatsnew with an example and rebase can merge this in. sorry take a while, this is tricky code you are fixing.
@@ -114,7 +114,7 @@ Missing | |||
|
|||
MultiIndex | |||
^^^^^^^^^^ | |||
|
|||
- Bug in :meth:`Dataframe.loc` when used with a :class:`MultiIndex`. The returned values were not in the same order as the given inputs (:issue:`22797`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make a mini-example (separate sub-section). I think this is very hard to visualize from the text, but the basic example we are using is very clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing.
8251ada
to
87a4afd
Compare
@nrebena sorry, if you'd merge master again. ping on green. |
Testing return order of MultiIndex.loc MultiIndex.loc try to return the result in the same order as the key given.
From issue pandas-dev#22797. When given a list like object as indexer, the returned result did not respect the order of the indexer, but the order of the MultiIndex levels.
Test if the result of the loc function need to be sorted to return them in the same order as the indexer. If not, skip the sort to improve performance.
Move code from get_locs to _reorder_indexer. Better use of get_indexer to get level_code location.
This reverts commit 7fee53c.
87a4afd
to
025d304
Compare
@jreback Merged and green |
thanks @nrebena very nice! sorry took a long time :> |
@jreback It was a pleasure, I learned a lot. Thank for the guidance. |
Description
closes #22797
As described in #22797, the key order given to loc for a MultiIndex DataFrame was not respected:
Proposed fix
The culprit was the use of intersection of indexers in the loc function. I tried keeping the indexers sorted during the whole function (in the main loop), but performance were really affected (by a factor 3!!!).
As an other solution, I tried to sort the result after the indexers were computed. It was already way better (worse "only" by a factor 1.15 or so, see the asv benchmark result).
So I computed and add a flag testing if the result need to be sorted (the benchmark seems to always have sorted key in the loc call).
Update The sorting function is now a separate private function (_reorder_indexer). It is called at the end of the get_locs function.
Benchmark
Benchmark with the flag (I run asv compare with -s option):
Benchmark without flag:
Checklist
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff