BUG: Preserve key order when using loc on MultiIndex DataFrame #28933

nrebena · 2019-10-11T18:52:53Z

Description

closes #22797
As described in #22797, the key order given to loc for a MultiIndex DataFrame was not respected:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)

df.loc[(['b', 'a'], [2, 1]), :]

# Out (before this fix: rows come back in index order, not in the requested key order)
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

Proposed fix

The culprit was the use of an intersection of indexers in the loc function. I tried keeping the indexers sorted during the whole function (in the main loop), but performance was badly affected (by a factor of 3!).
As another solution, I tried to sort the result after the indexers were computed. That was already much better (worse "only" by a factor of 1.15 or so, see the asv benchmark results).
So I now compute a flag that tests whether the result needs to be sorted (the benchmarks seem to always use sorted keys in the loc call).

Update: The sorting function is now a separate private function (_reorder_indexer). It is called at the end of the get_locs function.
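The "sort the result after the indexers are computed" idea can be sketched with plain NumPy. This is only an illustration under stated assumptions, not the code from this PR: reorder_positions and its arguments are hypothetical names, and each index level is assumed to be represented by integer codes.

```python
import numpy as np


def reorder_positions(level_codes, positions, requested):
    """Reorder selected row positions so they follow the requested key order.

    level_codes: one 1-D int array per index level, giving each row's code
    positions:   1-D int array of selected row positions, in index order
    requested:   one 1-D int array per level, the requested codes in key order
    (All names are hypothetical; this is not the pandas implementation.)
    """
    sort_keys = []
    for codes, wanted in zip(level_codes, requested):
        # rank[c] is where code c appears in the requested key list
        rank = np.full(max(codes.max(), wanted.max()) + 1, -1)
        rank[wanted] = np.arange(len(wanted))
        sort_keys.append(rank[codes[positions]])
    # np.lexsort treats the LAST key as primary, so reverse the list
    # to make level 0 the primary sort key.
    order = np.lexsort(sort_keys[::-1])
    return positions[order]


# Index rows: (a, 1), (a, 2), (b, 1), (b, 2), with codes a=0, b=1 and 1=0, 2=1
level_codes = [np.array([0, 0, 1, 1]), np.array([0, 1, 0, 1])]
positions = np.arange(4)  # all rows selected, in index order
requested = [np.array([1, 0]), np.array([1, 0])]  # keys (['b', 'a'], [2, 1])

reordered = reorder_positions(level_codes, positions, requested)
print(reordered)  # positions of (b, 2), (b, 1), (a, 2), (a, 1)
```

Sorting once at the end leaves the indexer computation itself untouched, which is the trade-off described above.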

Benchmark

Benchmark with the flag (I ran asv compare with the -s option):

Benchmarks that have got worse:

       before           after         ratio
     [39602e7d]       [da8b55af]
     <master>         <multiindex_sort_loc_order_issue_22797>
     5.62±0.2μs       6.27±0.2μs     1.11  index_cached_properties.IndexCache.time_shape('Float64Index')
     6.57±0.2μs       7.49±0.2μs     1.14  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')

Benchmark without flag:

Benchmarks that have got worse:

       before           after         ratio
     [39602e7d]       [c786822a]
     <master>         <multiindex_sort_loc_order_issue_22797~1>
    2.49±0.02ms      2.87±0.01ms     1.15  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
       2.53±0ms      2.91±0.01ms     1.15  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
     29.2±0.7ms      33.1±0.02ms     1.13  frame_ctor.FromLists.time_frame_from_lists
       87.2±1ms         98.9±1ms     1.13  frame_ctor.FromRecords.time_frame_from_records_generator(None)
    12.8±0.09ms      14.3±0.09ms     1.11  groupby.MultiColumn.time_col_select_numpy_sum
     5.62±0.2μs       6.32±0.4μs     1.12  index_cached_properties.IndexCache.time_shape('Float64Index')
    4.96±0.02ms      5.71±0.01ms     1.15  indexing.MultiIndexing.time_index_slice
       2.91±0ms      3.29±0.01ms     1.13  inference.ToNumeric.time_from_numeric_str('coerce')
       2.92±0ms      3.29±0.01ms     1.13  inference.ToNumeric.time_from_numeric_str('ignore')
    3.45±0.01ms      3.84±0.01ms     1.11  series_methods.Map.time_map('lambda', 'object')
     29.3±0.2ms      33.2±0.04ms     1.13  strings.Methods.time_len

Checklist

closes loc() does not swap two rows in multi-index pandas dataframe #22797
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2019-10-12T17:06:23Z

pandas/core/indexes/multi.py

@@ -3095,6 +3105,32 @@ def _update_indexer(idxr, indexer=indexer):
        # empty indexer
        if indexer is None:
            return Int64Index([])._ndarray_values
+
+        # Generate tuples of keys by which to order the results


this is really complex and adding quite a bit of code. Please take another look to simplify greatly.

nrebena · 2019-10-15

Sure thing, will do. I did have a simpler solution, but the performance hit was really high.

jreback

can we just always sort if the level is already lexsorted, basically moving the additional code you added into _update_indexer?

nrebena · 2019-11-09T19:39:39Z

This is more or less what I did in another PR, #29190.
There is a bit of a performance drop on some benchmarks.

To summarize the two solutions I made PRs for:

In this PR: I do not modify the way the indexer is computed, and I sort everything at the end.
In PR #29190 (BUG: Preserve key order when using loc on MultiIndex DataFrame): I keep the indexer sorted at all times, with a bit of a performance drop on some benchmarks and an improvement on others.

jreback · 2019-11-16T22:11:09Z

pandas/core/indexes/multi.py

@@ -3095,6 +3103,31 @@ def _update_indexer(idxr, indexer=indexer):
        # empty indexer
        if indexer is None:
            return Int64Index([])._ndarray_values
+
+        # Generate tuples of keys by which to order the results
+        if need_sort:


can you just check is_lexsorted?

nrebena · 2019-11-17

This is not the same thing. The index may or may not be lexsorted, but what I want to know here is whether the keys given to .loc are in the same order as the index (see line 3058). If they are not, I reorder the result in indexer so that it reflects the order of the given keys.

TomAugspurger

Can you add a whatsnew in 1.0.0.rst?

TomAugspurger · 2019-11-18T17:13:04Z

pandas/core/indexes/multi.py

+                if not need_sort:
+                    k_codes = self.levels[i].get_indexer(k)
+                    k_codes = k_codes[k_codes >= 0]  # Filter absent keys
+                    need_sort = not (k_codes[:-1] < k_codes[1:]).all()


I don't recall: does NumPy short-circuit any? If so, it may be faster to change the comparison to >=, the all to an any, and remove the not (haven't tested).

nrebena · 2019-11-18

According to this discussion https://stackoverflow.com/questions/45771554/why-numpy-any-has-no-short-circuit-mechanism#45773662, neither all nor any short-circuit anymore.
I will change it anyway, as it improves readability imo.
It also points out an error I made: it should have been <= here, as two consecutive equal elements are still sorted. Nice catch.
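The difference between the two comparisons can be seen on a small array. This is a minimal sketch; needs_sort and needs_sort_strict are hypothetical helper names wrapping the two expressions discussed in this thread.

```python
import numpy as np


def needs_sort(k_codes):
    """True only if some code is strictly smaller than its predecessor."""
    return bool((k_codes[:-1] > k_codes[1:]).any())


def needs_sort_strict(k_codes):
    """Earlier form: also flags repeated codes, which are in fact ordered."""
    return not bool((k_codes[:-1] < k_codes[1:]).all())


ordered_with_repeat = np.array([0, 1, 1, 2])
out_of_order = np.array([1, 0])

# The repeated 1 is still sorted, but the strict form asks for a sort anyway
print(needs_sort(ordered_with_repeat), needs_sort_strict(ordered_with_repeat))
# Both forms agree when the codes are really out of order
print(needs_sort(out_of_order), needs_sort_strict(out_of_order))
```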

jreback · 2019-11-20T13:34:03Z

pandas/core/indexes/multi.py

+        from typing import Tuple
+
+        n = len(self)
+        keys = tuple()  # type: Tuple[np.ndarray, ...]


you can use the py36 form here

nrebena · 2019-11-20

Do you mean no type hint?
I added this when mypy complained with the following:
pandas/core/indexes/multi.py:3130: error: Need type annotation for 'keys'
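For reference, "the py36 form" presumably means the variable annotation syntax introduced in Python 3.6, which keeps the type hint mypy asks for. A minimal sketch comparing the two spellings (keys_old is a name introduced here only to show both side by side):

```python
from typing import Tuple

import numpy as np

# Comment-style annotation, as in the diff under review
keys_old = tuple()  # type: Tuple[np.ndarray, ...]

# Python 3.6+ variable annotation carrying the same information
keys: Tuple[np.ndarray, ...] = tuple()
```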

nrebena · 2019-11-20T22:36:18Z

When adding test cases, I came upon the following question.
What should the output be in this case?

df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)

df.loc[(slice(None), [2, 1]), :]
# actual output
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

# maybe expected
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
b 2     9  10       11
a 1     0   1        2
b 1     6   7        8

# or
# maybe expected
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
  1     0   1        2
b 2     9  10       11
  1     6   7        8

jreback · 2019-12-27T19:42:48Z

this got a bit lost, can you merge master and ping on green.

jreback

can you merge master.

i think it might be worthwhile to remove need_sort entirely (from the main code) and just always call _reorder_indexer where you can determine the need_sort (and just bail early if you don't need it).

right?

then

jreback · 2020-01-01T21:08:26Z

pandas/tests/indexes/multi/test_indexing.py

+
+
+def test_multiindex_loc_order():
+    # GH 22797


if you think it's easy to parameterize, this would be great as well.

nrebena · 2020-01-03T20:50:10Z

> i think it might be worthwhile to remove need_sort entirely (from the main code) and just always call _reorder_indexer where you can determine the need_sort (and just bail early if you don't need it).

This makes the code more readable, thanks for the advice.

I merged master too.

I also added testing for columns selection.
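The "always call the helper and bail early" structure could look roughly like this for a single level. It is a sketch under stated assumptions, not the merged _reorder_indexer: reorder_level and its parameters are hypothetical, and the index is assumed to be lexsorted so that the incoming positions are already ordered by code.

```python
import numpy as np


def reorder_level(row_codes, positions, requested_codes):
    """Hypothetical one-level helper: reorder `positions` to follow the key order.

    Assumes the index is lexsorted, so `positions` are already ordered by code.
    """
    # Bail out early: ascending requested codes already match the index order
    if not (requested_codes[:-1] > requested_codes[1:]).any():
        return positions
    # rank[c] is where code c appears in the requested key list
    rank = np.full(max(row_codes.max(), requested_codes.max()) + 1, -1)
    rank[requested_codes] = np.arange(len(requested_codes))
    # A stable sort keeps the index order among rows that share a key
    order = np.argsort(rank[row_codes[positions]], kind="stable")
    return positions[order]


row_codes = np.array([0, 0, 1, 1])  # rows a, a, b, b
positions = np.arange(4)

unchanged = reorder_level(row_codes, positions, np.array([0, 1]))  # keys ['a', 'b']
swapped = reorder_level(row_codes, positions, np.array([1, 0]))    # keys ['b', 'a']
print(unchanged, swapped)
```

With the check inside the helper, the caller no longer needs to carry a need_sort flag through its main loop.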

jreback · 2020-01-18T19:30:51Z

pandas/tests/test_multilevel.py

+        exp_index = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [1, 2, 1, 2]])
+        tm.assert_index_equal(res.index, exp_index)
+
+        res = df.loc[(["a", "b"], [1, 2]), :]


can you parameterize this, create additional tests for things that don't fit nicely

nrebena · 2020-01-19

Done, but I fear I may have over-parametrized…
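For readers unfamiliar with the request, a parametrized version of such a test might be sketched as follows. The test name and the chosen cases are illustrative rather than the ones merged in this PR, and the expected orders assume a pandas release that contains this fix (1.1 or later).

```python
import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "keys, expected",
    [
        ((["b", "a"], [2, 1]), [("b", 2), ("b", 1), ("a", 2), ("a", 1)]),
        ((["a", "b"], [1, 2]), [("a", 1), ("a", 2), ("b", 1), ("b", 2)]),
        ((["b", "a"], [1, 2]), [("b", 1), ("b", 2), ("a", 1), ("a", 2)]),
    ],
)
def test_loc_key_order(keys, expected):
    # Assumes pandas >= 1.1, i.e. a release that includes this fix
    df = pd.DataFrame(
        np.arange(12).reshape((4, 3)),
        index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    )
    res = df.loc[keys, :]
    # The rows should come back in the order of the requested keys
    assert list(res.index) == expected
```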

jreback · 2020-01-18T19:34:55Z

pandas/core/indexes/multi.py

+                    # True if the given codes are not ordered
+                    need_sort = (k_codes[:-1] > k_codes[1:]).any()
+        # Bail out if no need to sort
+        # This is only true for a lexsorted index


can't we just test is_lexsorted?

isn't that sufficient here?

nrebena · 2020-01-19

The property we need to test is "do the index and the selection share the same order".
If we have

df = pd.DataFrame(np.arange(2), index=[["b", "a"]])
df
   0
b  0
a  1

then the index is not lexsorted, and we do not need to sort when doing
df.loc[['b', 'a']], but we do need to sort to get df.loc[['a', 'b']] in the right order.

But there is room for improvement here, and I will propose something.

nrebena · 2020-01-19

So I made a minor modification, but if the index is lexsorted, we still need to check whether the requested order is also sorted before bailing out.

Another possibility is to sort in every case, if the performance hit is acceptable.
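The distinction between "the index is sorted" and "the keys follow the index order" can be illustrated on a flat index. This is a simplified sketch: keys_follow_index_order is a hypothetical helper, and a plain pd.Index stands in for the row order of the example above.

```python
import pandas as pd


def keys_follow_index_order(index, keys):
    """True if `keys` appear in the same relative order as in `index`."""
    codes = index.get_indexer(keys)
    codes = codes[codes >= 0]  # ignore keys that are absent from the index
    return not (codes[:-1] > codes[1:]).any()


index = pd.Index(["b", "a"])

print(index.is_monotonic_increasing)                # the index itself is not sorted
print(keys_follow_index_order(index, ["b", "a"]))   # same order: no reordering needed
print(keys_follow_index_order(index, ["a", "b"]))   # different order: reordering needed
```

So whether the index is sorted does not by itself decide whether a reordering is needed; the requested keys have to be compared against it.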

jreback

looks good. if you can update the whatsnew with an example and rebase, we can merge this in. sorry this took a while, this is tricky code you are fixing.

jreback · 2020-01-26T01:32:27Z

doc/source/whatsnew/v1.1.0.rst

@@ -114,7 +114,7 @@ Missing

 MultiIndex
 ^^^^^^^^^^
-
+- Bug in :meth:`DataFrame.loc` when used with a :class:`MultiIndex`. The returned values were not in the same order as the given inputs (:issue:`22797`)


can you make a mini-example (separate sub-section). I think this is very hard to visualize from the text, but the basic example we are using is very clear.

nrebena · 2020-01-26

Sure thing.
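A mini-example along the lines requested might look like the following. It is a sketch of the behaviour after this fix, assuming pandas 1.1 or later; the small single-column frame is chosen here only for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4), index=[["a", "a", "b", "b"], [1, 2, 1, 2]])

# With the fix, rows are ordered as the requested keys: b before a, 2 before 1
result = df.loc[(["b", "a"], [2, 1]), :]
print(result)
```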

jreback · 2020-02-02T19:41:05Z

@nrebena sorry, if you'd merge master again. ping on green.

nrebena · 2020-02-02T22:03:14Z

@jreback Merged and green

jreback · 2020-02-02T22:21:14Z

thanks @nrebena very nice! sorry took a long time :>

nrebena · 2020-02-02T22:22:53Z

@jreback It was a pleasure, I learned a lot. Thanks for the guidance.

jreback requested changes Oct 12, 2019

jreback added the Indexing and MultiIndex labels Oct 12, 2019

nrebena mentioned this pull request Oct 23, 2019

BUG: Preserve key order when using loc on MultiIndex DataFrame #29190

Closed

5 tasks

jreback requested changes Nov 2, 2019

jreback requested changes Nov 16, 2019

jreback requested changes Nov 18, 2019

TomAugspurger reviewed Nov 18, 2019

jreback requested changes Nov 20, 2019

jreback requested changes Jan 1, 2020

WillAyd requested changes Jan 16, 2020

jreback requested changes Jan 18, 2020

jreback added this to the 1.1 milestone Jan 26, 2020

jreback requested changes Jan 26, 2020

nrebena force-pushed the multiindex_sort_loc_order_issue_22797 branch 3 times, most recently from 8251ada to 87a4afd Compare January 27, 2020 20:21

nrebena added 8 commits February 2, 2020 22:27

2753c79  TST: Test for issue pandas-dev#22797
         Testing return order of MultiIndex.loc. MultiIndex.loc try to return the result in the same order as the key given.

af5c678  BUG: sort MultiIndex DataFrame loc result
         From issue pandas-dev#22797. When given a list like object as indexer, the returned result did not respect the order of the indexer, but the order of the MultiIndex levels.

dd53a91  PERF: Skip sort of MultiIndex DataFrame loc result if not needed
         Test if the result of the loc function need to be sorted to return them in the same order as the indexer. If not, skip the sort to improve performance.

97b952d  CLN: Some code simplification

3fa3c6d  CLN: Move code into separate function
         Move code from get_locs to _reorder_indexer. Better use of get_indexer to get level_code location.

8b5ec48  CLN: More typing and linting

fb33627  CLN: Improve readability and doc

2c6195f  DOC: Add a whatsnew

nrebena added 12 commits February 2, 2020 22:29

d911110  CLN: More typing for _reorder_indexer

81edea5  TST: Add more test cases to test_multiindex_loc_order

f1407f1  TST: move test_multiindex_loc_order to tests/test_multilevel.py

4c667a7  CLN: Move need_sort test in _reorder_indexer

ee89a33  TST: Test also for columns

c18d60d  FIX: Test need sort only work on lexsorted indexes

1717a14  Minor change in how to determined if sorting is needed

2ab8e30  PERF: Delete flag for sorting multiindex loc call.

edde717  TST: Parametrize tests

82e5109  DOC: Move whatsnew entry from v1.0.0 to v1.1.0

3367109  Revert "PERF: Delete flag for sorting multiindex loc call."
         This reverts commit 7fee53c.

025d304  DOC: Add mini example to whatsnew

nrebena force-pushed the multiindex_sort_loc_order_issue_22797 branch from 87a4afd to 025d304 Compare February 2, 2020 21:30

jreback approved these changes Feb 2, 2020

jreback merged commit df5572c into pandas-dev:master Feb 2, 2020

nrebena deleted the multiindex_sort_loc_order_issue_22797 branch March 16, 2020 19:31
