Skip to content

BUG: Preserve key order when using loc on MultiIndex DataFrame #28933

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

nrebena
Copy link
Contributor

@nrebena nrebena commented Oct 11, 2019

Description

closes #22797
As described in #22797, the key order given to loc for a MultiIndex DataFrame was not respected:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'],
    ['Green', 'Red', 'Green']])

df.loc[(['b','a'],[2, 1]),:]

# Out
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

Proposed fix

The culprit was the use of intersection of indexers in the loc function. I tried keeping the indexers sorted during the whole function (in the main loop), but performance were really affected (by a factor 3!!!).
As an other solution, I tried to sort the result after the indexers were computed. It was already way better (worse "only" by a factor 1.15 or so, see the asv benchmark result).
So I computed and add a flag testing if the result need to be sorted (the benchmark seems to always have sorted key in the loc call).

Update The sorting function is now a separate private function (_reorder_indexer). It is called at the end of the get_locs function.

Benchmark

Benchmark with the flag (I run asv compare with -s option):

Benchmarks that have got worse:
   before           after         ratio
 [39602e7d]       [da8b55af]
 <master>         <multiindex_sort_loc_order_issue_22797>
  •  5.62±0.2μs       6.27±0.2μs     1.11  index_cached_properties.IndexCache.time_shape('Float64Index')
    
  •  6.57±0.2μs       7.49±0.2μs     1.14  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
    

Benchmark without flag:

Benchmarks that have got worse:
   before           after         ratio
 [39602e7d]       [c786822a]
 <master>         <multiindex_sort_loc_order_issue_22797~1>
  • 2.49±0.02ms      2.87±0.01ms     1.15  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
    
  •    2.53±0ms      2.91±0.01ms     1.15  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
    
  •  29.2±0.7ms      33.1±0.02ms     1.13  frame_ctor.FromLists.time_frame_from_lists
    
  •    87.2±1ms         98.9±1ms     1.13  frame_ctor.FromRecords.time_frame_from_records_generator(None)
    
  • 12.8±0.09ms      14.3±0.09ms     1.11  groupby.MultiColumn.time_col_select_numpy_sum
    
  •  5.62±0.2μs       6.32±0.4μs     1.12  index_cached_properties.IndexCache.time_shape('Float64Index')
    
  • 4.96±0.02ms      5.71±0.01ms     1.15  indexing.MultiIndexing.time_index_slice
    
  •    2.91±0ms      3.29±0.01ms     1.13  inference.ToNumeric.time_from_numeric_str('coerce')
    
  •    2.92±0ms      3.29±0.01ms     1.13  inference.ToNumeric.time_from_numeric_str('ignore')
    
  • 3.45±0.01ms      3.84±0.01ms     1.11  series_methods.Map.time_map('lambda', 'object')
    
  •  29.3±0.2ms      33.2±0.04ms     1.13  strings.Methods.time_len
    

Checklist

@@ -3095,6 +3105,32 @@ def _update_indexer(idxr, indexer=indexer):
# empty indexer
if indexer is None:
return Int64Index([])._ndarray_values

# Generate tuples of keys by which to order the results
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is really complex and adding quite a bit of code. Please take another look to simplify greatly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, will do. I did have a simpler solution, but the performance hit was really high.

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Oct 12, 2019
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just always sort if the the level is already lexsorted, bascially putting your additional codes into that you added into _udpate_indexer?

@nrebena
Copy link
Contributor Author

nrebena commented Nov 9, 2019

This is kind of what I did in another PR #29190.
There is a bit of a performance drop on some benchmarking test.

To resume the two solution I made a PR for:

@@ -3095,6 +3103,31 @@ def _update_indexer(idxr, indexer=indexer):
# empty indexer
if indexer is None:
return Int64Index([])._ndarray_values

# Generate tuples of keys by which to order the results
if need_sort:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just check is_lexsorted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not the same thing. The index may or may not be lexsorted, but what I want to know here is if the keys given to .loc are in the same order has the index (see line 3058), and if not, I reorder the result in indexer to have them in a order reflecting the given keys order.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a whatsnew in 1.0.0.rst?

if not need_sort:
k_codes = self.levels[i].get_indexer(k)
k_codes = k_codes[k_codes >= 0] # Filter absent keys
need_sort = not (k_codes[:-1] < k_codes[1:]).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't recall: does NumPy short-circut any? If so it may be faster to change the comparision to >=, the all to an any and remove the not (haven't tested).

Copy link
Contributor Author

@nrebena nrebena Nov 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to this discussion https://stackoverflow.com/questions/45771554/why-numpy-any-has-no-short-circuit-mechanism#45773662, neither all nor any short-circuit anymore.
I will changed it however, as it improve readability imo.
Also, it point out an error I made, it should have been <= here, as two consecutive equals elements are still sorted. Nice catch.

from typing import Tuple

n = len(self)
keys = tuple() # type: Tuple[np.ndarray, ...]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the py36 form here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean no type hint?
I add this when mypy complained with the following
pandas/core/indexes/multi.py:3130: error: Need type annotation for 'keys'

@nrebena
Copy link
Contributor Author

nrebena commented Nov 20, 2019

When adding test cases, I come upon the following question.
What should be the ouput in this case

df = pd.DataFrame(
      np.arange(12).reshape((4, 3)),
      index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
      columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
      )

df.loc[(slice(None), [1,2]), :]                                                                                                
# actual output
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11

# maybe expected
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
b 2     9  10       11
a 1     0   1        2
b 1     6   7        8

# or
# maybe expected
     Ohio     Colorado
    Green Red    Green
a 2     3   4        5
  1     0   1        2
b 2     9  10       11
  1     6   7        8

@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

this got a bit lost, can you merge master and ping on green.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you merge master.

i think it might be worthile to remove need_sort entirely (from the main code) and just always call _reorder_indexer where you can determine the need_sort (and just bail early if you don't need it).

right?

then



def test_multiindex_loc_order():
# GH 22797
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you think its easy to parameterize this would be great as well.

@nrebena
Copy link
Contributor Author

nrebena commented Jan 3, 2020

i think it might be worthile to remove need_sort entirely (from the main code) and just always call _reorder_indexer where you can determine the need_sort (and just bail early if you don't need it).

This make the code more readable, thanks for the advice.

I merged master too.

I also added testing for columns selection.

exp_index = pd.MultiIndex.from_arrays([["a", "a", "b", "b"], [1, 2, 1, 2]])
tm.assert_index_equal(res.index, exp_index)

res = df.loc[(["a", "b"], [1, 2]), :]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you parameterize this, create additional tests for things that don't fit nicely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, but I fear i may have over parametrize…

# True if the given codes are not ordered
need_sort = (k_codes[:-1] > k_codes[1:]).any()
# Bail out if no need to sort
# This is only true for a lexsorted index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we just test is_lexsorted?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't that sufficient here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The property we need to test is "do the index and the selection share the same order".
If we have

df = pd.DataFrame(np.arange(2), index=[["b", "a"]]) 
df
   0
b  0
a  1

Then the index is not lexsorted, and we do not need to sort when doing
df.loc[['b', 'a']], but we need to sort for getting df.loc[['a', 'b']] in the right order.

But they here room for improvement in this place, and I will propose something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I made a minor modification, but is the index is lexsorted, we still need to check if the requested order is also sorted before bailing out.

A another possibility is to sort in every case, if the performance hit is acceptable.

@jreback jreback added this to the 1.1 milestone Jan 26, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. if you can update the whatsnew with an example and rebase can merge this in. sorry take a while, this is tricky code you are fixing.

@@ -114,7 +114,7 @@ Missing

MultiIndex
^^^^^^^^^^

- Bug in :meth:`Dataframe.loc` when used with a :class:`MultiIndex`. The returned values were not in the same order as the given inputs (:issue:`22797`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a mini-example (separate sub-section). I think this is very hard to visualize from the text, but the basic example we are using is very clear.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing.

@nrebena nrebena force-pushed the multiindex_sort_loc_order_issue_22797 branch 3 times, most recently from 8251ada to 87a4afd Compare January 27, 2020 20:21
@jreback
Copy link
Contributor

jreback commented Feb 2, 2020

@nrebena sorry, if you'd merge master again. ping on green.

Testing return order of MultiIndex.loc

MultiIndex.loc try to return the result in the same order as the key
given.
From issue pandas-dev#22797. When given a list like object as indexer, the
returned result did not respect the order of the indexer, but the order
of the MultiIndex levels.
Test if the result of the loc function need to be sorted to return them
in the same order as the indexer. If not, skip the sort to improve
performance.
Move code from get_locs to _reorder_indexer.
Better use of get_indexer to get level_code location.
@nrebena nrebena force-pushed the multiindex_sort_loc_order_issue_22797 branch from 87a4afd to 025d304 Compare February 2, 2020 21:30
@nrebena
Copy link
Contributor Author

nrebena commented Feb 2, 2020

@jreback Merged and green

@jreback jreback merged commit df5572c into pandas-dev:master Feb 2, 2020
@jreback
Copy link
Contributor

jreback commented Feb 2, 2020

thanks @nrebena very nice! sorry took a long time :>

@nrebena
Copy link
Contributor Author

nrebena commented Feb 2, 2020

@jreback It was a pleasure, I learned a lot. Thank for the guidance.

@nrebena nrebena deleted the multiindex_sort_loc_order_issue_22797 branch March 16, 2020 19:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

loc() does not swap two rows in multi-index pandas dataframe
4 participants