Confusion around sortedness with MultiIndex #10651

tangobravo · 2015-07-22T08:35:40Z

I find the docs around MultiIndex slicers to be quite confusing. They imply the MultiIndex needs to be lexsorted and introduce the sortlevel() function, but then add a caveat that this doesn't actually ensure sortedness.

There are some more details of my explorations and questions on StackOverflow:
http://stackoverflow.com/questions/31427466/ensuring-lexicographical-sort-in-pandas-multiindex

I'd like either a simple one-liner to ensure lexsortedness, more reassurance that the usual ways to create a MultiIndex will lexsort the labels in each level, or some more elaboration in the docs about exactly what the issues will be with indexing and slicing if the labels are not lexsorted.

Does my example show a bug in is_lexsorted too? I would expect sorted2.is_lexsorted() to be false here, as 'col1' is not lexsorted.

In [8]:
sorted2 = df3.sortlevel()
sorted2

Out[8]: 
            data
col1 col2       
b    1     three
     3       one
d    1       two
a    2      four

In [9]: sorted2.index.is_lexsorted()
Out[9]: True

In [10]: sorted2.index
Out[10]: 
MultiIndex(levels=[[u'b', u'd', u'a'], [1, 2, 3]],
           labels=[[0, 0, 1, 2], [0, 2, 0, 1]],
           names=[u'col1', u'col2'])
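
To make the question concrete, here is a minimal pure-Python sketch of the index shown in Out[10]. It is not pandas code: it assumes that is_lexsorted() compares the integer labels rather than the level values they point to, and the helper names rows_as_codes, rows_as_values and is_sorted are hypothetical.

```python
# Levels and labels copied from Out[10] above.
levels = [["b", "d", "a"], [1, 2, 3]]
labels = [[0, 0, 1, 2], [0, 2, 0, 1]]

def rows_as_codes(labels):
    """One tuple of integer labels per row."""
    return list(zip(*labels))

def rows_as_values(levels, labels):
    """One tuple of decoded level values per row."""
    return [tuple(level[code] for level, code in zip(levels, row))
            for row in zip(*labels)]

def is_sorted(rows):
    """True if every row compares <= the row after it."""
    return all(a <= b for a, b in zip(rows, rows[1:]))

codes = rows_as_codes(labels)
values = rows_as_values(levels, labels)

print(codes)              # [(0, 0), (0, 2), (1, 0), (2, 1)]
print(values)             # [('b', 1), ('b', 3), ('d', 1), ('a', 2)]
print(is_sorted(codes))   # True
print(is_sorted(values))  # False
```

Under that assumption the True in Out[9] is consistent: the labels are in order even though the values they decode to are not.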

jreback · 2015-07-22T12:28:26Z

You are actually changing something else here. This is why using inplace is ALWAYS a bad idea. It works correctly, but the user is confused.

In [16]: df2.index.is_lexsorted()
Out[16]: False

In [18]: df3 = df2.sortlevel()      

In [19]: df3.index.is_lexsorted()
Out[19]: True

In [20]: df2.index.is_lexsorted()
Out[20]: False

# make sure that we are copying the index for demonstration
In [21]: df4 = df2.copy(deep=True)

In [22]: df4.index.is_lexsorted()
Out[22]: False

# the correct way to set things
In [23]: df4.index = df4.index.set_levels(['b','d','a'], level='col1')

In [24]: df4.index.is_lexsorted()
Out[24]: False

In [25]: df4.index = df4.index.set_labels([0,1,0,2], level='col1')

In [26]: df4.index.is_lexsorted()
Out[26]: False

jreback · 2015-07-22T12:29:14Z

.sortlevel will create a new frame and guarantee sortedness. Note that this is not a very new method and has been around quite a while.

jreback · 2015-07-22T12:30:18Z

#9212 might be enlightening for you

tangobravo · 2015-07-22T12:58:16Z

@jreback Thanks for the replies, but I'm still confused. Even using your example code to avoid inplace when changing the index, for me sortlevel returns a dataframe that is not lexsorted. It returns an index where the labels are sorted in that level (i.e. [0,0,1,2]), but the "levels" are not rearranged and are not lexicographically sorted (still ['b','d','a']).

This appears to be the behaviour described in this part of the docs:
http://pandas.pydata.org/pandas-docs/stable/advanced.html#the-need-for-sortedness-with-multiindex
"There is an important new method sortlevel to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!"

Which contradicts the description in the sortlevel reference, which states "Data will be lexicographically sorted by the chosen level followed by the other levels (in order)":
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sortlevel.html?highlight=sortlevel#pandas.DataFrame.sortlevel

jreback · 2015-07-22T13:24:57Z

Please show a complete example. I am not sure what you are referring to.

tangobravo · 2015-07-22T13:43:29Z

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'col1':['b','a','a'], 'col2':[1,3,1], 'col3':[1,2,3]})

In [3]: df
Out[3]: 
  col1  col2  col3
0    b     1     1
1    a     3     2
2    a     1     3

In [4]: df2 = df.set_index(['col1','col2'])

In [5]: df2
Out[5]: 
           col3
col1 col2      
b    1        1
a    3        2
     1        3

In [6]: df3 = df2.sortlevel()

In [7]: df3
Out[7]: 
           col3
col1 col2      
a    1        3
     3        2
b    1        1

In [8]: df3.index.is_lexsorted()
Out[8]: True

# This is all as expected so far

In [9]: df3.index
Out[9]: 
MultiIndex(levels=[[u'a', u'b'], [1, 3]],
           labels=[[0, 0, 1], [0, 1, 0]],
           names=[u'col1', u'col2'])

# Reorder the elements in the levels array, but reset labels too so data is the same

In [10]: df3.index = df3.index.set_levels(['b','a'], level='col1')

In [11]: df3.index = df3.index.set_labels([1,1,0], level='col1')

In [12]: df3
Out[12]: 
           col3
col1 col2      
a    1        3
     3        2
b    1        1

# Data looks the same as Out[7], and hence looks lexsorted, but

In [13]: df3.index.is_lexsorted()
Out[13]: False

In [14]: df4 = df3.sortlevel()

In [15]: df4
Out[15]: 
           col3
col1 col2      
b    1        1
a    1        3
     3        2

# This doesn't look lexsorted any more, but...

In [16]: df4.index.is_lexsorted()
Out[16]: True

jreback · 2015-07-22T14:28:44Z

So I don't think .set_levels/.set_labels reset the sortedness indicator (e.g. it needs to be reset to None so that it's recomputed). This is the same bug as in #9212.
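
A rough sketch of the kind of problem described here, using a hypothetical ToyIndex class rather than pandas' actual implementation: a lazily computed sortedness flag goes stale if a setter carries it over instead of resetting it to None. The reset_cache parameter exists only for this illustration.

```python
class ToyIndex:
    """Hypothetical stand-in for an index that caches a sortedness flag."""

    def __init__(self, labels):
        self._labels = list(labels)
        self._sorted_flag = None  # None means "not computed yet"

    def is_sorted(self):
        # Compute the flag on first use, then reuse the cached value.
        if self._sorted_flag is None:
            pairs = zip(self._labels, self._labels[1:])
            self._sorted_flag = all(a <= b for a, b in pairs)
        return self._sorted_flag

    def set_labels(self, labels, reset_cache=True):
        """Return a new index; optionally carry the old cached flag over."""
        new = ToyIndex(labels)
        if not reset_cache:
            new._sorted_flag = self._sorted_flag  # stale value leaks through
        return new

idx = ToyIndex([0, 0, 1])
idx.is_sorted()  # computes and caches True

stale = idx.set_labels([1, 1, 0], reset_cache=False)
fresh = idx.set_labels([1, 1, 0], reset_cache=True)
print(stale.is_sorted())  # True, although the labels are no longer sorted
print(fresh.is_sorted())  # False, recomputed from the new labels
```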

tangobravo · 2015-07-27T10:10:39Z

I don't think this is just about resetting indicators.

Looking at the implementation of sortlevel(), it doesn't consider the "levels" array of the MultiIndex at all. It just copies that array across to the new index (https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L4833) and ensures the sort is by the "labels" array. That explains the behaviour I'm seeing, and the comment in the general docs stating that it won't ensure the "labels" are sorted. It also suggests that the sortlevel() reference, which says it will sort lexicographically, is incorrect.
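
The behaviour described above can be modelled in a few lines of plain Python. This is only a sketch of my reading, not the pandas source: sort_by_labels mimics "sort by labels, copy levels across", and normalize_levels is a hypothetical helper that sorts each level's values and remaps the labels so that label order and value order agree.

```python
def sort_by_labels(levels, labels):
    """Reorder rows by their integer labels; levels are copied unchanged."""
    rows = sorted(zip(*labels))
    return levels, [list(col) for col in zip(*rows)]

def normalize_levels(levels, labels):
    """Hypothetical helper: sort each level and remap its labels to match."""
    new_levels, new_labels = [], []
    for level, codes in zip(levels, labels):
        ordered = sorted(level)
        # Map each old label to the position of its value in the sorted level.
        remap = {old: ordered.index(value) for old, value in enumerate(level)}
        new_levels.append(ordered)
        new_labels.append([remap[c] for c in codes])
    return new_levels, new_labels

def decode(levels, labels):
    """Turn labels back into one tuple of level values per row."""
    return [tuple(level[c] for level, c in zip(levels, row))
            for row in zip(*labels)]

# State of df3.index after In [11] in the example above.
levels = [["b", "a"], [1, 3]]
labels = [[1, 1, 0], [0, 1, 0]]

by_labels = decode(*sort_by_labels(levels, labels))
by_values = decode(*sort_by_labels(*normalize_levels(levels, labels)))
print(by_labels)  # [('b', 1), ('a', 1), ('a', 3)]
print(by_values)  # [('a', 1), ('a', 3), ('b', 1)]
```

The first result matches the row order in Out[15]; the second is the order I would expect from a truly lexicographic sort.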

jreback added MultiIndex Usage Question labels Jul 22, 2015

jreback closed this as completed Jul 22, 2015

zym1010 mentioned this issue Jun 12, 2016 in "BUG: different behaviors of sort_index() and sort_index(level=0)" #13431 (closed)
