Confusion around sortedness with MultiIndex #10651

Closed
tangobravo opened this issue Jul 22, 2015 · 8 comments

@tangobravo

I find the docs around MultiIndex slicers quite confusing. They imply the MultiIndex needs to be lexsorted and introduce the sortlevel() function, but then add a caveat that this doesn't actually ensure sortedness.

There are some more details of my explorations and questions on StackOverflow:
http://stackoverflow.com/questions/31427466/ensuring-lexicographical-sort-in-pandas-multiindex

I'd like either a simple one-liner to ensure lexsortedness, more reassurance that the usual ways to create a MultiIndex will lexsort the labels in each level, or some more elaboration in the docs about exactly what the issues will be with indexing and slicing if the labels are not lexsorted.

Does my example show a bug in is_lexsorted too? I would expect sorted2.is_lexsorted() to be false here, as 'col1' is not lexsorted.

In [8]:
sorted2 = df3.sortlevel()
sorted2

Out[8]: 
            data
col1 col2       
b    1     three
     3       one
d    1       two
a    2      four

In [9]: sorted2.index.is_lexsorted()
Out[9]: True

In [10]: sorted2.index
Out[10]: 
MultiIndex(levels=[[u'b', u'd', u'a'], [1, 2, 3]],
           labels=[[0, 0, 1, 2], [0, 2, 0, 1]],
           names=[u'col1', u'col2'])
@jreback
Contributor

jreback commented Jul 22, 2015

You are actually changing something else here. This is why using inplace is ALWAYS a bad idea. It works correctly, but the user is confused.

In [16]: df2.index.is_lexsorted()
Out[16]: False

In [18]: df3 = df2.sortlevel()      

In [19]: df3.index.is_lexsorted()
Out[19]: True

In [20]: df2.index.is_lexsorted()
Out[20]: False

# make sure that we are copying the index for demonstration
In [21]: df4 = df2.copy(deep=True)

In [22]: df4.index.is_lexsorted()
Out[22]: False

# the correct way to set things
In [23]: df4.index = df4.index.set_levels(['b','d','a'], level='col1')

In [24]: df4.index.is_lexsorted()
Out[24]: False

In [25]: df4.index = df4.index.set_labels([0,1,0,2], level='col1')

In [26]: df4.index.is_lexsorted()
Out[26]: False

@jreback
Contributor

jreback commented Jul 22, 2015

.sortlevel will create a new frame and guarantee sortedness. Note that this is not a very new method and has been around quite a while.

@jreback
Contributor

jreback commented Jul 22, 2015

#9212 might be enlightening for you

@jreback jreback closed this as completed Jul 22, 2015
@tangobravo
Author

@jreback Thanks for the replies, but I'm still confused. Even using your example code to avoid inplace when changing the index, sortlevel returns a dataframe that is not lexsorted for me. It returns an index where the labels are sorted in that level (i.e. [0,0,1,2]), but the "levels" are not rearranged and are not lexicographically sorted (still ['b','d','a']).

This appears to be the behaviour described in this part of the docs:
http://pandas.pydata.org/pandas-docs/stable/advanced.html#the-need-for-sortedness-with-multiindex
"There is an important new method sortlevel to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!"

This contradicts the description in the sortlevel reference, which states "Data will be lexicographically sorted by the chosen level followed by the other levels (in order)":
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sortlevel.html?highlight=sortlevel#pandas.DataFrame.sortlevel
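
The two descriptions quoted above differ on whether "sorted" refers to the integer labels or to the values those labels point to. A minimal pure-Python sketch of that distinction follows. It is a toy model of a single level, not pandas code: `levels` and `labels` mirror the two arrays shown in the MultiIndex repr, and both helper names are hypothetical.

```python
# Toy model of ONE level of a MultiIndex (not pandas internals):
# `levels` holds the distinct values, `labels` holds one integer
# position into `levels` per row.
levels = ['b', 'd', 'a']
labels = [0, 0, 1, 2]

def labels_are_sorted(labels):
    """True when the integer labels are non-decreasing."""
    return all(x <= y for x, y in zip(labels, labels[1:]))

def values_are_sorted(levels, labels):
    """True when the values the labels point to are non-decreasing."""
    values = [levels[i] for i in labels]
    return all(x <= y for x, y in zip(values, values[1:]))

sorted_by_labels = labels_are_sorted(labels)
sorted_by_values = values_are_sorted(levels, labels)

print(sorted_by_labels)   # True
print(sorted_by_values)   # False
```

Under this reading, an index can pass a label-based sortedness check while its displayed values are out of alphabetical order.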

@jreback
Contributor

jreback commented Jul 22, 2015

Please show a complete example. I am not sure what you are referring to.

@tangobravo
Author

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'col1':['b','a','a'], 'col2':[1,3,1], 'col3':[1,2,3]})

In [3]: df
Out[3]: 
  col1  col2  col3
0    b     1     1
1    a     3     2
2    a     1     3

In [4]: df2 = df.set_index(['col1','col2'])

In [5]: df2
Out[5]: 
           col3
col1 col2      
b    1        1
a    3        2
     1        3

In [6]: df3 = df2.sortlevel()

In [7]: df3
Out[7]: 
           col3
col1 col2      
a    1        3
     3        2
b    1        1

In [8]: df3.index.is_lexsorted()
Out[8]: True

# This is all as expected so far

In [9]: df3.index
Out[9]: 
MultiIndex(levels=[[u'a', u'b'], [1, 3]],
           labels=[[0, 0, 1], [0, 1, 0]],
           names=[u'col1', u'col2'])

# Reorder the elements in the levels array, but reset labels too so data is the same

In [10]: df3.index = df3.index.set_levels(['b','a'], level='col1')

In [11]: df3.index = df3.index.set_labels([1,1,0], level='col1')

In [12]: df3
Out[12]: 
           col3
col1 col2      
a    1        3
     3        2
b    1        1

# Data looks the same as Out[7], and hence looks lexsorted, but

In [13]: df3.index.is_lexsorted()
Out[13]: False

In [14]: df4 = df3.sortlevel()

In [15]: df4
Out[15]: 
           col3
col1 col2      
b    1        1
a    1        3
     3        2

# This doesn't look lexsorted any more, but...

In [16]: df4.index.is_lexsorted()
Out[16]: True

@jreback
Contributor

jreback commented Jul 22, 2015

So I don't think .set_levels/.set_labels reset the sortedness indicator (e.g. it needs to be reset to None so it's recomputed). This is the same bug as in #9212.

@tangobravo
Author

I don't think this is just about resetting indicators.

Looking at the implementation of sortlevel(), it doesn't consider the "levels" array of the MultiIndex at all. It just copies that array across to the new index (https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L4833) and ensures the sort is by the "labels" array. That explains the behaviour I'm seeing, as well as the comment in the general docs that it won't ensure the "labels" are sorted lexicographically. It also suggests that the sortlevel() reference, which says the data will be sorted lexicographically, is incorrect.
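
The behaviour described here can be reproduced with a small pure-Python model of the two-level index built in In [10] and In [11]. This is only a sketch under stated assumptions: it does not call pandas, the helper names (`sort_rows_by_labels`, `normalize_level`, `row_values`) are hypothetical, and `normalize_level` is one possible remedy rather than what pandas itself does.

```python
# Toy model of the two-level index after In [10]-In [11]; not pandas internals.
levels = [['b', 'a'], [1, 3]]
labels = [[1, 1, 0], [0, 1, 0]]   # rows: (a, 1), (a, 3), (b, 1)

def sort_rows_by_labels(labels):
    """Row order obtained by sorting on the integer labels only."""
    rows = list(zip(*labels))
    return sorted(range(len(rows)), key=lambda r: rows[r])

def normalize_level(level_values, level_labels):
    """Sort one level's values and remap its labels to match."""
    new_values = sorted(level_values)
    position = {v: i for i, v in enumerate(new_values)}
    new_labels = [position[level_values[i]] for i in level_labels]
    return new_values, new_labels

def row_values(levels, labels, order):
    """Materialize the index tuples for the given row order."""
    return [tuple(lv[lb[r]] for lv, lb in zip(levels, labels)) for r in order]

# Sorting on labels alone, with levels copied unchanged, puts 'b' first,
# matching the row order shown in Out[15].
by_labels = row_values(levels, labels, sort_rows_by_labels(labels))
print(by_labels)   # [('b', 1), ('a', 1), ('a', 3)]

# Normalizing each level first makes label order agree with value order.
normalized = [normalize_level(v, l) for v, l in zip(levels, labels)]
norm_levels = [v for v, _ in normalized]
norm_labels = [l for _, l in normalized]
by_values = row_values(norm_levels, norm_labels, sort_rows_by_labels(norm_labels))
print(by_values)   # [('a', 1), ('a', 3), ('b', 1)]
```

In this model, the label-only sort is internally consistent (the labels end up non-decreasing), which is why a label-based check would report it as sorted even though the values are not in alphabetical order.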
