Confusion around sortedness with MultiIndex #10651
Comments
You are actually changing something else here. This is why using
#9212 might be enlightening for you
@jreback Thanks for the replies, but I'm still confused. Even when I use your example code to avoid inplace when changing the index, sortlevel returns a dataframe that is not lexsorted. It returns an index where the labels are sorted in that level (i.e. [0,0,1,2]), but the "levels" are not rearranged and are not lexicographically sorted (still ['b','d','a']). This appears to be the behaviour described in the general docs, which contradicts the description in the sortlevel reference, which states "Data will be lexicographically sorted by the chosen level followed by the other levels (in order)".
pls show a complete example. i am not sure what you are referring to.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'col1':['b','a','a'], 'col2':[1,3,1], 'col3':[1,2,3]})
In [3]: df
Out[3]:
  col1  col2  col3
0    b     1     1
1    a     3     2
2    a     1     3
In [4]: df2 = df.set_index(['col1','col2'])
In [5]: df2
Out[5]:
           col3
col1 col2
b    1        1
a    3        2
     1        3
In [6]: df3 = df2.sortlevel()
In [7]: df3
Out[7]:
           col3
col1 col2
a    1        3
     3        2
b    1        1
In [8]: df3.index.is_lexsorted()
Out[8]: True
# This is all as expected so far
In [9]: df3.index
Out[9]:
MultiIndex(levels=[[u'a', u'b'], [1, 3]],
           labels=[[0, 0, 1], [0, 1, 0]],
           names=[u'col1', u'col2'])
# Reorder the elements in the levels array, but reset labels too so data is the same
In [10]: df3.index = df3.index.set_levels(['b','a'], level='col1')
In [11]: df3.index = df3.index.set_labels([1,1,0], level='col1')
In [12]: df3
Out[12]:
           col3
col1 col2
a    1        3
     3        2
b    1        1
# Data looks the same as Out[7], and hence looks lexsorted, but
In [13]: df3.index.is_lexsorted()
Out[13]: False
In [14]: df4 = df3.sortlevel()
In [15]: df4
Out[15]:
           col3
col1 col2
b    1        1
a    1        3
     3        2
# This doesn't look lexsorted any more, but...
In [16]: df4.index.is_lexsorted()
Out[16]: True
so I don't think the
I don't think this is just about resetting indicators. Looking at the implementation of sortlevel(), it doesn't consider the "levels" array of the MultiIndex at all. It just copies that array across to the new index (https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L4833) and ensures the sort is by the "labels" array. That explains the behaviour I'm seeing, and it matches the comment in the general docs stating that it won't ensure the "labels" are sorted. It also suggests that the sortlevel() reference, which says it will sort lexicographically, is incorrect.
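To make the distinction concrete, here is a small pure-Python sketch of the situation in the session above. It is a toy model, not pandas code: the helper names (`decode`, `labels_sorted`, `values_sorted`, `sort_by_labels`) are made up for illustration, and it assumes that "lexsorted" means only that the integer label tuples are in order, which is what the outputs at In [13] and In [16] suggest.

```python
# Toy model of a two-level MultiIndex: each level stores its unique values
# ("levels") plus one integer position per row ("labels" in this thread).
# All helpers below are hypothetical, written only for this illustration.

def decode(levels, labels):
    """Return the row tuples that a levels/labels pair represents."""
    rows = zip(*labels)
    return [tuple(levels[i][code] for i, code in enumerate(row)) for row in rows]

def labels_sorted(labels):
    """True when the integer label tuples are in non-decreasing order."""
    rows = list(zip(*labels))
    return rows == sorted(rows)

def values_sorted(levels, labels):
    """True when the decoded row values are in non-decreasing order."""
    rows = decode(levels, labels)
    return rows == sorted(rows)

def sort_by_labels(labels):
    """Reorder rows by label tuples only; the levels are left untouched."""
    rows = sorted(zip(*labels))
    return [list(col) for col in zip(*rows)]

# State after In [11]: col1 levels reordered to ['b', 'a'], labels reset to match.
levels = [["b", "a"], [1, 3]]
labels = [[1, 1, 0], [0, 1, 0]]

before = (labels_sorted(labels), values_sorted(levels, labels))
sorted_labels = sort_by_labels(labels)
after = (labels_sorted(sorted_labels), values_sorted(levels, sorted_labels))

print(decode(levels, labels), before)        # rows as displayed in Out[12]
print(decode(levels, sorted_labels), after)  # rows as displayed in Out[15]
```

In this model the index that displays in sorted order has unsorted labels, and sorting by labels alone produces rows that display as b, a, a, which lines up with Out[13] and Out[16].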
I find the docs around MultiIndex slicers to be quite confusing. It implies the MultiIndex needs to be lexsorted, and introduces the sortlevel() function but then has a caveat that this doesn't actually ensure sortedness.
There are some more details of my explorations and questions on StackOverflow:
http://stackoverflow.com/questions/31427466/ensuring-lexicographical-sort-in-pandas-multiindex
I'd like either a simple one-liner to ensure lexsortedness, more reassurance that the usual ways to create a MultiIndex will lexsort the labels in each level, or some more elaboration in the docs about exactly what the issues will be with indexing and slicing if the labels are not lexsorted.
Does my example show a bug in is_lexsorted too? I would expect sorted2.is_lexsorted() to be false here, as 'col1' is not lexsorted.
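As a sketch of what the requested "ensure lexsortedness" step would have to do, the following pure-Python toy model sorts each level's values, remaps the integer labels to match, and only then sorts the rows. This is not pandas code and not a claim about how pandas does or should implement it; `sort_levels_and_rows` is a hypothetical helper, and the input mirrors the index shown in Out[15].

```python
def sort_levels_and_rows(levels, labels):
    """Sort each level's values, remap labels to match, then sort the rows.

    Hypothetical helper for illustration. Returns (new_levels, new_labels,
    order), where order[i] is the original position of the i-th result row.
    """
    new_levels, remapped = [], []
    for values, codes in zip(levels, labels):
        ordered = sorted(values)
        position = {value: i for i, value in enumerate(ordered)}
        new_levels.append(ordered)
        # Point every row at the same value, using its new position.
        remapped.append([position[values[c]] for c in codes])
    n_rows = len(remapped[0])
    order = sorted(range(n_rows), key=lambda r: tuple(col[r] for col in remapped))
    new_labels = [[col[r] for r in order] for col in remapped]
    return new_levels, new_labels, order

# Index from Out[15]: rows display as ('b', 1), ('a', 1), ('a', 3).
levels = [["b", "a"], [1, 3]]
labels = [[0, 1, 1], [0, 0, 1]]
data = [1, 3, 2]  # the col3 values, in the same row order

new_levels, new_labels, order = sort_levels_and_rows(levels, labels)
new_data = [data[r] for r in order]

print(new_levels, new_labels, new_data)
```

After this step both the label tuples and the displayed values are in order, and the levels and labels match the index printed at Out[9].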