BUG: different behaviors of sort_index() and sort_index(level=0) #13431

zym1010 · 2016-06-12T14:13:48Z

Inspired by some bug reports around multiindex sortedness (http://stackoverflow.com/questions/31427466/ensuring-lexicographical-sort-in-pandas-multiindex, #10651, #9212), I found that sort_index() sometimes can't make a multiindex ready for slicing, but sort_index(level=0) can (and so can sortlevel()).

In [119]: pd.__version__
Out[119]: u'0.18.1'

In [120]: df = pd.DataFrame({'col1': ['b','d','b','a'], 'col2': [3,1,1,2], 'data':['one','two','three','four']})

In [121]: df2 = df.set_index(['col1','col2'])

In [122]: df2.index.set_levels(['b','d','a'], level='col1', inplace=True)

In [123]: df2.index.set_labels([0,1,0,2], level='col1', inplace=True)

In [124]: df2.sortlevel()
Out[124]:
            data
col1 col2
b    1     three
     3       one
d    1       two
a    2      four

In [125]: df2.sort_index()
Out[125]:
            data
col1 col2
a    2      four
b    1     three
     3       one
d    1       two

In [126]: df2.sort_index(level=0)
Out[126]:
            data
col1 col2
b    1     three
     3       one
d    1       two
a    2      four

While df2.sort_index() does give a visually lexicographically sorted output, it DOES NOT support slicing.

df2.sort_index().loc['b':'d']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

# a lot of lines omitted here.

/Users/yimengzh/miniconda/envs/cafferc3/lib/python2.7/site-packages/pandas/indexes/multi.py in _partial_tup_index(self, tup, side)
   1488             raise KeyError('Key length (%d) was greater than MultiIndex'
   1489                            ' lexsort depth (%d)' %
-> 1490                            (len(tup), self.lexsort_depth))
   1491
   1492         n = len(tup)

KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

So I have two questions.

1. Is this the intended behavior? I thought level=0 and level=None were synonyms, but they are not. Looking at the code, https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3245-L3247 there is indeed special processing when level is not None.
2. What does "lexicographically sorted" mean? I think it should mean sorted in terms of levels, not labels. Is this what "lexicographically sorted" means in the doc for Advanced Indexing? If this is true, then I think sort_index(level=0) is correct, yet sort_index() is not.
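
To make the two candidate meanings concrete, here is a minimal pure-Python sketch (a toy model, not pandas code). It assumes a level is stored as a list of unique values plus integer positions into that list (the "labels" set with set_labels above), and reuses the col1 level from df2. The variable names are made up for illustration.

```python
# Toy model of the 'col1' level from df2 above (not pandas internals).
levels = ['b', 'd', 'a']      # order given to set_levels(...)
codes = [0, 1, 0, 2]          # integer labels given to set_labels(...)

# What is displayed for each row.
values = [levels[c] for c in codes]

# Order that respects the level (factor) order: sort by integer code.
by_codes = [levels[c] for c in sorted(codes)]

# "Naive" lexicographical order: sort by the displayed values.
by_values = sorted(values)

print(values)     # ['b', 'd', 'b', 'a']
print(by_codes)   # ['b', 'b', 'd', 'a']
print(by_values)  # ['a', 'b', 'b', 'd']
```

The first ordering matches the row order of sort_index(level=0) above, and the second matches the row order of sort_index().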

Thanks.

output of `pd.show_versions()`

In [130]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

jreback · 2016-06-12T14:24:18Z

Might be a bug. See the docs here (these have been updated since 0.18.1); the two calls should give the same result. But you will have to step through the code and have a look.

In [10]: df2.sort_index().index.lexsort_depth
Out[10]: 0

In [12]: df2.sort_index(level=0).index.lexsort_depth
Out[12]: 2
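
The two depths above can be reproduced with a small pure-Python sketch. This is a toy model under stated assumptions, not the pandas implementation: `lexsort_depth` here is a hypothetical helper that counts how many leading levels have their integer labels in non-decreasing order, and the integer labels were worked out by hand from the outputs in the report, assuming the levels stay `['b', 'd', 'a']` and `[1, 2, 3]` after sorting.

```python
def lexsort_depth(code_columns):
    """Largest k such that the first k code columns, read row by row,
    form a non-decreasing sequence of tuples (0 if none do)."""
    for depth in range(len(code_columns), 0, -1):
        rows = list(zip(*code_columns[:depth]))
        if all(a <= b for a, b in zip(rows, rows[1:])):
            return depth
    return 0

# Hand-derived integer labels (assumption: levels are unchanged by sorting).
after_sort_index = [[2, 0, 0, 1], [1, 0, 2, 0]]          # rows: a2, b1, b3, d1
after_sort_index_level0 = [[0, 0, 1, 2], [0, 2, 0, 1]]   # rows: b1, b3, d1, a2

print(lexsort_depth(after_sort_index))         # 0
print(lexsort_depth(after_sort_index_level0))  # 2
```

Under this model, the rows returned by sort_index() look sorted when printed, but their integer labels are not monotonic, which is consistent with the KeyError about lexsort depth 0.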

zym1010 · 2016-06-12T14:29:31Z

@jreback right, I have seen that. So I think the updated doc is good, without all those complicated descriptions of lexicographic sortedness. But maybe the current implementation is inconsistent with the updated doc...

Anyway, I think some workarounds for all multiindex-related issues are:

- when constructing the dataframe, always use the vanilla integer index
- after constructing the data in the dataframe, use set_index to create the multiindex
- don't ever play around with the levels of the index. Basically, don't use factors (categorical variables with orders)

Currently my workflow is compatible with the above three, so this bug is not a problem for me now, but I think this is a big issue needing some clarification.

zym1010 · 2016-06-12T16:16:46Z

@jreback

I think the problem is due to a potential reinitialization of factor levels in https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3245-L3259.

According to the code, if level is not None (which I believe implies index being a MultiIndex), then the MultiIndex.sortlevel is used, which respects factor level.

However, if level is None and the index is a MultiIndex, the code first checks whether it is sorted (in terms of factor levels, not naive lexicographical order). If it is not, the levels are reinitialized in naive lexicographical order, and _lexsort_indexer makes the labels follow naive lexicographical order. Yet the levels are not updated when the new dataframe is returned, because the reinitialized labels in https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3255-L3256 are NOT used to update the original index, which makes the levels and labels inconsistent.

This may explain #9212, I believe. Basically, whether sort_index() will do naive lexicographical sorting, or factor-respecting sorting depends on whether the original index is sorted in a factor-respecting way, which looks like a bug to me.
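
The two code paths described above can be mimicked with a toy pure-Python model. This is only a sketch of the behavior as described in this comment, not the pandas source: all function names are hypothetical, and the index is modeled as a list of level value lists plus a list of integer label columns.

```python
def argsort_rows(code_columns):
    """Row positions that sort the label tuples lexicographically."""
    rows = list(zip(*code_columns))
    return sorted(range(len(rows)), key=rows.__getitem__)

def is_lexsorted(code_columns):
    rows = list(zip(*code_columns))
    return all(a <= b for a, b in zip(rows, rows[1:]))

def take(code_columns, indexer):
    return [[col[i] for i in indexer] for col in code_columns]

def sort_with_level(levels, codes):
    # Path described for `level is not None`: order rows by existing labels.
    return take(codes, argsort_rows(codes))

def sort_without_level(levels, codes):
    # Path described for `level is None`: if the labels are not lexsorted,
    # rebuild them against levels in plain sorted order to get the indexer...
    work = codes
    if not is_lexsorted(codes):
        work = []
        for lev, col in zip(levels, codes):
            rank = {v: i for i, v in enumerate(sorted(lev))}
            work.append([rank[lev[c]] for c in col])
    indexer = argsort_rows(work)
    # ...but the indexer is applied to the ORIGINAL labels, so the original
    # levels are kept and the result need not be monotonic.
    return take(codes, indexer)

# df2's index from the report: levels and integer labels per level.
levels = [['b', 'd', 'a'], [1, 2, 3]]
codes = [[0, 1, 0, 2], [2, 0, 0, 1]]

print(sort_with_level(levels, codes))     # [[0, 0, 1, 2], [0, 2, 0, 1]]
print(sort_without_level(levels, codes))  # [[2, 0, 0, 1], [1, 0, 2, 0]]
```

In this model the second result has non-monotonic labels on the first level, which is the inconsistency between levels and labels described above.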

jreback · 2016-06-12T16:35:31Z

could very well be

do you want to put in place tests for this issue (and others), and make a change -- see if you can fix it without breaking anything else?

zym1010 · 2016-06-12T16:43:15Z

@jreback I'd like to, though I may have no time this month, and I'm a beginner with pandas, having started to use it just yesterday... But maybe this looks easy enough to me.

My main question is

what's the expected behavior of DataFrame.sort_index()? I believe that in any case it should return a sorted index ready for slicing, but should the factor levels be in naive order, or should the original factor levels be preserved?

Actually I found the comments at https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3253-L3254 funny. If labels.is_lexsorted(), then why is any further sorting needed?

jreback · 2016-06-12T16:49:38Z

.sort_index() and .sort_index(level=0) should yield the same index. The factors are the same (and don't change) in all cases.

See here for the contributing docs

zym1010 · 2016-06-12T17:00:01Z

Looks to me like simply removing https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3255-L3256 should do. But I have one question: does na_position have any effect for sort_index? It looks like there can't be any NaN in index labels (which are integers).

jreback · 2016-06-12T17:02:43Z

there can be NaN (though generally you shouldn't have them), behavior may be slightly odd / not completely tested though.

zym1010 · 2016-06-12T17:59:28Z

It looks like removing those lines will undo #8017. Actually, the current version of pandas (0.18.1) won't behave the same for sort_index(level=0) and sort_index() on the example dataframe in that issue either. It looks like adding columns with a multiindex is really troublesome.

I think the fundamental problem is that when using a multiindex, each subindex in it implicitly becomes a categorical variable (labels and levels), including things that are not categorical in nature, such as floats.

zym1010 · 2016-06-12T18:41:56Z

Basically, given the current implementation, I don't think one can tell the difference between "true" is_lexsorted() and accidental is_lexsorted(), because when a new value is added to a multiindex, it always goes at the back (so if a multiindex is sorted in the beginning, it will always be sorted when more values are added to it). Maybe add some flag, say flush, to sort_index, and let users choose whether to reconstruct the index or not.

jreback · 2016-06-14T13:12:46Z

it's possible that issue was confused because it exposed a printing issue.

I don't think there is any difference between true lexsorted and accidental, except that accidental might just not be recorded as such (so it's a bug in keeping state).

zym1010 · 2016-06-14T15:14:27Z

As you can see from the example in #8017, although print gives the correct result, the actual index level for the float index is wrong, and it's not lexsorted, so it's not ready for slicing.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]:

In [3]: np.random.seed(0)

In [4]: data = np.random.randn(3,4)

In [5]:

In [5]: df_multi_float = pd.DataFrame(data, index=list('def'), columns=pd.MultiIndex.from_tuples([('red', i) for i in [1., 3., 2., 5.]]))

In [6]: df_multi_float[('red', 4.0)] = 'world'

In [7]: a=df_multi_float.sort_index(axis=1)

In [8]: a
Out[8]:
        red
        1.0       2.0       3.0    4.0       5.0
d  1.764052  0.978738  0.400157  world  2.240893
e  1.867558  0.950088 -0.977278  world -0.151357
f -0.103219  0.144044  0.410599  world  1.454274

In [9]: a.columns
Out[9]:
MultiIndex(levels=[[u'red'], [1.0, 2.0, 3.0, 5.0, 4.0]],
           labels=[[0, 0, 0, 0, 0], [0, 1, 2, 4, 3]])

In [10]: pd.__version__
Out[10]: u'0.18.1'

In [11]: a.columns.is_lexsorted()
Out[11]: False
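
The idea of reconstructing the index mentioned earlier can be sketched in pure Python on the float level from Out[9]. This is a toy illustration only, not a pandas API: `rebuild_sorted_level` and `is_monotonic` are hypothetical helpers, and the sketch assumes the level values are unique and contain no NaN.

```python
def rebuild_sorted_level(level_values, codes):
    """Return (sorted level values, labels remapped to point into them)."""
    new_values = sorted(level_values)
    new_pos = {v: i for i, v in enumerate(new_values)}
    new_codes = [new_pos[level_values[c]] for c in codes]
    return new_values, new_codes

def is_monotonic(seq):
    return all(a <= b for a, b in zip(seq, seq[1:]))

# Second column level and its labels, copied from Out[9] above.
level_values = [1.0, 2.0, 3.0, 5.0, 4.0]
codes = [0, 1, 2, 4, 3]

new_values, new_codes = rebuild_sorted_level(level_values, codes)

print(is_monotonic(codes))      # False
print(new_values)               # [1.0, 2.0, 3.0, 4.0, 5.0]
print(new_codes)                # [0, 1, 2, 3, 4]
print(is_monotonic(new_codes))  # True
```

The displayed values are the same before and after (1.0 through 5.0 in order); only the stored level order and the integer labels change, which is what makes the labels monotonic.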

jreback added the MultiIndex label Jun 12, 2016

jreback mentioned this issue Jun 12, 2016

sort_index behavior differs for the same DataFrame? #9212

Closed

jreback added Bug Difficulty Intermediate labels Jun 12, 2016

jreback added this to the 0.18.2 milestone Jun 12, 2016

jreback changed the title ~~different behaviors of sort_index() and sort_index(level=0)~~ BUG: different behaviors of sort_index() and sort_index(level=0) Jun 12, 2016

jreback mentioned this issue Aug 17, 2016

BUG: multi-indexing sorting on axis=1 on >0 levels #14015

Closed

jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Sep 1, 2016

jorisvandenbossche mentioned this issue Feb 9, 2017

sort_index(axis=1) doesn't sort integers #15355

Closed

jreback mentioned this issue Mar 15, 2017

BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694

Closed

jreback added a commit to jreback/pandas that referenced this issue Mar 15, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

e7c0c14

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 xref pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 15, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

72bc7d0

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 xref pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

698e05f

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

a6f352c

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

54c6e93

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

1a9be09

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

685dadc

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 16, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

efaf233

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 19, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

dd85330

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

933710c

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

53c7e6c

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 22, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

083b4b0

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 23, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

7784ec0

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Mar 25, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

09f842d

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 2, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

ae3777e

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

84c8999

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 4, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

53b844d

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 6, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

a1390ce

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017

BUG: construct MultiIndex identically from levels/labels when concatting

47c67d6

closes pandas-dev#15622 closes pandas-dev#15687 closes pandas-dev#14015 closes pandas-dev#13431

jreback closed this as completed in f478e4f Apr 7, 2017
