
BUG: different behaviors of sort_index() and sort_index(level=0) #13431


Closed
zym1010 opened this issue Jun 12, 2016 · 12 comments

Comments

@zym1010
Contributor

zym1010 commented Jun 12, 2016

Inspired by some bug reports around multiindex sortedness (http://stackoverflow.com/questions/31427466/ensuring-lexicographical-sort-in-pandas-multiindex, #10651, #9212), I found that sort_index() sometimes can't make a multiindex ready for slicing, but sort_index(level=0) can (and so can sortlevel()).

In [119]: pd.__version__
Out[119]: u'0.18.1'

In [120]: df = pd.DataFrame({'col1': ['b','d','b','a'], 'col2': [3,1,1,2], 'data':['one','two','three','four']})

In [121]: df2 = df.set_index(['col1','col2'])

In [122]: df2.index.set_levels(['b','d','a'], level='col1', inplace=True)

In [123]: df2.index.set_labels([0,1,0,2], level='col1', inplace=True)

In [124]: df2.sortlevel()
Out[124]:
            data
col1 col2
b    1     three
     3       one
d    1       two
a    2      four

In [125]: df2.sort_index()
Out[125]:
            data
col1 col2
a    2      four
b    1     three
     3       one
d    1       two

In [126]: df2.sort_index(level=0)
Out[126]:
            data
col1 col2
b    1     three
     3       one
d    1       two
a    2      four

While df2.sort_index() does give a visually lexicographically sorted output, it DOES NOT support slicing.

df2.sort_index().loc['b':'d']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)

# a lot of lines omitted here.

/Users/yimengzh/miniconda/envs/cafferc3/lib/python2.7/site-packages/pandas/indexes/multi.py in _partial_tup_index(self, tup, side)
   1488             raise KeyError('Key length (%d) was greater than MultiIndex'
   1489                            ' lexsort depth (%d)' %
-> 1490                            (len(tup), self.lexsort_depth))
   1491
   1492         n = len(tup)

KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

So I have two questions.

  1. Is this the intended behavior? I thought level=0 and level=None were synonyms, but they are not. Looking at the code, https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3245-L3247 there is indeed special processing when level is not None.
  2. What does "lexicographically sorted" mean? I think it should mean sorted in terms of levels, not labels. Is this what "lexicographically sorted" means in the doc for Advanced Indexing? If this is true, then I think sort_index(level=0) is correct, yet sort_index() is not.
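
To make the second question concrete, here is a minimal pure-Python sketch of the two candidate orderings for the col1 level in the example above. It is only a simplified model and does not call pandas; the names `level_order` and `positions` are made up for illustration.

```python
# Simplified model of one index level (illustrative names, not pandas API):
# `level_order` holds the distinct values in their stored order, and
# `positions` holds one integer per row pointing into `level_order`.
level_order = ["b", "d", "a"]
positions = [0, 1, 0, 2]  # rows are b, d, b, a

values = [level_order[p] for p in positions]

# Sorting by stored position follows the level order b < d < a.
by_position = [level_order[p] for p in sorted(positions)]

# Sorting by the values themselves follows plain string order a < b < d.
by_value = sorted(values)

print(by_position)  # ['b', 'b', 'd', 'a']
print(by_value)     # ['a', 'b', 'b', 'd']
```

The first ordering matches the sort_index(level=0) output above, the second matches the sort_index() output.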

Thanks.

output of pd.show_versions()

In [130]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None
@jreback
Contributor

jreback commented Jun 12, 2016

This might be a bug. See the docs here, which have been updated since 0.18.1; the two calls should behave the same. You will have to step through the code and have a look.

In [10]: df2.sort_index().index.lexsort_depth
Out[10]: 0

In [12]: df2.sort_index(level=0).index.lexsort_depth
Out[12]: 2
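
The two depths can be reproduced with a small pure-Python sketch. This is a simplified stand-in for the idea of a lexsort depth, not pandas' implementation; the function below is hypothetical, and the integer labels are worked out by hand assuming col1 has levels ['b', 'd', 'a'] and col2 has levels [1, 2, 3].

```python
def lexsort_depth(label_columns):
    """Count how many leading levels are sorted when rows are compared
    by their integer labels (a simplified model, not pandas' own code)."""
    for depth in range(len(label_columns), 0, -1):
        rows = list(zip(*label_columns[:depth]))
        if rows == sorted(rows):
            return depth
    return 0


# Hand-derived integer labels, assuming col1 levels ['b', 'd', 'a']
# and col2 levels [1, 2, 3].
after_sort_index = [[2, 0, 0, 1], [1, 0, 2, 0]]         # rows a2, b1, b3, d1
after_sort_index_level0 = [[0, 0, 1, 2], [0, 2, 0, 1]]  # rows b1, b3, d1, a2

print(lexsort_depth(after_sort_index))         # 0
print(lexsort_depth(after_sort_index_level0))  # 2
```

Under this model the sort_index() result is unsorted even at the first level, which is consistent with the KeyError reported for the slice.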

@zym1010
Contributor Author

zym1010 commented Jun 12, 2016

@jreback Right, I have seen that. I think the updated doc is good, without all those complicated descriptions of lexicographic sortedness. But maybe the current implementation is inconsistent with the updated doc...

Anyway, I think some workarounds for all multiindex-related issues are:

  1. when constructing the dataframe, always use the vanilla, integer index
  2. after constructing the data in the dataframe, use set_index to create multiindex
  3. don't ever play around with the levels of the index. Basically, don't use factors (categorical variables with orders)

Currently my workflow is compatible with the above three, so this bug is not a problem for me now, but I think this is a big issue needing some clarification.
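
A minimal sketch of that workflow, using the same data as the original report. It relies only on DataFrame, set_index, sort_index and .loc, and assumes the levels are left exactly as set_index generates them.

```python
import pandas as pd

# Step 1: build the frame on the default integer index.
df = pd.DataFrame({
    "col1": ["b", "d", "b", "a"],
    "col2": [3, 1, 1, 2],
    "data": ["one", "two", "three", "four"],
})

# Step 2: create the multiindex only once the data is in place.
# Step 3: leave the generated levels untouched and just sort.
indexed = df.set_index(["col1", "col2"]).sort_index()

# Label slicing on the first level works on the sorted result.
subset = indexed.loc["b":"d"]
print(subset["data"].tolist())  # ['three', 'one', 'two']
```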

@zym1010
Contributor Author

zym1010 commented Jun 12, 2016

@jreback

I think the problem is due to a potential reinitialization of factor levels in https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3245-L3259.

According to the code, if level is not None (which I believe implies that the index is a MultiIndex), then MultiIndex.sortlevel is used, which respects the factor levels.

However, if level is None and the index is a MultiIndex, the code first checks whether it is sorted (in terms of factor levels, not naive lexicographical order). If not, the levels are reinitialized in naive lexicographical order, and _lexsort_indexer makes the labels follow naive lexicographical order. Yet the levels are not updated when returning the new dataframe, because the reinitialized labels in https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3255-L3256 are NOT used to update the original index, making the levels and labels inconsistent.

This may explain #9212, I believe. Basically, whether sort_index() does naive lexicographical sorting or factor-respecting sorting depends on whether the original index is sorted in a factor-respecting way, which looks like a bug to me.

@jreback
Contributor

jreback commented Jun 12, 2016

could very well be

Do you want to put tests in place for this issue (and the others) and make a change? See if you can fix it without breaking anything else.

@zym1010
Contributor Author

zym1010 commented Jun 12, 2016

@jreback I'd like to, though I may have no time this month, and I'm a beginner with pandas, having started using it just yesterday... But maybe this is easy enough for me.

My main question is

  1. What is the expected behavior of DataFrame.sort_index()? I believe that in any case it should return a sorted index ready for slicing, but should the factor levels be in naive order, or should the original factor levels be preserved?

Actually, I found the comments on https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3253-L3254 funny. If labels.is_lexsorted(), then why is any further sorting needed?

@jreback
Contributor

jreback commented Jun 12, 2016

.sort_index() and .sort_index(level=0) should yield the same index. The factors are the same (and don't change) in all cases.

See here for the contributing docs

@jreback jreback added this to the 0.18.2 milestone Jun 12, 2016
@jreback jreback changed the title different behaviors of sort_index() and sort_index(level=0) BUG: different behaviors of sort_index() and sort_index(level=0) Jun 12, 2016
@zym1010
Contributor Author

zym1010 commented Jun 12, 2016

It looks to me like simply removing https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3255-L3256 should do. But I have one question: does na_position have any effect for sort_index? It looks like there can't be any NaN in index labels (which are integers).

@jreback
Contributor

jreback commented Jun 12, 2016

there can be NaN (though generally you shouldn't have them), behavior may be slightly odd / not completely tested though.

@zym1010
Contributor Author

zym1010 commented Jun 12, 2016

It looks like removing those lines will undo #8017. Actually, the current version of pandas (0.18.1) doesn't give the same result for sort_index(level=0) and sort_index() on the example dataframe in that issue either. It looks like adding columns with a multiindex is really troublesome.

I think the fundamental problem is that when using a multiindex, each subindex in it implicitly becomes a categorical variable (labels and levels), including things that are not categorical in nature, such as floats.

@zym1010
Contributor Author

zym1010 commented Jun 12, 2016

Basically, I don't think that, given the current implementation, one can tell the difference between "true" is_lexsorted() and accidental is_lexsorted(), because when a new value is added to a multiindex, it always goes at the back (so if a multiindex is sorted in the beginning, it will always be sorted when more values are added to it). Maybe add some flag, say flush, to sort_index, and let users choose whether to reconstruct the index or not.
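
A rough pure-Python sketch of what "accidental" sortedness means here. The helper `append_value` is hypothetical and only models the append-at-the-back behavior described above; it is not how pandas stores or updates an index.

```python
def append_value(levels, labels, value):
    """Add one row; an unseen value goes at the end of `levels`
    (hypothetical helper modelling the behavior described above)."""
    if value not in levels:
        levels.append(value)
    labels.append(levels.index(value))


# Start from a level whose values and integer labels are both sorted.
levels, labels = [1.0, 2.0, 5.0], [0, 1, 2]
append_value(levels, labels, 4.0)

values = [levels[i] for i in labels]
print(levels)                    # [1.0, 2.0, 5.0, 4.0]
print(labels == sorted(labels))  # True: the integer labels still look sorted
print(values == sorted(values))  # False: the values themselves are not
```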

@jreback
Contributor

jreback commented Jun 14, 2016

It's possible that issue was confused because it exposed a printing issue.

I don't think there is any difference between true lexsorted and accidental, except that accidental might just not be recorded as such (so it's a bug in keeping state).

@zym1010
Contributor Author

zym1010 commented Jun 14, 2016

As you can see from the example in #8017, although print gives the correct result, the actual index level for the float index is wrong, and it's not lexsorted, so it's not ready for slicing.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: np.random.seed(0)

In [4]: data = np.random.randn(3,4)

In [5]: df_multi_float = pd.DataFrame(data, index=list('def'), columns=pd.MultiIndex.from_tuples([('red', i) for i in [1., 3., 2., 5.]]))

In [6]: df_multi_float[('red', 4.0)] = 'world'

In [7]: a=df_multi_float.sort_index(axis=1)

In [8]: a
Out[8]:
        red
        1.0       2.0       3.0    4.0       5.0
d  1.764052  0.978738  0.400157  world  2.240893
e  1.867558  0.950088 -0.977278  world -0.151357
f -0.103219  0.144044  0.410599  world  1.454274

In [9]: a.columns
Out[9]:
MultiIndex(levels=[[u'red'], [1.0, 2.0, 3.0, 5.0, 4.0]],
           labels=[[0, 0, 0, 0, 0], [0, 1, 2, 4, 3]])

In [10]: pd.__version__
Out[10]: u'0.18.1'

In [11]: a.columns.is_lexsorted()
Out[11]: False

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Sep 1, 2016
@jreback jreback closed this as completed in f478e4f Apr 7, 2017