BUG: sort_index/sortlevel fails MultiIndex after columns are added. #8017

Closed
8one6 opened this issue Aug 13, 2014 · 18 comments · Fixed by #8282
Labels: Bug, Indexing (related to indexing on series/frames, not to indexes themselves), MultiIndex

8one6 commented Aug 13, 2014

I have a DataFrame with a MultiIndex on the columns. The first level of the MultiIndex contains strings. The second, floats (though the problem persists if the second level is ints). I add a column to the DataFrame (which should not come last if the columns are sorted). I try to sort the DataFrame. The result does not seem to be sorted. The behavior is fine if the columns are simply an Index (even after adding columns). And the sort works fine in the MultiIndex case as long as no columns have been added since the DataFrame was created.

MWE:

import pandas as pd
import numpy as np

np.random.seed(0)
data = np.random.randn(3,4)

df_multi_float = pd.DataFrame(data, index=list('def'), columns=pd.MultiIndex.from_tuples([('red', i) for i in [1., 3., 2., 5.]]))

print(df_multi_float)

#OUTPUT
        red                              
          1         3         2         5
d  1.764052  0.400157  0.978738  2.240893
e  1.867558 -0.977278  0.950088 -0.151357
f -0.103219  0.410599  0.144044  1.454274

This sorts just fine as it is now:

print(df_multi_float.sort_index(axis=1))

#OUTPUT
        red                              
          1         2         3         5
d  1.764052  0.978738  0.400157  2.240893
e  1.867558  0.950088 -0.977278 -0.151357
f -0.103219  0.144044  0.410599  1.454274

But if I add a column to this DataFrame and then show it sorted, I get what looks like a wrong result (the new column remains last, rather than being placed second-to-last as it should be):

df_multi_float[('red', 4.0)] = 'world'

print(df_multi_float.sort_index(axis=1))

#OUTPUT
        red                                  red
          1         2         3         5      4
d  1.764052  0.978738  0.400157  2.240893  world
e  1.867558  0.950088 -0.977278 -0.151357  world
f -0.103219  0.144044  0.410599  1.454274  world

I'm able to produce this behavior on two systems. The first runs Pandas 0.14.0 and Numpy 1.8.1 and the second runs Pandas 0.14.1 and Numpy 1.8.2. This issue is described here: http://stackoverflow.com/questions/25287130/pandas-sort-index-fails-with-multiindex-containing-floats-as-one-level-when-col?noredirect=1#comment39408150_25287130

jreback (Contributor) commented Aug 13, 2014

@8one6

Just make the example ONLY the multi_float and just run it on one. Simpler/shorter is much better.

8one6 (Author) commented Aug 13, 2014

Sure. I'll make the change above. I had thought it was important to show that it worked fine with a float index that was not a multiindex. But no problem. (Done.)

jreback (Contributor) commented Aug 13, 2014

@8one6 thanks, your title says it all though

jreback changed the title from "BUG: sort_index fails with numerical level on MultiIndex after columns are added." to "BUG: sort_index fails with float index level on MultiIndex after columns are added." on Aug 13, 2014

8one6 (Author) commented Aug 13, 2014

@jreback Just one note on your title update. The problem is still there if all of the floats in my example are replaced by ints.

jreback changed the title from "BUG: sort_index fails with float index level on MultiIndex after columns are added." to "BUG: sort_index/sortlevel fails MultiIndex after columns are added." on Aug 13, 2014

jreback (Contributor) commented Aug 13, 2014

Yeah, I think it has to do with adding the column; will look.

jreback added this to the 0.15.0 milestone on Aug 13, 2014

8one6 (Author) commented Aug 22, 2014

I think the fix for this will be much deeper in the bowels of Pandas indices than I'm able to handle. Is there any other way I could help toward a patch for this bug?

jreback (Contributor) commented Aug 22, 2014

No problem. I would appreciate a pull request on any other issue. Thanks!

8one6 (Author) commented Sep 15, 2014

In an attempt to get around this issue, I started sticking a character at the end of some of my column names (to turn them from numbers into strings). While the sort_index function seems to work fine when the column names are purely letters, this bug seems to come back when they are a mix of numerals and letters.

import pandas as pd
import numpy as np

np.random.seed(0)
data = np.random.randn(3, 4)
df_all_string_indices = pd.DataFrame(data,
                                   columns=pd.MultiIndex.from_product([['2b', '1b'], ['2c', '1c']], names=['one', 'two']),
                                   index=list('xyz'))

print(df_all_string_indices)

one        2b                  1b          
two        2c        1c        2c        1c
x    1.764052  0.400157  0.978738  2.240893
y    1.867558 -0.977278  0.950088 -0.151357
z   -0.103219  0.410599  0.144044  1.454274

This looks fine, and it sorts fine:

print(df_all_string_indices.sort_index(axis=1))

one        1b                  2b          
two        1c        2c        1c        2c
x    2.240893  0.978738  0.400157  1.764052
y   -0.151357  0.950088 -0.977278  1.867558
z    1.454274  0.144044  0.410599 -0.103219

But when I add a new column and try to sort, this still goes wrong:

df_all_string_indices[('1a', 'new')] = 'hello'
print(df_all_string_indices.sort_index(axis=1))

one        1b                  2b               1a
two        1c        2c        1c        2c    new
x    2.240893  0.978738  0.400157  1.764052  hello
y   -0.151357  0.950088 -0.977278  1.867558  hello
z    1.454274  0.144044  0.410599 -0.103219  hello

(I.e. I think that after the sort_index, the ('1a', 'new') column should come first.)

jreback (Contributor) commented Sep 16, 2014

@8one6 pls take a look at #8282

this was actually a very strange bug.

In essence, when you add the column it is inserted in the multi-index at the end.

This means the index itself is no longer lexsorted (the lexsort_depth goes from 2 to 1), and the only way to lexsort it again is to reconstruct it in its entirety. I believe it was designed this way to avoid a complete refactorization every time something is inserted into a multi-index.
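
A toy illustration of the mechanism described above, in plain Python rather than pandas internals. It assumes a level is stored as a list of unique labels plus integer codes pointing into it, that a previously unseen label is appended at the end of the level (as stated above), and, purely for illustration, that a sort orders columns by their codes. The helper names `append_label`, `sort_by_codes` and `sort_by_values` are hypothetical.

```python
# Toy model of the second column level from the original report.
level = [1.0, 2.0, 3.0, 5.0]   # unique labels, sorted when the index was built
codes = [0, 2, 1, 3]           # columns appear in the order 1.0, 3.0, 2.0, 5.0


def append_label(level, codes, label):
    """Add one column; an unseen label goes at the END of the level (assumption)."""
    if label not in level:
        level = level + [label]
    return level, codes + [level.index(label)]


def sort_by_codes(level, codes):
    """Order columns by integer code; only correct while the level is sorted."""
    return [level[c] for c in sorted(codes)]


def sort_by_values(level, codes):
    """Order columns by the labels themselves; always correct."""
    return sorted(level[c] for c in codes)


level2, codes2 = append_label(level, codes, 4.0)
by_codes = sort_by_codes(level2, codes2)
by_values = sort_by_values(level2, codes2)
print(by_codes)   # [1.0, 2.0, 3.0, 5.0, 4.0]
print(by_values)  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

In this model the two orderings agree before the column is added and diverge afterwards, which matches the column order shown in the original report; rebuilding the level in sorted order and recomputing the codes is what makes them agree again.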

Secondarily, there was a display bug when using FloatIndexes

e.g.

setup

In [1]: np.random.seed(0)
In [2]: data = np.random.randn(3,4)
In [3]: df = DataFrame(data, index=list('def'), columns=MultiIndex.from_tuples([('red', i) for i in [1., 3., 2., 5.]]))
In [4]: df2 = pd.concat([df, DataFrame('world',
   ...:                                index=list('def'),
   ...:                                columns=MultiIndex.from_tuples([('red', 4.0)]))], axis=1)

master

In [5]:  df2
Out[5]: 
        red                                  red
          1         3         2         5      4
d  1.764052  0.400157  0.978738  2.240893  world
e  1.867558 -0.977278  0.950088 -0.151357  world
f -0.103219  0.410599  0.144044  1.454274  world

this PR

In [5]:  df2
Out[5]: 
        red                                     
          1         3         2         5      4
d  1.764052  0.400157  0.978738  2.240893  world
e  1.867558 -0.977278  0.950088 -0.151357  world
f -0.103219  0.410599  0.144044  1.454274  world

8one6 (Author) commented Sep 16, 2014

First off, thank you so much for your time on this. I'll try to give this a test ASAP. Do you think your PR also addresses the version of this issue that I highlighted in the post I put up yesterday (the one immediately before your last post)?

jreback (Contributor) commented Sep 16, 2014

@8one6 yes, it's the same issue (the printing issue is only with a Float64Index among the levels).

jorisvandenbossche (Member) commented

@jreback Are you sure it is only with a FloatIndex? If I change it to integers in the example above, I have the exact same behaviour

jreback (Contributor) commented Sep 16, 2014

@jorisvandenbossche you are talking about the printing or sorting issue?

jorisvandenbossche (Member) commented

@jreback both

With int columns:

>>> df2
        red                                  red
          1         3         2         5      4
d  0.395664  0.971959 -0.455013 -0.494885  world
e  1.964287 -0.750608 -0.987487 -0.691407  world
f -0.154933  0.630514  0.663228 -1.746483  world
>>> df2.sort_index(axis=1)
        red                        red       red
          1         2         3      4         5
d  0.395664 -0.455013  0.971959  world -0.494885
e  1.964287 -0.987487 -0.750608  world -0.691407
f -0.154933  0.663228  0.630514  world -1.746483
>>> df2.columns.levels[1]
Int64Index([1, 2, 3, 4, 5], dtype='int64')

jreback (Contributor) commented Sep 17, 2014

@jorisvandenbossche this is fixed/tested with all dtypes

8one6 (Author) commented Oct 29, 2014

I think there is still some lingering issue here. I still need to get a MWE up and running, but in the meantime, here is a screenshot showing the issue. I would say that df in its original state is not sorted on the columns. Calling df.sort_index(axis=1) doesn't seem to change anything. But calling df[sorted(df.columns)] does get the columns into the right order.

image

I.e. the thing to notice here is that the first columns in the first two display cells have 'ref' as their 0th position label. But that's not right in a sorted context since there are some ints available in that level of the index.

Just to make sure I'm not going nuts here, can you guys confirm that this looks like a bug and that it's worth the effort to put together a MWE to demonstrate from scratch?
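
For reference, the `df[sorted(df.columns)]` workaround mentioned above can be sketched on the frame from the original report. This is a minimal sketch that assumes Python 3 and a recent pandas, not the 0.14.x versions discussed in this thread; the names `df`, `cols` and `reordered` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
data = np.random.randn(3, 4)
cols = pd.MultiIndex.from_tuples([("red", i) for i in [1.0, 3.0, 2.0, 5.0]])
df = pd.DataFrame(data, index=list("def"), columns=cols)

# Add a column whose label belongs between 3.0 and 5.0 once sorted.
df[("red", 4.0)] = "world"

# Workaround: sort the column tuples in plain Python, then select in that order.
reordered = df[sorted(df.columns)]
print(list(reordered.columns))
```

Because the ordering is computed by Python's `sorted` on the column tuples, it does not depend on how the MultiIndex stores its levels. Under Python 3, `sorted` raises a `TypeError` when a level mixes strings and integers, so this sketch only applies when each level holds mutually comparable labels.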

jorisvandenbossche (Member) commented

That seems like a possible bug, as this sorts differently/correctly without a multi-index. Can you try to show a small reproducible example showing the issue?

8one6 (Author) commented Oct 29, 2014

I'll have a shot at it. It's odd because I have two DataFrames whose generation is very similar but which don't exhibit parallel behavior in this case. I.e. DF1 comes out of its process sorting just fine, but DF2 (which is the one up above) comes out sorted incorrectly, even though they have very similar structures. One key difference is that DF1 has many more columns than DF2. Not sure if that could be related. Either way, I'll have a look.
