BUG: Passing multiple levels to stack when having mixed integer/string level names (#8584) #8809

onesandzeroes · 2014-11-14T00:18:43Z

Stacking should now work as expected for the cases described in the bug report (#8584). So we can pass a list of both ints and strs to DataFrame.stack, provided the ints are level names. Mixed ints and strs when the ints aren't level names raises ValueError, but that's unchanged.

I've added a new hidden method _swaplevel_assume_numbers since the main cause of the bug was
ambiguity between ints as names and ints as level numbers, and this allows us to make sure the
swap is done correctly once we reach that point.

Another alternative would have been to add an as_numbers=True flag to the existing swaplevel method. We could keep existing behaviour intact by having the default be as_numbers=False. If that
seems like the better option now that you've seen the PR, let me know and it should be pretty easy
to switch.

import pandas as pd
from pandas import DataFrame, MultiIndex
from numpy.random import randn

columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
                                 names=['exp', 'animal', 'hair_length'])
df = DataFrame(randn(4, 4), columns=columns)

df.columns.names = ['exp', 'animal', 1]
print(df.stack(level=['animal', 1]))

# exp                    A         B
#   animal 1                        
#0 cat    long   0.758850  0.327957
#   dog    short  0.429752  0.236901
#1 cat    long   0.572633  1.057861
#   dog    short  0.802539  1.321191
#2 cat    long   0.283352  0.088000
#   dog    short -0.814458 -1.688529
#3 cat    long   2.009912 -1.738292
#   dog    short  1.272569 -0.133434

df.columns.names = ['exp', 'animal', 0]
print(df.stack(level=['animal', 0]))

# exp                    A         B
#   animal 0                        
#0 cat    long   0.758850  0.327957
#   dog    short  0.429752  0.236901
#1 cat    long   0.572633  1.057861
#   dog    short  0.802539  1.321191
#2 cat    long   0.283352  0.088000
#   dog    short -0.814458 -1.688529
#3 cat    long   2.009912 -1.738292
#   dog    short  1.272569 -0.133434
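
The ambiguity between integer names and integer positions can be illustrated without pandas. The sketch below is a simplified, hypothetical resolver (`resolve_level` is an illustrative name, not a pandas function). It assumes names are checked before positions, which is the lookup order this thread describes for swaplevel.

```python
def resolve_level(names, level):
    """Return the position that `level` refers to, checking names first."""
    if level in names:
        return names.index(level)
    if isinstance(level, int) and -len(names) <= level < len(names):
        return level % len(names)
    raise KeyError(f"Level {level!r} not found")


names = ['exp', 'animal', 0]

# A caller may mean "position 0" (the 'exp' level), but the integer 0 is
# also a level name, so a name-first lookup lands on position 2 instead.
print(resolve_level(names, 0))         # 2
print(resolve_level(names, 'animal'))  # 1

# With string-only names there is no conflict and 0 is a plain position.
print(resolve_level(['exp', 'animal', 'hair_length'], 0))  # 0
```

Any code that computes a position and then hands it to a name-first lookup can end up operating on a different level than intended, which is the failure mode this PR works around.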

jreback · 2014-11-14T12:52:21Z

pandas/tests/test_frame.py

+        df = DataFrame(randn(4, 4), columns=columns)
+        df2 = df.copy()
+        df2.columns.names = ['exp', 'animal', 0]
+        # GH #8584: Need to check that stacking works when a number


add blank lines between different blocks

jreback · 2014-11-14T12:59:04Z

I don't like this solution at all. It IS very overlapping with swaplevel and confusing. I think making this local to the swapping you are doing in reshape is more appropriate. Though to be honest I think using mixed labels/levels when they are not all labels or all positions should just raise anyhow. This is just asking for trouble.

I think you can upfront validate whether the passed levels are all labels or all positional. If not raise.

I think using named integer levels that are not equivalent to the actual level numbers should raise on MultiIndex construction. Simply avoid this problem. We DON'T need a repeat of the .ix behavior here.

onesandzeroes · 2014-11-14T22:45:39Z

> Though to be honest I think using mixed labels/levels when they are not all labels or all position should just raise anyhow.
>
> I think you can upfront validate whether the passed levels are all labels or all positional. If not raise.

This already happens and is unchanged. We only accept mixed int/str if the ints are labels, otherwise we raise. It's just tricky because, as you say, you can have a label that is within the range of the level numbers but doesn't correspond to its own level number.

> I think using named integer levels that are not equivalent to the actual level numbers should raise on MultiIndex construction.

This would probably be more sensible long term. I don't know what kind of knock-on effects it would have though. If you have an index with level names [0, 1], does that mean you can't pull the 0 level out of the index, or that you can but the 1 level automatically changes its label?

I'll try to think of a more elegant solution. In the meantime it might be good to get more feedback from @jorisvandenbossche about the motivation behind the bug report. How many people do we think are actually labelling levels with non-matching ints, and why?

jorisvandenbossche · 2014-11-15T10:53:32Z

> > Though to be honest I think using mixed labels/levels when they are not all labels or all position should just raise anyhow.
> >
> > I think you can upfront validate whether the passed levels are all labels or all positional. If not raise.
>
> This already happens and is unchanged. We only accept mixed int/str if the ints are labels, otherwise raise.

This indeed already happens, e.g. with all string level names, you get:

In [154]: df.stack(level=['animal', 2])
...
ValueError: level should contain all level names or all level numbers, not a mixture of the two.
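
The rule behind that error (all names, or all numbers, otherwise raise) can be sketched in a few lines. This is a hedged illustration based on the behaviour described in this thread, not the actual code in pandas/core/reshape.py; `classify_levels` is a hypothetical helper.

```python
MIXTURE_ERROR = ("level should contain all level names or all level numbers, "
                 "not a mixture of the two.")


def classify_levels(names, levels):
    """Decide whether `levels` should be read as names or as positions."""
    # If every requested level matches a level name, treat them all as names,
    # even when some of those names happen to be integers.
    if all(lev in names for lev in levels):
        return 'names'
    # Otherwise they must all be integers, read as positions.
    if all(isinstance(lev, int) for lev in levels):
        return 'numbers'
    raise ValueError(MIXTURE_ERROR)


names = ['exp', 'animal', 'hair_length']
print(classify_levels(names, ['animal', 'hair_length']))     # names
print(classify_levels(names, [1, 2]))                        # numbers
print(classify_levels(['exp', 'animal', 1], ['animal', 1]))  # names

try:
    classify_levels(names, ['animal', 2])
except ValueError as err:
    print(err)
```

The third call is the case this PR targets: the mixture is accepted only because the integer 1 is itself a level name.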

> get more feedback from @jorisvandenbossche about the motivation behind the bug report. How many people do we think are actually labelling levels with non-matching ints, and why?

To be honest, there was absolutely no motivation behind the bug report :-), apart from curiosity about what would happen if I tried mixed string-integer level names after seeing the new feature in the whatsnew entries. So I personally wouldn't use it. But I think that if our spec allows it, it should work correctly (or we should raise a warning that stack will not work correctly when we detect a case where an integer level name does not match its position).

> I think using named integer levels that are not equivalent to the actual level numbers should raise on MultiIndex construction.

I think this is a bit difficult, as the level names and the level order can change during operations. E.g. when having level names [0, 1] and then doing a swaplevel -> raise? Or when doing a stack, removing one of the levels also changes the positions -> raise?
Another option is to just disallow integer level names and restrict the level names to strings. This would solve any ambiguity in these cases.
The problem with this is that level names often originate from column names (so index values), and of course we allow almost any object to be an index value. So this would lead to difficulties/inconsistencies in that regard.
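
The concern about integer names drifting away from their positions can be shown with plain lists standing in for the level names. This is only a sketch; `int_names_match_positions` is a hypothetical check and not something pandas provides.

```python
def int_names_match_positions(names):
    """True if every integer level name equals the position it sits at."""
    return all(name == pos
               for pos, name in enumerate(names)
               if isinstance(name, int))


names = [0, 1, 2]
print(int_names_match_positions(names))  # True

# Swapping the first two levels reorders the names along with the levels.
swapped = [names[1], names[0], names[2]]
print(swapped, int_names_match_positions(swapped))  # [1, 0, 2] False

# Stacking level 0 removes it from the columns, so the rest shift left.
remaining = names[1:]
print(remaining, int_names_match_positions(remaining))  # [1, 2] False
```

A constraint enforced only at construction time would therefore be broken by ordinary operations on an index that started out valid.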

jorisvandenbossche · 2014-11-15T11:00:53Z

By the way, at the moment, the case with only integer column levels at the 'right' position (@jreback the case that would be allowed in your suggestion) also does not work:

In [168]: columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
   .....:                                  names=['exp', 'animal', 'hair_length'])

In [169]: df = DataFrame(randn(4, 4), columns=columns)

In [170]: df.columns.names = [0, 1, 2]

In [171]: df
Out[171]:
0         A         B         A         B
1       cat       cat       dog       dog
2      long      long     short     short
0 -0.978207  0.442300  1.952685 -0.909283
1 -0.121663  0.441968 -0.232617 -0.139095
2  0.346387 -1.027363  0.851873  1.242616
3  0.872214  0.138330 -2.902060 -0.220452

With [0, 1], level 0 is repeated and level 1 has disappeared:

In [172]: df.stack(level=[0,1])
Out[172]:
2          long     short
  0 0
0 A A -0.978207       NaN
    B  0.442300       NaN
  B A       NaN  1.952685
    B       NaN -0.909283
1 A A -0.121663       NaN
    B  0.441968       NaN
  B A       NaN -0.232617
    B       NaN -0.139095
2 A A  0.346387       NaN
    B -1.027363       NaN
  B A       NaN  0.851873
    B       NaN  1.242616
3 A A  0.872214       NaN
    B  0.138330       NaN
  B A       NaN -2.902060
    B       NaN -0.220452

With [0, 2], level 1 has also disappeared:

In [173]: df.stack(level=[0,2])
Out[173]:
2              long     short
  0 2
0 A long  -0.978207       NaN
    short  0.442300       NaN
  B long        NaN  1.952685
    short       NaN -0.909283
1 A long  -0.121663       NaN
    short  0.441968       NaN
  B long        NaN -0.232617
    short       NaN -0.139095
2 A long   0.346387       NaN
    short -1.027363       NaN
  B long        NaN  0.851873
    short       NaN  1.242616
3 A long   0.872214       NaN
    short  0.138330       NaN
  B long        NaN -2.902060
    short       NaN -0.220452

Only [1, 2] is working correctly:

In [174]: df.stack(level=[1,2])
Out[174]:
0                   A         B
  1   2
0 cat long  -0.978207  0.442300
  dog short  1.952685 -0.909283
1 cat long  -0.121663  0.441968
  dog short -0.232617 -0.139095
2 cat long   0.346387 -1.027363
  dog short  0.851873  1.242616
3 cat long   0.872214  0.138330
  dog short -2.902060 -0.220452

onesandzeroes · 2014-11-16T03:25:08Z

Alright, I've gone back over this and tried to do it in a more sensible way. No more slightly modified swaplevel method, instead we try to only pass level names to swaplevel (this isn't possible in 100% of cases because levels can have no name/None as their label). Should work for both mixed int/str labels and when all level names are ints. Test cases for both alternatives are included.
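
The idea of passing names rather than positions wherever possible can be sketched as follows. This is a simplified illustration under the assumptions stated in this comment, not the code added by the PR; `to_label_if_possible` is a hypothetical name.

```python
def to_label_if_possible(names, level_num):
    """Return the name of the level at `level_num`, or the number if unnamed."""
    name = names[level_num]
    # An unnamed level (None) cannot be addressed by name, so the position
    # is the only handle left for it.
    return level_num if name is None else name


print(to_label_if_possible(['exp', 'animal', 0], 2))          # 0
print(to_label_if_possible([2, 0, 1], 0))                     # 2
print(to_label_if_possible(['exp', None, 'hair_length'], 1))  # 1
```

In the first two calls the returned value is a name, so a name-first lookup resolves it to the intended level. The last call shows the remaining gap: for an unnamed level the position is passed through unchanged, and it could still collide with an integer name elsewhere in the index.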

Example output:

import pandas as pd
from pandas import DataFrame, MultiIndex
from numpy.random import randn

columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
                                 names=['exp', 'animal', 'hair_length'])
df = DataFrame(randn(4, 4), columns=columns)

df.columns.names = ['exp', 'animal', 1]
print(df.stack(level=['animal', 1]))
exp                    A         B
  animal 1                        
0 cat    long   0.642593 -0.178835
  dog    short  0.532905 -0.255136
1 cat    long   1.107472  0.374333
  dog    short  0.513399  0.176185
2 cat    long   0.410521 -0.085423
  dog    short -0.305200  1.009517
3 cat    long  -1.772436 -0.819156
  dog    short -0.923430  0.143579

df.columns.names = ['exp', 'animal', 0]
print(df.stack(level=['animal', 0]))
exp                    A         B
  animal 0                        
0 cat    long   0.642593 -0.178835
  dog    short  0.532905 -0.255136
1 cat    long   1.107472  0.374333
  dog    short  0.513399  0.176185
2 cat    long   0.410521 -0.085423
  dog    short -0.305200  1.009517
3 cat    long  -1.772436 -0.819156
  dog    short -0.923430  0.143579

df.columns.names = [0, 1, 2]
df.stack(level=[0,1])
Out[5]: 
2            long     short
  0 1                      
0 A cat  0.642593       NaN
    dog       NaN  0.532905
  B cat -0.178835       NaN
    dog       NaN -0.255136
1 A cat  1.107472       NaN
    dog       NaN  0.513399
  B cat  0.374333       NaN
    dog       NaN  0.176185
2 A cat  0.410521       NaN
    dog       NaN -0.305200
  B cat -0.085423       NaN
    dog       NaN  1.009517
3 A cat -1.772436       NaN
    dog       NaN -0.923430
  B cat -0.819156       NaN
    dog       NaN  0.143579

df.stack(level=[0,2])
Out[6]: 
1               cat       dog
  0 2                        
0 A long   0.642593       NaN
    short       NaN  0.532905
  B long  -0.178835       NaN
    short       NaN -0.255136
1 A long   1.107472       NaN
    short       NaN  0.513399
  B long   0.374333       NaN
    short       NaN  0.176185
2 A long   0.410521       NaN
    short       NaN -0.305200
  B long  -0.085423       NaN
    short       NaN  1.009517
3 A long  -1.772436       NaN
    short       NaN -0.923430
  B long  -0.819156       NaN
    short       NaN  0.143579

df.stack(level=[1,2])
Out[7]: 
0                   A         B
  1   2                        
0 cat long   0.642593 -0.178835
  dog short  0.532905 -0.255136
1 cat long   1.107472  0.374333
  dog short  0.513399  0.176185
2 cat long   0.410521 -0.085423
  dog short -0.305200  1.009517
3 cat long  -1.772436 -0.819156
  dog short -0.923430  0.143579

# Out of order int level names
df.columns.names = [2, 0, 1]

df.stack(level=[0, 2])
Out[10]: 
1            long     short
  0   2                    
0 cat A  0.642593       NaN
      B -0.178835       NaN
  dog A       NaN  0.532905
      B       NaN -0.255136
1 cat A  1.107472       NaN
      B  0.374333       NaN
  dog A       NaN  0.513399
      B       NaN  0.176185
2 cat A  0.410521       NaN
      B -0.085423       NaN
  dog A       NaN -0.305200
      B       NaN  1.009517
3 cat A -1.772436       NaN
      B -0.819156       NaN
  dog A       NaN -0.923430
      B       NaN  0.143579

df.stack(level=[1,2])
Out[11]: 
0               cat       dog
  1     2                    
0 long  A  0.642593       NaN
        B -0.178835       NaN
  short A       NaN  0.532905
        B       NaN -0.255136
1 long  A  1.107472       NaN
        B  0.374333       NaN
  short A       NaN  0.513399
        B       NaN  0.176185
2 long  A  0.410521       NaN
        B -0.085423       NaN
  short A       NaN -0.305200
        B       NaN  1.009517
3 long  A -1.772436       NaN
        B -0.819156       NaN
  short A       NaN -0.923430
        B       NaN  0.143579

jreback · 2014-11-16T14:48:37Z

pandas/core/reshape.py

+        # Workaround the edge case where 0 is one of the column names,
+        # which interferes with trying to sort based on the first
+        # level
+        if 0 in this.columns.names:


isn't this the same? (e.g. if 0 is in the column names then the result is 0?)

onesandzeroes · 2014-11-16

We could use the same _convert_level_number() function to make sure we get names[0] in case of a potential conflict, if that's what you mean. This is the section that was raising the lexsort depth error in the original bug report, which is sort of separate from stacking the wrong levels.

Want me to replace the if/else with the _convert_level_number() function?

jreback · 2014-11-17

yes, try that

onesandzeroes · 2014-11-17T03:55:17Z

OK, using _convert_level_number() now to find the right level to sort. The build errored out and as far as I can tell I don't have permissions to force a rebuild. Not sure how to do it without making changes. If this looks OK I can squash and re-upload which should run the Travis build again.

jreback · 2014-11-17T11:11:52Z

looks good!

ok, give it a nice squash into a single commit and I think it's good to go.

Add test case for mixed type stacking
Used wrong var name in the assert
Method to swap levels assuming ints are level numbers
Fix _stack_multi_columns to deal with mixed strs/ints
Extra testcases
Add fix to the release notes
Convert to label before swaplevel if possible
Revert "Method to swap levels assuming ints are level numbers"
This reverts commit 61f96fd3cb23cda9f9c7a6837b145ebd247a55cc.
More test cases
Use _convert_level_number() to sort columns

onesandzeroes · 2014-11-17T12:40:50Z

Squashed and green, should be good to go.

jreback · 2014-11-17T12:44:14Z

@jorisvandenbossche looks good to me, merge when ready

BUG: Passing multiple levels to stack when having mixed integer/string level names (#8584)

jorisvandenbossche · 2014-11-17T15:25:57Z

@onesandzeroes Thanks a lot!

@jreback about the integer level names issue we discussed above: I think it is a more general problem than this PR. E.g. if you do swaplevel on a MultiIndex with integer level names, it already looks at the names first and not the position (so we already have some .ix-like behaviour).

jreback reviewed Nov 14, 2014

jreback added the API Design and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Nov 14, 2014

jreback added this to the 0.15.2 milestone Nov 14, 2014

onesandzeroes force-pushed the stackfix branch from 625e2ce to 6a4f748 Compare November 16, 2014 03:13

jreback reviewed Nov 16, 2014

onesandzeroes force-pushed the stackfix branch from 6a4f748 to 1e8a133 Compare November 17, 2014 02:36

onesandzeroes force-pushed the stackfix branch from 1e8a133 to 4ae90ae Compare November 17, 2014 11:49

jorisvandenbossche added a commit that referenced this pull request Nov 17, 2014

Merge pull request #8809 from onesandzeroes/stackfix

0f899f4

BUG: Passing multiple levels to stack when having mixed integer/string level names (#8584)

jorisvandenbossche merged commit 0f899f4 into pandas-dev:master Nov 17, 2014

WillAyd mentioned this pull request May 8, 2018

Data is mismatched with labels after stack with MultiIndex columns #20945

Closed

jorisvandenbossche mentioned this pull request Jun 29, 2018

API: unclear what integer level name references: name or position? #21677

Open
