BUG: Passing multiple levels to stack when having mixed integer/string level names (#8584) #8809

Merged
jorisvandenbossche merged 1 commit into pandas-dev:master on Nov 17, 2014

Conversation

onesandzeroes
Contributor

closes #8584

Stacking should now work as expected for the cases described in the bug report (#8584): we can pass a list mixing ints and strs to DataFrame.stack, provided the ints are level names. Mixing ints and strs when the ints aren't level names still raises ValueError; that behaviour is unchanged.

I've added a new hidden method _swaplevel_assume_numbers since the main cause of the bug was
ambiguity between ints as names and ints as level numbers, and this allows us to make sure the
swap is done correctly once we reach that point.

Another alternative would have been to add an as_numbers=True flag to the existing swaplevel method. We could keep existing behaviour intact by having the default be as_numbers=False. If that
seems like the better option now that you've seen the PR, let me know and it should be pretty easy
to switch.

import pandas as pd
from pandas import DataFrame, MultiIndex
from numpy.random import randn

columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
                                 names=['exp', 'animal', 'hair_length'])
df = DataFrame(randn(4, 4), columns=columns)

df.columns.names = ['exp', 'animal', 1]
print(df.stack(level=['animal', 1]))

# exp                    A         B
#   animal 1                        
#0 cat    long   0.758850  0.327957
#   dog    short  0.429752  0.236901
#1 cat    long   0.572633  1.057861
#   dog    short  0.802539  1.321191
#2 cat    long   0.283352  0.088000
#   dog    short -0.814458 -1.688529
#3 cat    long   2.009912 -1.738292
#   dog    short  1.272569 -0.133434

df.columns.names = ['exp', 'animal', 0]
print(df.stack(level=['animal', 0]))

# exp                    A         B
#   animal 0                        
#0 cat    long   0.758850  0.327957
#   dog    short  0.429752  0.236901
#1 cat    long   0.572633  1.057861
#   dog    short  0.802539  1.321191
#2 cat    long   0.283352  0.088000
#   dog    short -0.814458 -1.688529
#3 cat    long   2.009912 -1.738292
#   dog    short  1.272569 -0.133434

df = DataFrame(randn(4, 4), columns=columns)
df2 = df.copy()
df2.columns.names = ['exp', 'animal', 0]
# GH #8584: Need to check that stacking works when a number
Contributor

add blank lines between different blocks

@jreback
Contributor

jreback commented Nov 14, 2014

I don't like this solution at all. It IS heavily overlapping with swaplevel, and confusing. I think making this local to the swapping you are doing in reshape is more appropriate. Though to be honest I think using mixed labels/levels when they are not all labels or all position should just raise anyhow. This is just asking for trouble.

I think you can upfront validate whether the passed levels are all labels or all positional. If not raise.

I think using named integer levels that are not equivalent to the actual level numbers should raise on MultiIndex construction. Simply avoid this problem. We DON'T need a repeat of the .ix behavior here.
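
The upfront check suggested here can be sketched in plain Python. This is only an illustration under assumptions: `classify_levels` is a hypothetical helper, not a pandas function, and it works on a plain list of level names rather than a real MultiIndex. The error text is the message `stack` already raises for a mixture.

```python
def classify_levels(level, names):
    """Return "names" or "numbers" for a list of requested levels.

    Hypothetical helper for illustration; not part of pandas.
    """
    if all(lev in names for lev in level):
        # Every entry is an existing level name (ints are allowed as names).
        return "names"
    if all(isinstance(lev, int) for lev in level):
        # Every entry is an int and at least one is not a name:
        # treat them all as positions.
        return "numbers"
    raise ValueError(
        "level should contain all level names or all level numbers, "
        "not a mixture of the two."
    )


names = ["exp", "animal", 1]
print(classify_levels(["animal", 1], names))  # names
print(classify_levels([0, 2], names))         # numbers
```

A request such as `["animal", 2]` against all-string names falls through both branches and raises. Note that this sketch does not address the harder case raised in this thread, where an integer name lies within the range of valid positions but sits at a different position.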

@jreback added the API Design and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels on Nov 14, 2014
@jreback added this to the 0.15.2 milestone on Nov 14, 2014
@onesandzeroes
Contributor Author

> Though to be honest I think using mixed labels/levels when they are not all labels or all position should just raise anyhow.
>
> I think you can upfront validate whether the passed levels are all labels or all positional. If not raise.

This already happens and is unchanged. We only accept mixed int/str if the ints are labels, otherwise we raise. It's just tricky because, as you say, a label can be within the range of the level numbers without corresponding to its own level number.

> I think using named integer levels that are not equivalent to the actual level numbers should raise on MultiIndex construction.

This would probably be more sensible long term. I don't know what kind of knock-on effects it would have, though. If you have an index with level names [0, 1], does that mean you can't pull the 0 level out of the index, or that you can but the 1 level automatically changes label?

I'll try to think of a more elegant solution. In the meantime it might be good to get more feedback from @jorisvandenbossche about the motivation behind the bug report. How many people do we think are actually labelling levels with non-matching ints, and why?

@jorisvandenbossche
Member

> Though to be honest I think using mixed labels/levels when they are not all labels or all position should just raise anyhow.
>
> I think you can upfront validate whether the passed levels are all labels or all positional. If not raise.
>
> This already happens and is unchanged. We only accept mixed int/str if the ints are labels, otherwise raise.

This indeed already happens, e.g. with all-string level names you get:

In [154]: df.stack(level=['animal', 2])
...
ValueError: level should contain all level names or all level numbers, not a mixture of the two.

> get more feedback from @jorisvandenbossche about the motivation behind the bug report. How many people do we think are actually labelling levels with non-matching ints, and why?

To be honest, there was absolutely no motivation behind the bug report :-), apart from curiosity about what would happen if I tried mixed string-integer level names after seeing the new feature in the whatsnew entries. So I personally wouldn't use it. But I think that if our spec allows it, it should work correctly (or we should raise a warning that stack will not work correctly when we detect a case where an integer level name does not match its position).

> I think using named integer levels that are not equivalent to the actual level numbers should raise on MultiIndex construction.

  • I think this is a bit difficult, as the level names and the level order can change during operations. E.g. with level names [0, 1], should a swaplevel raise? Or a stack that removes one of the levels and so changes the positions: should that raise?
  • Another option is to disallow integer level names entirely and restrict level names to strings. This would resolve any ambiguity in these cases.
    The problem with this is that level names often originate from column names (so index values), and of course we allow almost any object to be an index value. So this would lead to difficulties/inconsistencies in that regard.
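
To make the first point concrete, here is a small check (a sketch, assuming a reasonably recent pandas; the index values are made up for illustration). Integer level names that match their positions at construction stop matching after `swaplevel`, and `droplevel` stands in here for any operation that removes a level.

```python
import pandas as pd

# Integer names that match their positions at construction time.
mi = pd.MultiIndex.from_tuples(
    [("A", "cat", "long"), ("B", "dog", "short")],
    names=[0, 1, 2],
)

swapped = mi.swaplevel(0, 1)  # the level named 1 is now at position 0
dropped = mi.droplevel(0)     # the level named 1 moves to position 0

print(list(swapped.names))
print(list(dropped.names))
```

So a check performed only at construction time would not be enough to guarantee that names and positions stay in agreement.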

@jorisvandenbossche
Member

By the way, at the moment, the case with only integer column levels at the 'right' position (@jreback the case that would be allowed in your suggestion) also does not work:

In [168]: columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
   .....:                                  names=['exp', 'animal', 'hair_length'])

In [169]: df = DataFrame(randn(4, 4), columns=columns)

In [170]: df.columns.names = [0, 1, 2]

In [171]: df
Out[171]:
0         A         B         A         B
1       cat       cat       dog       dog
2      long      long     short     short
0 -0.978207  0.442300  1.952685 -0.909283
1 -0.121663  0.441968 -0.232617 -0.139095
2  0.346387 -1.027363  0.851873  1.242616
3  0.872214  0.138330 -2.902060 -0.220452

With [0, 1], level 0 is repeated and level 1 has disappeared:

In [172]: df.stack(level=[0,1])
Out[172]:
2          long     short
  0 0
0 A A -0.978207       NaN
    B  0.442300       NaN
  B A       NaN  1.952685
    B       NaN -0.909283
1 A A -0.121663       NaN
    B  0.441968       NaN
  B A       NaN -0.232617
    B       NaN -0.139095
2 A A  0.346387       NaN
    B -1.027363       NaN
  B A       NaN  0.851873
    B       NaN  1.242616
3 A A  0.872214       NaN
    B  0.138330       NaN
  B A       NaN -2.902060
    B       NaN -0.220452

With [0, 2], level 1 has also disappeared:

In [173]: df.stack(level=[0,2])
Out[173]:
2              long     short
  0 2
0 A long  -0.978207       NaN
    short  0.442300       NaN
  B long        NaN  1.952685
    short       NaN -0.909283
1 A long  -0.121663       NaN
    short  0.441968       NaN
  B long        NaN -0.232617
    short       NaN -0.139095
2 A long   0.346387       NaN
    short -1.027363       NaN
  B long        NaN  0.851873
    short       NaN  1.242616
3 A long   0.872214       NaN
    short  0.138330       NaN
  B long        NaN -2.902060
    short       NaN -0.220452

Only [1, 2] is working correctly:

In [174]: df.stack(level=[1,2])
Out[174]:
0                   A         B
  1   2
0 cat long  -0.978207  0.442300
  dog short  1.952685 -0.909283
1 cat long  -0.121663  0.441968
  dog short -0.232617 -0.139095
2 cat long   0.346387 -1.027363
  dog short  0.851873  1.242616
3 cat long   0.872214  0.138330
  dog short -2.902060 -0.220452

@onesandzeroes
Contributor Author

Alright, I've gone back over this and tried to do it in a more sensible way. There is no more slightly modified swaplevel method; instead we try to pass only level names to swaplevel (this isn't possible in 100% of cases because levels can have no name/None as their label). It should work both for mixed int/str labels and when all level names are ints. Test cases for both alternatives are included.
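
The idea of handing `swaplevel` a label whenever one exists could look roughly like this. It is a simplified sketch rather than the code in this PR: `position_to_label` is a hypothetical name, and it takes a plain list of names instead of a MultiIndex.

```python
def position_to_label(level_num, names):
    """Translate a level position into a value that is safe to look up by name.

    Hypothetical helper for illustration only.
    """
    label = names[level_num]
    if label is None:
        # Unnamed level: there is no label to use, so fall back to the position.
        return level_num
    return label


print(position_to_label(2, ["exp", "animal", 0]))  # 0, the label at position 2
print(position_to_label(1, [None, None, None]))    # 1, no label available
```

The fallback branch is the case that cannot be made fully unambiguous, since an unnamed level can only be addressed by its position.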

Example output:

import pandas as pd
from pandas import DataFrame, MultiIndex
from numpy.random import randn

columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
                                 names=['exp', 'animal', 'hair_length'])
df = DataFrame(randn(4, 4), columns=columns)

df.columns.names = ['exp', 'animal', 1]
print(df.stack(level=['animal', 1]))
exp                    A         B
  animal 1                        
0 cat    long   0.642593 -0.178835
  dog    short  0.532905 -0.255136
1 cat    long   1.107472  0.374333
  dog    short  0.513399  0.176185
2 cat    long   0.410521 -0.085423
  dog    short -0.305200  1.009517
3 cat    long  -1.772436 -0.819156
  dog    short -0.923430  0.143579

df.columns.names = ['exp', 'animal', 0]
print(df.stack(level=['animal', 0]))
exp                    A         B
  animal 0                        
0 cat    long   0.642593 -0.178835
  dog    short  0.532905 -0.255136
1 cat    long   1.107472  0.374333
  dog    short  0.513399  0.176185
2 cat    long   0.410521 -0.085423
  dog    short -0.305200  1.009517
3 cat    long  -1.772436 -0.819156
  dog    short -0.923430  0.143579

df.columns.names = [0, 1, 2]
df.stack(level=[0,1])
Out[5]: 
2            long     short
  0 1                      
0 A cat  0.642593       NaN
    dog       NaN  0.532905
  B cat -0.178835       NaN
    dog       NaN -0.255136
1 A cat  1.107472       NaN
    dog       NaN  0.513399
  B cat  0.374333       NaN
    dog       NaN  0.176185
2 A cat  0.410521       NaN
    dog       NaN -0.305200
  B cat -0.085423       NaN
    dog       NaN  1.009517
3 A cat -1.772436       NaN
    dog       NaN -0.923430
  B cat -0.819156       NaN
    dog       NaN  0.143579

df.stack(level=[0,2])
Out[6]: 
1               cat       dog
  0 2                        
0 A long   0.642593       NaN
    short       NaN  0.532905
  B long  -0.178835       NaN
    short       NaN -0.255136
1 A long   1.107472       NaN
    short       NaN  0.513399
  B long   0.374333       NaN
    short       NaN  0.176185
2 A long   0.410521       NaN
    short       NaN -0.305200
  B long  -0.085423       NaN
    short       NaN  1.009517
3 A long  -1.772436       NaN
    short       NaN -0.923430
  B long  -0.819156       NaN
    short       NaN  0.143579

df.stack(level=[1,2])
Out[7]: 
0                   A         B
  1   2                        
0 cat long   0.642593 -0.178835
  dog short  0.532905 -0.255136
1 cat long   1.107472  0.374333
  dog short  0.513399  0.176185
2 cat long   0.410521 -0.085423
  dog short -0.305200  1.009517
3 cat long  -1.772436 -0.819156
  dog short -0.923430  0.143579

# Out of order int level names
df.columns.names = [2, 0, 1]

df.stack(level=[0, 2])
Out[10]: 
1            long     short
  0   2                    
0 cat A  0.642593       NaN
      B -0.178835       NaN
  dog A       NaN  0.532905
      B       NaN -0.255136
1 cat A  1.107472       NaN
      B  0.374333       NaN
  dog A       NaN  0.513399
      B       NaN  0.176185
2 cat A  0.410521       NaN
      B -0.085423       NaN
  dog A       NaN -0.305200
      B       NaN  1.009517
3 cat A -1.772436       NaN
      B -0.819156       NaN
  dog A       NaN -0.923430
      B       NaN  0.143579

df.stack(level=[1,2])
Out[11]: 
0               cat       dog
  1     2                    
0 long  A  0.642593       NaN
        B -0.178835       NaN
  short A       NaN  0.532905
        B       NaN -0.255136
1 long  A  1.107472       NaN
        B  0.374333       NaN
  short A       NaN  0.513399
        B       NaN  0.176185
2 long  A  0.410521       NaN
        B -0.085423       NaN
  short A       NaN -0.305200
        B       NaN  1.009517
3 long  A -1.772436       NaN
        B -0.819156       NaN
  short A       NaN -0.923430
        B       NaN  0.143579

# Workaround the edge case where 0 is one of the column names,
# which interferes with trying to sort based on the first
# level
if 0 in this.columns.names:
Contributor

isn't this the same? (e.g. if 0 is in the column names then the result is 0?)

Contributor Author

We could use the same _convert_level_number() function to make sure we get names[0] in case of a potential conflict, if that's what you mean. This is the section that was raising the lexsort depth error in the original bug report, which is sort of separate from stacking the wrong levels.

Want me to replace the if/else with the _convert_level_number() function?

Contributor

yes, try that

@onesandzeroes
Contributor Author

OK, using _convert_level_number() now to find the right level to sort. The build errored out and as far as I can tell I don't have permissions to force a rebuild. Not sure how to do it without making changes. If this looks OK I can squash and re-upload which should run the Travis build again.

@jreback
Contributor

jreback commented Nov 17, 2014

looks good!

ok, give a nice squash into a single commit and i think good to go.

Add test case for mixed type stacking

Used wrong var name in the assert

Method to swap levels assuming ints are level numbers

Fix _stack_multi_columns to deal with mixed strs/ints

Extra testcases

Add fix to the release notes

Convert to label before swaplevel if possible

Revert "Method to swap levels assuming ints are level numbers"

This reverts commit 61f96fd3cb23cda9f9c7a6837b145ebd247a55cc.

More test cases

Use _convert_level_number() to sort columns
@onesandzeroes
Contributor Author

Squashed and green, should be good to go.

@jreback
Contributor

jreback commented Nov 17, 2014

@jorisvandenbossche looks good to me, merge when ready

jorisvandenbossche added a commit that referenced this pull request Nov 17, 2014
BUG: Passing multiple levels to stack when having mixed integer/string level names (#8584)
@jorisvandenbossche merged commit 0f899f4 into pandas-dev:master on Nov 17, 2014
@jorisvandenbossche
Member

@onesandzeroes Thanks a lot!

@jreback about the integer level names issue we discussed above, I think it is a more general problem than this PR: e.g. if you do swaplevel on a multi-index with integer level names, it already looks at the names first and not the position (so we already have some .ix-like behaviour ..)
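
A quick way to see the names-first lookup (a sketch, assuming a recent pandas; the index values are invented for illustration): with out-of-order integer names, `swaplevel(0, 1)` swaps the levels named 0 and 1, not the levels at positions 0 and 1.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("A", "cat", "long"), ("B", "dog", "short")],
    names=[2, 0, 1],
)

swapped = mi.swaplevel(0, 1)

# Swapping by position would give names [0, 2, 1];
# swapping by name gives [2, 1, 0].
print(list(swapped.names))
print(swapped[0])
```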
