BUG: Passing multiple levels to stack when having mixed integer/string level names (#8584) #8809
df = DataFrame(randn(4, 4), columns=columns)
df2 = df.copy()
df2.columns.names = ['exp', 'animal', 0]
# GH #8584: Need to check that stacking works when a number
add blank lines between different blocks
I don't like this solution at all. It overlaps heavily with swaplevel and is confusing. I think making this local to the swapping you are doing in reshape is more appropriate. Though to be honest I think using mixed labels/levels when they are not all labels or all positions should just raise anyhow. This is just asking for trouble. I think you can validate upfront whether the passed levels are all labels or all positional. If not, raise. I think using named integer levels that are not equivalent to the actual level numbers should raise on MultiIndex construction. Simply avoid this problem. We DON'T need a repeat of the
This already happens and is unchanged. We only accept mixed int/str if the ints are labels, otherwise we raise. It's just tricky because, as you say, you can have a label that is within the range of the level numbers but doesn't correspond to its level number.
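For readers following along, the rule described here can be sketched in plain Python. This is only an illustration under stated assumptions: `classify_stack_levels` is a hypothetical name, it works on a plain list of level names rather than a MultiIndex, and it is not the pandas implementation.

```python
def classify_stack_levels(levels, names):
    """Decide how a list of requested levels should be interpreted.

    Hypothetical helper for illustration; not the pandas implementation.
    Returns "labels" if every entry is a level name, "positions" if every
    entry is an integer, and raises ValueError for a mixture of the two.
    """
    # Names win: an int that is also a level name counts as a label.
    if all(level in names for level in levels):
        return "labels"
    # Otherwise every entry must be an int, read as a level number.
    if all(isinstance(level, int) for level in levels):
        return "positions"
    raise ValueError(
        "levels must be all level names or all level numbers, got %r"
        % (levels,)
    )


names = ['exp', 'animal', 0]
print(classify_stack_levels(['animal', 0], names))  # every entry is a name
print(classify_stack_levels([0, 1], names))         # 1 is not a name, so ints
```

With these names, `['animal', 1]` raises because 1 is not a level name and 'animal' is not an int. The `[0, 1]` case shows the tricky part: 0 is a valid label here, yet the list as a whole is read positionally.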
This would probably be more sensible long term. Don't know what kind of knock-on effects it would have though. If you have an index with levels I'll try to think of a more elegant solution. In the meantime it might be good to get more feedback from @jorisvandenbossche about the motivation behind the bug report. How many people do we think are actually labelling levels with non-matching ints, and why?
This indeed already happens, eg with all string level names, you get:
To be honest, there was absolutely no motivation behind the bug report :-), apart from curiosity what would happen if I tried mixed string-integer level names after seeing the new feature in the whatsnew entries. So I personally wouldn't use it. But I think that if our spec allows it, it should work correctly (or we should raise a warning that
By the way, at the moment, the case with only integer column levels at the 'right' position (@jreback the case that would be allowed in your suggestion) also does not work:
With
With
Only
625e2ce to 6a4f748
Alright, I've gone back over this and tried to do it in a more sensible way. No more slightly modified swaplevel method, instead we try to only pass level names to swaplevel (this isn't possible in 100% of cases because levels can have no name/
Example output:
import pandas as pd
from pandas import DataFrame, MultiIndex
from numpy.random import randn
columns = MultiIndex.from_tuples([('A', 'cat', 'long'), ('B', 'cat', 'long'), ('A', 'dog', 'short'), ('B', 'dog', 'short')],
names=['exp', 'animal', 'hair_length'])
df = DataFrame(randn(4, 4), columns=columns)
df.columns.names = ['exp', 'animal', 1]
print(df.stack(level=['animal', 1]))
exp A B
animal 1
0 cat long 0.642593 -0.178835
dog short 0.532905 -0.255136
1 cat long 1.107472 0.374333
dog short 0.513399 0.176185
2 cat long 0.410521 -0.085423
dog short -0.305200 1.009517
3 cat long -1.772436 -0.819156
dog short -0.923430 0.143579
df.columns.names = ['exp', 'animal', 0]
print(df.stack(level=['animal', 0]))
exp A B
animal 0
0 cat long 0.642593 -0.178835
dog short 0.532905 -0.255136
1 cat long 1.107472 0.374333
dog short 0.513399 0.176185
2 cat long 0.410521 -0.085423
dog short -0.305200 1.009517
3 cat long -1.772436 -0.819156
dog short -0.923430 0.143579
df.columns.names = [0, 1, 2]
df.stack(level=[0,1])
Out[5]:
2 long short
0 1
0 A cat 0.642593 NaN
dog NaN 0.532905
B cat -0.178835 NaN
dog NaN -0.255136
1 A cat 1.107472 NaN
dog NaN 0.513399
B cat 0.374333 NaN
dog NaN 0.176185
2 A cat 0.410521 NaN
dog NaN -0.305200
B cat -0.085423 NaN
dog NaN 1.009517
3 A cat -1.772436 NaN
dog NaN -0.923430
B cat -0.819156 NaN
dog NaN 0.143579
df.stack(level=[0,2])
Out[6]:
1 cat dog
0 2
0 A long 0.642593 NaN
short NaN 0.532905
B long -0.178835 NaN
short NaN -0.255136
1 A long 1.107472 NaN
short NaN 0.513399
B long 0.374333 NaN
short NaN 0.176185
2 A long 0.410521 NaN
short NaN -0.305200
B long -0.085423 NaN
short NaN 1.009517
3 A long -1.772436 NaN
short NaN -0.923430
B long -0.819156 NaN
short NaN 0.143579
df.stack(level=[1,2])
Out[7]:
0 A B
1 2
0 cat long 0.642593 -0.178835
dog short 0.532905 -0.255136
1 cat long 1.107472 0.374333
dog short 0.513399 0.176185
2 cat long 0.410521 -0.085423
dog short -0.305200 1.009517
3 cat long -1.772436 -0.819156
dog short -0.923430 0.143579
# Out of order int level names
df.columns.names = [2, 0, 1]
df.stack(level=[0, 2])
Out[10]:
1 long short
0 2
0 cat A 0.642593 NaN
B -0.178835 NaN
dog A NaN 0.532905
B NaN -0.255136
1 cat A 1.107472 NaN
B 0.374333 NaN
dog A NaN 0.513399
B NaN 0.176185
2 cat A 0.410521 NaN
B -0.085423 NaN
dog A NaN -0.305200
B NaN 1.009517
3 cat A -1.772436 NaN
B -0.819156 NaN
dog A NaN -0.923430
B NaN 0.143579
df.stack(level=[1,2])
Out[11]:
0 cat dog
1 2
0 long A 0.642593 NaN
B -0.178835 NaN
short A NaN 0.532905
B NaN -0.255136
1 long A 1.107472 NaN
B 0.374333 NaN
short A NaN 0.513399
B NaN 0.176185
2 long A 0.410521 NaN
B -0.085423 NaN
short A NaN -0.305200
B NaN 1.009517
3 long A -1.772436 NaN
B -0.819156 NaN
short A NaN -0.923430
B NaN 0.143579
# Workaround the edge case where 0 is one of the column names,
# which interferes with trying to sort based on the first
# level
if 0 in this.columns.names:
isn't this the same? (e.g. if 0 is in the column names then the result is 0?)
We could use the same _convert_level_number() function to make sure we get names[0] in case of a potential conflict, if that's what you mean. This is the section that was raising the lexsort depth error in the original bug report, which is sort of separate from stacking the wrong levels.
Want me to replace the if/else with the _convert_level_number() function?
yes, try that
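A rough sketch of the idea behind converting a level number to a name before a names-first lookup, written in plain Python. The names `lookup_level` and `convert_level_number` are hypothetical stand-ins operating on a list of level names. The assumption that lookups check names before positions is a simplified model, and this is not the code from the PR.

```python
def lookup_level(key, names):
    """Simplified model of a names-first lookup (an assumption).

    Returns the position that `key` refers to: a matching level name
    takes priority, and only then is an int read as a position.
    """
    if key in names:
        return names.index(key)
    if isinstance(key, int):
        return key
    raise KeyError(key)


def convert_level_number(level_num, names):
    """Turn a level position into a key that survives a names-first lookup.

    Prefer the name stored at that position; fall back to the position
    itself only when the level is unnamed (None).
    """
    name = names[level_num]
    return level_num if name is None else name


names = [2, 0, 1]  # out-of-order int level names, as in the example output
print(lookup_level(0, names))                               # name 0 wins
print(lookup_level(convert_level_number(0, names), names))  # first level
```

The fallback for unnamed levels is where the approach cannot be fully safe: with names such as `[None, 0]`, position 0 has no name to convert to, so the bare 0 would still be read as the name of the second level.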
6a4f748 to 1e8a133
OK, using
Looks good! OK, give it a nice squash into a single commit and I think it's good to go.
Add test case for mixed type stacking
Used wrong var name in the assert
Method to swap levels assuming ints are level numbers
Fix _stack_multi_columns to deal with mixed strs/ints
Extra testcases
Add fix to the release notes
Convert to label before swaplevel if possible
Revert "Method to swap levels assuming ints are level numbers" This reverts commit 61f96fd3cb23cda9f9c7a6837b145ebd247a55cc.
More test cases
Use _convert_level_number() to sort columns
1e8a133 to 4ae90ae
Squashed and green, should be good to go.
@jorisvandenbossche looks good to me, merge when ready
BUG: Passing multiple levels to stack when having mixed integer/string level names (#8584)
@onesandzeroes Thanks a lot! @jreback about the integer level names issue we discussed above, I think it is a more general problem than this PR, as e.g. if you do
closes #8584
Stacking should now work as expected for the cases described in the bug report (#8584). So we can pass a list of both ints and strs to DataFrame.stack, provided the ints are level names. Mixed ints and strs when the ints aren't level names raises ValueError, but that's unchanged.

I've added a new hidden method _swaplevel_assume_numbers since the main cause of the bug was ambiguity between ints as names and ints as level numbers, and this allows us to make sure the swap is done correctly once we reach that point.

Another alternative would have been to add an as_numbers=True flag to the existing swaplevel method. We could keep existing behaviour intact by having the default be as_numbers=False. If that seems like the better option now that you've seen the PR, let me know and it should be pretty easy to switch.
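
To make the ambiguity concrete, here is a toy sketch of what an as_numbers flag could mean. It swaps entries in a plain list of level names rather than in a real MultiIndex. `swap_level_names` is a hypothetical function, and the names-first default is an assumption about the existing lookup behaviour, not code taken from pandas.

```python
def swap_level_names(names, i, j, as_numbers=False):
    """Swap two levels in a list of level names (toy model only).

    as_numbers=False: i and j are looked up as names first, and only
    treated as positions if no level has that name (assumed default).
    as_numbers=True: i and j are always treated as positions.
    """
    names = list(names)  # work on a copy

    def position(key):
        if not as_numbers and key in names:
            return names.index(key)
        return key

    # Resolve both positions before swapping anything.
    a, b = position(i), position(j)
    names[a], names[b] = names[b], names[a]
    return names


names = [2, 0, 1]
print(swap_level_names(names, 0, 2))                   # ints read as names
print(swap_level_names(names, 0, 2, as_numbers=True))  # ints read as positions
```

The two calls swap different pairs of levels for the same arguments, which is the ambiguity between ints as names and ints as level numbers described above.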