DEPR: DataFrame.stack on columns containing duplicate values #53761

Closed
rhshadrach opened this issue Jun 21, 2023 · 10 comments
Labels
Bug Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@rhshadrach
Member

rhshadrach commented Jun 21, 2023

With MultiIndex columns, we get incorrect results

import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
print(df)
# l1  a  b  a
# l2  x  y  x
# 0   0  1  2
# 1   3  4  5
  
print(df.stack(0))
# l2    x    y
#   l1
# 0 a   0  NaN
#   b   2  1.0
# 1 a   3  NaN
#   b   5  4.0

In particular, one of the two values of df at row 0 under the column (a, x) is 2, and it gets moved to the position indexed by ((0, b), x).

Taking the same example but with an Index gives a more reasonable result:

df = df.droplevel(1, axis=1)
print(df)
# l1  a  b  a
# 0   0  1  2
# 1   3  4  5

print(df.stack(0))
#    l1
# 0  a     0
#    b     1
#    a     2
# 1  a     3
#    b     4
#    a     5
# dtype: int64

However, I think this is still inconsistent with the rest of stack: it's the only case where the values in the index coming from the columns have duplicate values.

Accessing particular subsets of a DataFrame when the columns have duplicate values is fraught with difficulty. I think we should deprecate supporting duplicate values and in the future raise instead.
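To illustrate what "raise instead" could look like from the caller's side, here is a minimal sketch. The helper name `stack_unique` and the error message are hypothetical and are not part of pandas; the sketch only relies on `Index.is_unique` and `Index.duplicated`.

```python
import numpy as np
import pandas as pd


def stack_unique(df, level=-1):
    """Hypothetical helper: stack only when every column label is unique."""
    if not df.columns.is_unique:
        # Collect each repeated label once for the error message.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"columns contain duplicate labels: {dupes}")
    return df.stack(level)


columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

message = None
try:
    stack_unique(df, 0)
except ValueError as err:
    message = str(err)

print(message)
```

The check runs before any reshaping, so the duplicated (a, x) column is reported instead of being silently misplaced.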

@rhshadrach rhshadrach added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action labels Jun 21, 2023
@rhshadrach
Member Author

rhshadrach commented Jun 23, 2023

@mroeschke @phofl @jbrockmendel - does this seem like a reasonable restriction for using DataFrame.stack?

@rhshadrach rhshadrach changed the title DEPR: DataFrame.stack with on columns containing duplicate values DEPR: DataFrame.stack on columns containing duplicate values Jun 23, 2023
@jorisvandenbossche
Member

Would you only deprecate this for the MultiIndex case? Or for both cases?

And do you have any idea how hard it would be to fix the bug instead for the MultiIndex case?

(I am personally fine with deprecating it, but on the other hand there are many places in pandas where we support duplicate indices, even though we never recommend that users rely on this. So if it is fixable without too much complexity (there is no ambiguity in this case, so it could work), fixing it would also be more consistent.)

@rhshadrach
Member Author

For the new implementation (#53756), I worked at it for quite a bit before giving up. But I'd like to go back now and take another stab at it. It seems theoretically possible. I'll report back in a few days.

@jorisvandenbossche
Member

Looking at your new implementation (for the first example with a MultiIndex above), this might indeed not be that straightforward: in the end we are doing a concat([]) of the subsetted parts, and if those don't have exactly the same column labels, concat does a reindex, and reindexing doesn't support duplicate labels.
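The reindexing limitation can be reproduced outside of stack. This is a small sketch with made-up frames (`left`, `right` are illustrative names, not taken from the implementation in #53756); the exact exception type may differ between pandas versions, so it is caught broadly here.

```python
import pandas as pd

left = pd.DataFrame([[0, 2]], columns=["x", "x"])   # duplicate label "x"
right = pd.DataFrame([[1, 9]], columns=["x", "y"])  # different set of labels

# Identical column labels need no reindex, so duplicates are tolerated.
same = pd.concat([left, left])

# Differing column labels force a reindex, which rejects duplicates.
try:
    pd.concat([left, right])
    error = None
except Exception as exc:  # exact exception type may vary by pandas version
    error = exc

print(same.shape)
print(type(error).__name__, error)
```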

@rhshadrach
Member Author

The idea I was playing around with before is to take each of the .loc results that need to be concat'ed, replace any duplicate column labels with distinct labels, concat, and then undo the replacement. But it was getting complicated and there were lots of other issues to be resolved at the time. I still think this may be a viable approach.
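A rough sketch of that relabel, concat, and restore idea, assuming flat (non-MultiIndex) column labels. The function names `unique_labels` and `concat_relabelled` are hypothetical, and this is not the code from the pandas implementation; it only shows the shape of the approach.

```python
from collections import defaultdict

import pandas as pd


def unique_labels(labels):
    """Pair each label with its occurrence number so every pair is distinct."""
    seen = defaultdict(int)
    pairs = []
    for label in labels:
        pairs.append((label, seen[label]))
        seen[label] += 1
    return pairs


def concat_relabelled(frames):
    """Concat frames whose columns may repeat, then restore the labels."""
    relabelled = []
    for frame in frames:
        tmp = frame.copy()
        # Temporary two-level columns: (original label, occurrence number).
        tmp.columns = pd.MultiIndex.from_tuples(unique_labels(frame.columns))
        relabelled.append(tmp)
    result = pd.concat(relabelled)
    # Undo the replacement by dropping the occurrence number.
    result.columns = result.columns.get_level_values(0)
    return result


left = pd.DataFrame([[0, 2], [5, 7]], columns=["x", "x"])
right = pd.DataFrame([[1, 3], [6, 8]], columns=["x", "y"])
out = concat_relabelled([left, right])
print(out)
```

Matching duplicates by occurrence number is itself a design choice: the first "x" in one frame is lined up with the first "x" in the other, which may or may not be what a caller expects.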

@rhshadrach
Member Author

rhshadrach commented Jul 22, 2023

@jorisvandenbossche @mroeschke I'm looking more into this issue. What's the expected output here:

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y", "z"]],
    codes=[[0, 1, 0, 0, 0], [0, 0, 0, 1, 1]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(10).reshape((2, 5)), columns=columns)

print(df)
# l1  a  b  a      
# l2  x  x  x  y  y
# 0   0  1  2  3  4
# 1   5  6  7  8  9

df.stack(0)  # ?? This currently raises on main

The index should be the 0s and 1s from the input's index along with the a and b labels. However, we need to make a choice as to how the column labels that remain (x and y) appear. When we remove the duplicates, we get good behavior:

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y", "z"]],
    codes=[[0, 1, 0], [0, 0, 1]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

print(df)
# l1  a  b  a
# l2  x  x  y
# 0   0  1  2
# 1   3  4  5

print(df.stack(0))
# l2    x    y
#   l1        
# 0 a   0  2.0
#   b   1  NaN
# 1 a   3  5.0
#   b   4  NaN

It seems to me the output in the duplicate case is ambiguous - for the two (a, x) and (a, y) columns, which x goes with which y? I think the ambiguities get more treacherous the more levels you have, which makes me lean toward not allowing duplicates in the first place.

@mroeschke
Member

For the duplicates case, if we allow the stack operation to return duplicate labels, then an output could be:

In [2]: index = pd.MultiIndex.from_arrays([[0, 0, 1, 1], ["a", "b"] * 2], names=[None, "l1"])

In [3]: columns = pd.Index(["x", "x", "y", "y"], name="l2")

In [4]: df = pd.DataFrame([[0, 2, 3, 4], [1, 1, float("nan"), float("nan")], [5, 7, 8, 9], [6, 6, float("nan"), float("nan")]], index=index, columns=columns)

In [5]: df
Out[5]: 
l2    x  x    y    y
  l1                
0 a   0  2  3.0  4.0
  b   1  1  NaN  NaN
1 a   5  7  8.0  9.0
  b   6  6  NaN  NaN

The ordering here is subjective with respect to which (a, x) and (a, y) values appear first (I went left to right).

@rhshadrach
Member Author

rhshadrach commented Jul 25, 2023

I think this would work well if we have a sort_columns argument. However, without such an argument, the ordering behavior would differ depending on whether duplicates exist, which seems too magical.

If we go the sort_columns route, I'd suggest defaulting to False and raising when False and there are duplicate columns. We'd only support stacking with duplicate columns when sort_columns is True.

I'd still lean toward just not supporting duplicate columns, but am happy to try the sort_columns route.

@mroeschke
Member

Yeah I just noted it as an example result, but I agree with not supporting duplicate columns in this case

@rhshadrach
Member Author

Closed by #53921
