DEPR: DataFrame.stack on columns containing duplicate values #53761

rhshadrach · 2023-06-21T02:42:39Z

With MultiIndex columns, we get incorrect results:

import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)
print(df)
# l1  a  b  a
# l2  x  y  x
# 0   0  1  2
# 1   3  4  5
  
print(df.stack(0))
# l2    x    y
#   l1
# 0 a   0  NaN
#   b   2  1.0
# 1 a   3  NaN
#   b   5  4.0

In particular, the second (a, x) column of df holds the value 2 in row 0, and this value gets moved to the position indexed by ((0, b), x).

Taking the same example but with an Index gives a more reasonable result:

df = df.droplevel(1, axis=1)
print(df)
# l1  a  b  a
# 0   0  1  2
# 1   3  4  5

print(df.stack(0))
#    l1
# 0  a     0
#    b     1
#    a     2
# 1  a     3
#    b     4
#    a     5
# dtype: int64

However, I think this is still inconsistent with the rest of stack: it's the only case where the index level that comes from the columns contains duplicate values.

Accessing particular subsets of a DataFrame when the columns have duplicate values is fraught with difficulty. I think we should deprecate supporting duplicate values and in the future raise instead.
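
As a minimal sketch of what "raise instead" could look like from the user side today, the guard below rejects a frame whose column labels repeat before it is stacked. The helper name `check_unique_columns` and its error message are assumptions made for this illustration, not pandas API; only `Index.duplicated` and `Index.unique` are real pandas calls.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y"]],
    codes=[[0, 1, 0], [0, 1, 0]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)


def check_unique_columns(frame):
    """Hypothetical guard: raise if any column label occurs more than once."""
    # keep=False marks every occurrence of a repeated label, not only the repeats.
    dup_mask = frame.columns.duplicated(keep=False)
    if dup_mask.any():
        dups = frame.columns[dup_mask].unique().tolist()
        raise ValueError(f"Columns contain duplicate labels: {dups}")
    return frame


try:
    check_unique_columns(df).stack(0)
except ValueError as err:
    message = str(err)
    print(message)
```

Dropping the repeated column, for example `check_unique_columns(df.iloc[:, :2])`, passes the guard and returns the frame unchanged.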


rhshadrach · 2023-06-23T21:23:12Z

@mroeschke @phofl @jbrockmendel - does this seem like a reasonable restriction for using DataFrame.stack?

jorisvandenbossche · 2023-06-28T08:46:33Z

Would you only deprecate this for the MultiIndex case? Or for both cases?

And do you have any idea how hard it would be to fix the bug instead for the MultiIndex case?

(I am personally fine with deprecating it, but on the other hand there are many places in pandas where we support duplicate indices, even if we never recommend that users actually rely on them. So if it is fixable without too much complexity (there is no ambiguity in this case, so it could work), fixing it would also be more consistent.)

rhshadrach · 2023-06-29T00:25:23Z

For the new implementation (#53756), I worked at it for quite a bit before giving up. But I'd like to go back now and take another stab at it. It seems theoretically possible. I'll report back in a few days.

jorisvandenbossche · 2023-06-29T07:18:04Z

Looking at your new implementation (for the first example with a MultiIndex above), this might indeed not be that straightforward: in the end we are doing a concat([]) of the subsetted parts, and if those don't have exactly the same column labels, concat does a reindex, and reindexing doesn't support duplicate labels.
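
The reindexing limitation can be seen in isolation with a toy frame. This is only a sketch of the underlying restriction, not the code path that stack actually takes, and the exact exception type and message may differ between pandas versions, so both likely types are caught.

```python
import pandas as pd

# A frame whose column labels repeat ("a" appears twice).
df = pd.DataFrame([[0, 1, 2]], columns=["a", "b", "a"])

try:
    # The target labels differ from the current ones, so pandas has to
    # build an indexer, which it cannot do for duplicate labels.
    df.reindex(columns=["a", "b", "c"])
except (ValueError, pd.errors.InvalidIndexError) as err:
    outcome = f"{type(err).__name__}: {err}"
else:
    outcome = "no error"

print(outcome)
```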

rhshadrach · 2023-06-29T21:28:44Z

The idea I was playing around with before is to take each of the .loc results that need to be concat'ed, replace any duplicate column labels with distinct labels, concat, and then undo the replacement. But it was getting complicated and there were lots of other issues to be resolved at the time. I still think this may be a viable approach.
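
A rough sketch of that replace, concat, undo idea follows, assuming flat (non-MultiIndex) column labels. The helper names `occurrence_numbers` and `concat_allowing_duplicates` are hypothetical and this is not the code from #53756. Note that pairing repeats by their left-to-right position is itself a choice the sketch makes.

```python
import pandas as pd


def occurrence_numbers(labels):
    """Number each repeat of a label from left to right: 0, 1, 2, ..."""
    seen = {}
    numbers = []
    for label in labels:
        numbers.append(seen.get(label, 0))
        seen[label] = numbers[-1] + 1
    return numbers


def concat_allowing_duplicates(frames):
    """Hypothetical helper (not pandas API): concat frames with repeated columns."""
    relabeled = []
    for frame in frames:
        distinct = frame.copy()
        # Step 1: pair each label with its occurrence number so labels are unique.
        distinct.columns = pd.MultiIndex.from_arrays(
            [list(frame.columns), occurrence_numbers(frame.columns)]
        )
        relabeled.append(distinct)
    # Step 2: concat can now align on the unique (label, occurrence) pairs.
    result = pd.concat(relabeled)
    # Step 3: undo the replacement by dropping the occurrence numbers.
    result.columns = result.columns.get_level_values(0)
    return result


first = pd.DataFrame([[0, 2, 3]], columns=["x", "x", "y"])
second = pd.DataFrame([[1, 7]], columns=["x", "y"])
combined = concat_allowing_duplicates([first, second])
print(combined)
```

The second frame has only one x column, so its row gets a missing value under the second x.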

rhshadrach · 2023-07-22T17:57:14Z

@jorisvandenbossche @mroeschke I'm looking more into this issue. What's the expected output here:

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y", "z"]],
    codes=[[0, 1, 0, 0, 0], [0, 0, 0, 1, 1]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(10).reshape((2, 5)), columns=columns)

print(df)
# l1  a  b  a      
# l2  x  x  x  y  y
# 0   0  1  2  3  4
# 1   5  6  7  8  9

df.stack(0)  # ?? This currently raises on main

The index should consist of the 0s and 1s from the input's index along with the labels a and b. However, we need to make a choice as to how the remaining column labels (x and y) appear. When we remove the duplicates, we get good behavior:

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y", "z"]],
    codes=[[0, 1, 0], [0, 0, 1]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=columns)

print(df)
# l1  a  b  a
# l2  x  x  y
# 0   0  1  2
# 1   3  4  5

print(df.stack(0))
# l2    x    y
#   l1        
# 0 a   0  2.0
#   b   1  NaN
# 1 a   3  5.0
#   b   4  NaN

It seems to me the output in the duplicate case is ambiguous: for the two (a, x) columns and the two (a, y) columns, which x goes with which y? I think the ambiguities get more treacherous the more levels you have, which makes me lean toward not allowing duplicates in the first place.
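
One way a user could resolve the ambiguity themselves, sketched here under the assumption that repeats should be paired left to right, is to add an explicit occurrence level so that the columns become unique before stacking. The level name "dup" is made up for this illustration.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex(
    levels=[["a", "b"], ["x", "y", "z"]],
    codes=[[0, 1, 0, 0, 0], [0, 0, 0, 1, 1]],
    names=["l1", "l2"],
)
df = pd.DataFrame(np.arange(10).reshape((2, 5)), columns=columns)

# Number each repeat of an (l1, l2) pair from left to right: 0, 1, ...
occurrence = df.columns.to_frame(index=False).groupby(["l1", "l2"]).cumcount()

# Append the occurrence number as a third level; "dup" is a hypothetical name.
explicit = df.copy()
explicit.columns = pd.MultiIndex.from_arrays(
    [
        df.columns.get_level_values("l1"),
        df.columns.get_level_values("l2"),
        occurrence.to_numpy(),
    ],
    names=["l1", "l2", "dup"],
)

# The columns are now unique, so stacking l1 no longer has to guess a pairing.
stacked = explicit.stack("l1")
print(stacked)
```

The pairing is then visible in the remaining (l2, dup) column labels rather than being decided implicitly inside stack.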

mroeschke · 2023-07-25T20:49:19Z

For the duplicates case, if we allow the stack operation to return duplicate labels, then an output could be:

In [2]: index = pd.MultiIndex.from_arrays([[0, 0, 1, 1], ["a", "b"] * 2], names=[None, "l1"])

In [3]: columns = pd.Index(["x", "x", "y", "y"], name="l2")

In [4]: df = pd.DataFrame([[0, 2, 3, 4], [1, 1, float("nan"), float("nan")], [5, 7, 8, 9], [6, 6, float("nan"), float("nan")]], index=index, columns=columns)

In [5]: df
Out[5]: 
l2    x  x    y    y
  l1                
0 a   0  2  3.0  4.0
  b   1  1  NaN  NaN
1 a   5  7  8.0  9.0
  b   6  6  NaN  NaN

The order here is subjective with respect to which (a, x) and (a, y) values appear first (I went left to right).

rhshadrach · 2023-07-25T21:18:52Z

I think this would work well if we have a sort_columns argument. However, without such an argument, the ordering behavior would differ depending on whether duplicates exist, which seems too magical.

If we go the sort_columns route, I'd suggest defaulting to False and raising when False and there are duplicate columns. We'd only support stacking with duplicate columns when sort_columns is True.

I'd still lean toward just not supporting duplicate columns, but am happy to try the sort_columns route.

mroeschke · 2023-07-25T21:57:00Z

Yeah I just noted it as an example result, but I agree with not supporting duplicate columns in this case

rhshadrach · 2023-08-03T01:43:10Z

Closed by #53921

rhshadrach added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), Deprecate (Functionality to remove in pandas), and Needs Discussion (Requires discussion from core team before further action) labels Jun 21, 2023

rhshadrach mentioned this issue Jun 21, 2023 in POC: DataFrame.stack not including NA rows #53756 (Closed)

rhshadrach changed the title ~~DEPR: DataFrame.stack with on columns containing duplicate values~~ DEPR: DataFrame.stack on columns containing duplicate values Jun 23, 2023

rhshadrach mentioned this issue Jul 31, 2023 in ENH: Add new implementation of DataFrame.stack #53921 (Merged)

rhshadrach closed this as completed Aug 3, 2023
