DEPR: DataFrame.stack on columns containing duplicate values #53761
Comments
@mroeschke @phofl @jbrockmendel - does this seem like a reasonable restriction for using DataFrame.stack?
Would you deprecate this only for the MultiIndex case, or for both cases? And do you have any idea how hard it would be to fix the bug for the MultiIndex case instead? (I am personally fine with deprecating it, but on the other hand there are many places in pandas where we support duplicate indices, even though we never recommend that users rely on them. So if it is fixable without too much complexity (there is no ambiguity in this case, so it could work), fixing it could also be more consistent.)
For the new implementation (#53756), I worked at it for quite a bit before giving up. But I'd like to go back now and take another stab at it. It seems theoretically possible. I'll report back in a few days.
Looking at your new implementation (for the first example with a MultiIndex above), this might indeed not be that straightforward: in the end we are doing a |
The idea I was playing around with before is to take each of the |
@jorisvandenbossche @mroeschke I'm looking more into this issue. What's the expected output here:
The index should be 0s and 1s from the input's index along with
It seems to me the output in the duplicate case is ambiguous - for the two (a, x) and (a, y) columns, which x goes with which y? I think the ambiguities get more treacherous the more levels you have, which makes me lean toward not allowing duplicates in the first place.
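To make the ambiguity concrete, here is a minimal pure-Python sketch. It does not call pandas; the labels and values are made up for illustration, and it assumes a single row with two columns labelled (a, x) and two labelled (a, y). It simply enumerates the ways the x values could be matched to the y values.

```python
from itertools import permutations

# Assumed toy row (not taken from the issue): duplicate column labels.
labels = [("a", "x"), ("a", "x"), ("a", "y"), ("a", "y")]
values = [1, 2, 3, 4]

# Collect the values stored under each duplicated label, left to right.
xs = [v for lab, v in zip(labels, values) if lab == ("a", "x")]
ys = [v for lab, v in zip(labels, values) if lab == ("a", "y")]

# Every way of matching an x value with a y value is a candidate result.
pairings = [list(zip(xs, perm)) for perm in permutations(ys)]

for pairing in pairings:
    print(pairing)
```

The first pairing matches occurrences positionally and the second crosses them. Nothing in the labels themselves picks one over the other, and the number of candidates grows factorially with the number of duplicates.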
For the duplicates case, if we allow the
The order here is subjective with respect to which (a, x) and (a, y) values appear first (I went left to right).
I think this would work well if we have a If we go the I'd still lean toward just not supporting duplicate columns, but am happy to try the |
Yeah, I just noted it as an example result, but I agree with not supporting duplicate columns in this case.
Closed by #53921
With MultiIndex columns, we get incorrect results
In particular, the value of df indexed by (0, (a, x)) is 2, and this gets moved to the value indexed by ((0, b), x).
Taking the same example but with an Index gives a more reasonable result:
However, I think this is still inconsistent with the rest of stack: it's the only case where the values in the index coming from the columns have duplicate values.
Accessing particular subsets of a DataFrame when the columns have duplicate values is fraught with difficulty. I think we should deprecate supporting duplicate values and in the future raise instead.
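As a rough sketch of what "raise instead" could look like from the user's side, the snippet below checks for duplicate column labels before stacking. The helper name `stack_unique` and the error message are hypothetical and are not part of pandas; only `Index.is_unique`, `Index.duplicated`, and `DataFrame.stack` are existing pandas API. Depending on the pandas version, `DataFrame.stack` itself may emit a FutureWarning here.

```python
import pandas as pd


def stack_unique(df: pd.DataFrame):
    """Hypothetical helper: stack only when the column labels are unique."""
    if not df.columns.is_unique:
        # Report each duplicated label once.
        dupes = df.columns[df.columns.duplicated()].unique().tolist()
        raise ValueError(f"cannot stack: duplicate column labels {dupes}")
    return df.stack()


# Unique columns: stacking proceeds as usual.
unique = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
stacked = stack_unique(unique)

# Duplicate columns: the helper refuses instead of returning a result
# whose index would contain duplicate values.
dup = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"])
try:
    stack_unique(dup)
    raised = False
except ValueError:
    raised = True
```

The same check works for MultiIndex columns, since `is_unique` and `duplicated` compare whole label tuples.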