-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: DataFrame.stack was sometimes sorting resulting index #53969
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
To check my understanding, using a small variation on your first example above with a MultiIndex columns, having both non-sorted level and codes in the MultiIndex level in the columns we are stacking:
With pandas 2.0.x, we get:
So in this case, the level are kept as is, but the codes (for the added level) are sorted (and thus in this case the result actually appears to be unsorted because of the way I constructed the original dataframe ..). With pandas main I currently get:
In this case, both the level order as codes order is left as is (i.e. not any kind of sorting), and thus preserves the order of the column level in the original dataframe. In practice, this also means that if your the codes were sorted in your original dataframe (which I think is probably the most common case?), regardless of the order of the labels in the level, nothing will change compared to the current behaviour on pandas 2.0.x? (since we were only sorting the codes, not level labels) Given the above, I think this is probably fine. Although, in theory, it's a case that we can relatively easily detect and raise a warning in advance of changing it? (if we would keep the actual change for 3.0, for example together with switching to the new implementation) One other observation: it seems that on main, we now also preserve the order of the remaining columns in the result (see |
Correct. I am suggesting we call these inconsistencies (sometimes sorting, sometimes not) bugs and can highlight them in the notable bug fixes. The other option would be to revert the changes and introduce deprecations. |
Friendly ping @mroeschke. @jorisvandenbossche - are you okay calling these bugs (see my previous comment)? |
I think I would agree to call inconsistent sorting a notable bug fix, especially since consistently not sorting seems to be a better desired behavior from users given the past issues |
I am OK with that. One more question: how much is the change in behaviour that is already present in main tied to the current default implementation? Or, asked differently, how hard would it be to only have the change in sorting in your new implementation? We will probably already add a keyword for that anyway (to opt in to that new implementation), and then switch to use that by default in 3.0. In theory this sorting behaviour change could be part of that v3 impl and upgrade path? |
None, they are entirely independent implementations.
That would work. I thought of doing this too, but was worried about the downside of replacing the implementation and having more significant behavior changes. Rethinking it, I'm now seeing this is better than potentially breaking code given the long standing, even if undesirable, behavior. |
#53825 removed the
sort
argument for DataFrame.stack that was added in #53282 due to #15105, the default beingsort=True
. Prior to #53825, I spent a while trying the implementation in #53921 (where I had implemented sort=True), and test after test was failing. Implementing insteadsort=False
made almost every test pass, and looking at a bunch of examples I concluded that sort wasn't actually sorting. It turns out this is not exactly the case.Note both #53282 and #53825 were added as part of 2.1 development, so below I'm comparing to / running on 2.0.x.
On 2.0.x, DataFrame.stack will not sort any part of the resulting index when the input's columns are not a MultiIndex. When the input columns are a MultiIndex, the result will sort the levels of the resulting index that come from the columns. In this case, the result will be sorted not necessarily in lexicographic order, but rather the order of
levels
that are in the input's columns. This is opposed to e.g.sort_index
which always sorts lexicographically rather than relying on the order inlevels.
I think we're okay with the behavior in main as is: it replaces the sometimes sorting behavior with a consistent never-sort behavior. From #15105, it seems like sort=False is the more desirable of the two; users can always sort the index after stacking if they so choose. However, I wanted to raise this issue now that I have a better understanding of the behavior on 2.0.x to make sure others agree.
If we are okay with the behavior, I'm going to make this change more prominent in the whatsnew.
Examples on 2.0.x
cc @mroeschke @jorisvandenbossche
The text was updated successfully, but these errors were encountered: