BUG: DataFrame.stack sometimes sorting the resulting index #53825

rhshadrach · 2023-06-23T21:12:22Z

closes BUG: DataFrame.stack sorting index values in rare cases #53824
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

rhshadrach · 2023-06-23T21:14:42Z

doc/source/whatsnew/v2.1.0.rst

@@ -498,7 +498,8 @@ Reshaping
 - Bug in :meth:`DataFrame.idxmin` and :meth:`DataFrame.idxmax`, where the axis dtype would be lost for empty frames (:issue:`53265`)
 - Bug in :meth:`DataFrame.merge` not merging correctly when having ``MultiIndex`` with single level (:issue:`52331`)
 - Bug in :meth:`DataFrame.stack` losing extension dtypes when columns is a :class:`MultiIndex` and frame contains mixed dtypes (:issue:`45740`)
- Bug in :meth:`DataFrame.stack` sorting columns lexicographically (:issue:`53786`)
+- Bug in :meth:`DataFrame.stack` sorting columns lexicographically in rare cases (:issue:`53786`)
+- Bug in :meth:`DataFrame.stack` sorting index lexicographically in rare cases (:issue:`53824`)


There are tons of tests for stacking not sorting the order; only one of them is impacted by this bug. I haven't been able to figure out a way to describe the circumstances this happens under.

rhshadrach · 2023-06-25T17:40:02Z

@mroeschke - this removes all uses of sort in DataFrame.stack. Looking at #15105 again, it appears to me that the cause of all the issues was the unstack behavior. #53298, which added sort=True/False to unstack, resolves that issue alone, and the sort argument in unstack controls the sorting of the values.

On the other hand, #53282 does not modify the sorting of the values (values here being the levels + codes combination), but rather just the sorting of the levels (the codes are then also adjusted so the values come out the same). This PR essentially only implements sort=False. So I think we can remove the sort argument from stack and still resolve #15105 with this PR. Would you be okay with me doing this here?
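To illustrate the distinction between sorting the levels and sorting the values, here is a minimal sketch using two hand-built MultiIndex objects. The variable names and data are illustrative assumptions and are not taken from the PR. Both objects hold the same values in the same order; only the stored order of the first level (and the codes that point into it) differs.

```python
import pandas as pd

# First level stored unsorted: "b" comes before "a".
unsorted_levels = pd.MultiIndex(
    levels=[["b", "a"], [1, 2]],
    codes=[[0, 1], [0, 1]],
)

# First level stored sorted; the codes are adjusted to compensate.
sorted_levels = pd.MultiIndex(
    levels=[["a", "b"], [1, 2]],
    codes=[[1, 0], [0, 1]],
)

# The stored levels differ ...
print(list(unsorted_levels.levels[0]), list(sorted_levels.levels[0]))

# ... but the values (levels + codes combined) come out the same.
same_values = list(unsorted_levels) == list(sorted_levels)
print(same_values)
```

In this sketch, changing how the levels are stored leaves the order of the rows untouched, which is the sense in which sorting levels and sorting values are separate concerns.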

mroeschke · 2023-06-26T18:01:56Z

So I think we can remove the sort argument from stack and still resolve #15105 with this PR. Would you be okay with me doing this here?

Hmm it would be nice for API consistency for stack/unstack to both have sort keywords. Would it be difficult to implement sort=True for stack?

rhshadrach · 2023-06-26T20:53:38Z

Would it be difficult to implement sort=True for stack?

No - it's not. But I was leaning the other way: remove sort from unstack after changing its default to False. To sort, it would just be a call to sort_index(axis=1) after. There is a perf impact here:

import numpy as np
from pandas import DataFrame, MultiIndex

# 10,000 rows; the second index level is rolled so it is not in sorted order
arrays = [np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)]
index = MultiIndex.from_arrays(arrays)
df = DataFrame(np.random.randn(10000, 4), index=index)

%timeit df.unstack(1, sort=True)
602 µs ± 5.38 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df.unstack(1, sort=False)
638 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df.unstack(1, sort=False).sort_index(axis=1)
1.04 ms ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In general I advocate for not having arguments when it's just an additional call, but perhaps the perf benefit is worth it here?

Somewhat like unstack, we can implement sort=True in stack more efficiently than having to call .sort_index(axis=1) after. If we're going this route, then we'd need to have sort=False as the default for stack, sort=True as the default for unstack, and align their defaults after a deprecation. I'd suggest sort=False as the default.
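The "additional call" alternative mentioned above can be sketched on a tiny frame. This is only an illustration under stated assumptions: the data is made up, and the unsorted result is emulated by reordering columns by hand so that the example does not depend on the sort keyword of unstack being available in the installed pandas version.

```python
import pandas as pd

# Second index level appears in the order "b", "a".
index = pd.MultiIndex.from_arrays([[0, 0, 1, 1], ["b", "a", "b", "a"]])
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]}, index=index)

# Default unstack: the new column level comes out sorted ("a" before "b").
wide_sorted = df.unstack(1)

# Emulate an unsorted result by putting the columns in order of appearance.
wide_unsorted = wide_sorted[[("x", "b"), ("x", "a")]]

# Sorting afterwards is a single extra call on the column axis.
resorted = wide_unsorted.sort_index(axis=1)

print(list(wide_unsorted.columns))
print(list(resorted.columns) == list(wide_sorted.columns))
```

The extra call yields the same column order as the sorted unstack, at the cost of a second pass over the result, which is the trade-off the timings above are measuring.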

mroeschke · 2023-06-26T23:10:45Z

In general I advocate for not having arguments when it's just an additional call, but perhaps the perf benefit is worth it here?

Yeah, I agree with not having arguments when it's an additional call. Additionally, I think the sort=False behavior is a more natural default, so I wouldn't mind moving toward an eventual removal of the keyword where we do not sort by default.

mroeschke · 2023-06-28T16:10:29Z

Nice thanks @rhshadrach

rhshadrach added 10 commits June 12, 2023 22:22

BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels

793200b

revert ignoring warnings

d524f1d

BUG: DataFrame.stack sorting columns

24426fa

whatsnew

c333189

Merge branch 'stack_sort_bug' of https://github.com/rhshadrach/pandas …

8b4554d

…into stack_sort_bug # Conflicts: # doc/source/whatsnew/v2.1.0.rst

Merge branch 'main' of https://github.com/pandas-dev/pandas into stac…

7f01a9b

…k_sort_bug

Docstring fixup

739af16

Merge cleanup

9be5486

WIP

2ace984

BUG: DataFrame.stack sometimes sorting the resulting index

3ffb378

rhshadrach added the "Bug" and "Reshaping" (Concat, Merge/Join, Stack/Unstack, Explode) labels Jun 23, 2023

rhshadrach added this to the 2.1 milestone Jun 23, 2023

rhshadrach added 2 commits June 23, 2023 18:09

mypy fixups

1f7dc61

Merge branch 'stack_sort_bug_2' of https://github.com/rhshadrach/pandas…

3abd2e9

… into stack_sort_bug_2 # Conflicts: # doc/source/whatsnew/v2.1.0.rst # pandas/core/reshape/reshape.py

rhshadrach added 2 commits June 26, 2023 22:31

Merge branch 'main' of https://github.com/pandas-dev/pandas into stac…

9601a8f

…k_sort_bug_2

Remove sort argument from DataFrame.stack

26a68de

rhshadrach requested a review from mroeschke June 28, 2023 00:25

mroeschke approved these changes Jun 28, 2023

mroeschke merged commit 5307062 into pandas-dev:main Jun 28, 2023

rhshadrach deleted the stack_sort_bug_2 branch June 28, 2023 18:25

rhshadrach mentioned this pull request Jul 2, 2023

REGR: DataFrame.stack was sometimes sorting resulting index #53969

Closed

jorisvandenbossche mentioned this pull request Jul 4, 2023

DEPR: sort=True in DataFrame.unstack and Series.unstack #53915

Open

rhshadrach added a commit that referenced this pull request Jul 10, 2023

Revert "BUG: DataFrame.stack sometimes sorting the resulting index (#…

77a49aa

…53825)" This reverts commit 5307062.

rhshadrach mentioned this pull request Jul 10, 2023

Revert "BUG: DataFrame.stack sometimes sorting the resulting index" #54068

Merged

rhshadrach added a commit that referenced this pull request Jul 13, 2023

Revert "BUG: DataFrame.stack sometimes sorting the resulting index" (#…

9372d21

…54068) Revert "BUG: DataFrame.stack sometimes sorting the resulting index (#53825)" This reverts commit 5307062.
