Skip to content

BUG: DataFrame.stack sometimes sorting the resulting index #53825

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 28, 2023

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 23, 2023
@rhshadrach rhshadrach added this to the 2.1 milestone Jun 23, 2023
@@ -498,7 +498,8 @@ Reshaping
- Bug in :meth:`DataFrame.idxmin` and :meth:`DataFrame.idxmax`, where the axis dtype would be lost for empty frames (:issue:`53265`)
- Bug in :meth:`DataFrame.merge` not merging correctly when having ``MultiIndex`` with single level (:issue:`52331`)
- Bug in :meth:`DataFrame.stack` losing extension dtypes when columns is a :class:`MultiIndex` and frame contains mixed dtypes (:issue:`45740`)
- Bug in :meth:`DataFrame.stack` sorting columns lexicographically (:issue:`53786`)
- Bug in :meth:`DataFrame.stack` sorting columns lexicographically in rare cases (:issue:`53786`)
- Bug in :meth:`DataFrame.stack` sorting index lexicographically in rare cases (:issue:`53824`)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are tons of tests for stacking not sorting the order; only one of them is impacted by this bug. I haven't been able to figure out a way to describe the circumstances this happens under.

… into stack_sort_bug_2

# Conflicts:
#	doc/source/whatsnew/v2.1.0.rst
#	pandas/core/reshape/reshape.py
@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 25, 2023

@mroeschke - this removes all uses of sort in DataFrame.stack. Looking at #15105 again, it appear to me the cause of all the issues was the unstack behavior. #53298 adding sort=True/False to unstack resolves that issue alone, and the use of sort in unstack argument controls the sorting of the values.

On the other hand, #53282 does not modify the sorting of the values (values here being the levels + codes combination), but rather just the sorting of the levels (the codes are then also adjusted so the values come out the same). This PR essentially only implements sort=False. So I think we can remove the sort argument from stack and still resolve #15105 with this PR. Would you be okay with me doing this here?

@mroeschke
Copy link
Member

So I think we can remove the sort argument from stack and still resolve #15105 with this PR. Would you be okay with me doing this here?

Hmm it would be nice for API consistency for stack/unstack to both have sort keywords. Would it be difficult to implement sort=True for stack?

@rhshadrach
Copy link
Member Author

rhshadrach commented Jun 26, 2023

Would it be difficult to implement sort=True for stack?

No - it's not. But I was leaning the other way: remove sort from unstack after changing it's default to False. To sort, it would just be a call to sort_index(axis=1) after. There is a perf impact here:

arrays = [np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)]
index = MultiIndex.from_arrays(arrays)
df = DataFrame(np.random.randn(10000, 4), index=index)

%timeit df.unstack(1, sort=True)
602 µs ± 5.38 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df.unstack(1, sort=False)
638 µs ± 2.78 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit df.unstack(1, sort=False).sort_index(axis=1)
1.04 ms ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In general I advocate for not having arguments when it's just an additional call, but perhaps the perf benefit is worth it here?

Somewhat like unstack, we can implement sort=True in stack more efficiently than having to call .sort_index(axis=1) after. If we're going this route, then we'd need to have sort=False as the default for stack, sort=True as the default for unstack, and align their defaults after a deprecation. I'd suggest sort=False as the default.

@mroeschke
Copy link
Member

In general I advocate for not having arguments when it's just an additional call, but perhaps the perf benefit is worth it here?

Yeah I agree with not having arguments when its an additional call, additionally I think the sort=False behavior is a more natural default behavior so I wouldn't mind moving toward an eventual removal of the keyword where we do not sort by default

@rhshadrach rhshadrach requested a review from mroeschke June 28, 2023 00:25
@mroeschke mroeschke merged commit 5307062 into pandas-dev:main Jun 28, 2023
@mroeschke
Copy link
Member

Nice thanks @rhshadrach

@rhshadrach rhshadrach deleted the stack_sort_bug_2 branch June 28, 2023 18:25
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
…v#53825)

* BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels

* revert ignoring warnings

* BUG: DataFrame.stack sorting columns

* whatsnew

* Docstring fixup

* Merge cleanup

* WIP

* BUG: DataFrame.stack sometimes sorting the resulting index

* mypy fixups

* Remove sort argument from DataFrame.stack
rhshadrach added a commit that referenced this pull request Jul 10, 2023
rhshadrach added a commit that referenced this pull request Jul 13, 2023
…54068)

Revert "BUG: DataFrame.stack sometimes sorting the resulting index (#53825)"

This reverts commit 5307062.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.stack sorting index values in rare cases
2 participants