-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
POC: DataFrame.stack not including NA rows #53756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jorisvandenbossche - there are some cases where DataFrame.stack sorts the result; I believe these are bugs. Other than this, the only tricky situation I've encountered is handling duplicates in the column that is being stacked. In this POC, I raise a ValueError instead. See #53761 |
… poc_stack � Conflicts: � pandas/core/reshape/reshape.py � pandas/tests/frame/test_stack_unstack.py
): | ||
raise ValueError( | ||
"level should contain all level names or all level " | ||
"numbers, not a mixture of the two." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a new restriction, or just moving up from the previous stack_multiple
implementation?
(it certainly sounds as a good restriction, though)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No - this already exists in the current implementation.
# Construct the correct MultiIndex by combining the frame's index and | ||
# stacked columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to go back to your original simpler implementation, now that MultiIndex.append / concat performance has been improved (#53697).
Didn't check if your custom version here is still faster though, but that PR implemented my suggested improvement from profiling this stack use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have to guess that constructing a MultiIndex once is faster than constructing a MultiIndex many times and concat'ing them together. In any case, the complexity really arose from getting this to pass our suite of tests. The tests we have for stack/unstack are quite thorough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed, just reusing the levels and repeating or tiling the existing codes should always be (at least a bit) faster than concatting
@@ -876,3 +881,110 @@ def _reorder_for_extension_array_stack( | |||
# c0r1, c1r1, c2r1, ...] | |||
idx = np.arange(n_rows * n_columns).reshape(n_columns, n_rows).T.ravel() | |||
return arr.take(idx) | |||
|
|||
|
|||
def stack_v2(frame, level: list[int], dropna: bool = True, sort: bool = True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The sort
keyword is currently ignored in this implementation?
Closing in favor of #53921. @jorisvandenbossche - happy to continue any discussion here, or move over to #53921. |
Demonstration for #53515
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.