POC: DataFrame.stack not including NA rows #53756

Closed
rhshadrach wants to merge 15 commits

Conversation

rhshadrach
Member

Demonstration for #53515

@rhshadrach
Member Author

@jorisvandenbossche - there are some cases where DataFrame.stack sorts the result; I believe these are bugs. Other than this, the only tricky situation I've encountered is handling duplicates in the column that is being stacked. In this POC, I raise a ValueError instead. See #53761

rhshadrach added the "Reshaping" label (Concat, Merge/Join, Stack/Unstack, Explode) on Jun 21, 2023
    ):
        raise ValueError(
            "level should contain all level names or all level "
            "numbers, not a mixture of the two."
Member

Is this a new restriction, or just moving up from the previous stack_multiple implementation?

(it certainly sounds like a good restriction, though)

Member Author

No - this already exists in the current implementation.
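
For readers without the diff at hand, the restriction discussed in this thread can be sketched in plain Python. This is only an illustration: the helper name `check_levels_uniform` is hypothetical, and the rule "an `int` is a level number, anything else is a level name" is a simplifying assumption rather than how pandas resolves levels.

```python
def check_levels_uniform(level):
    """Raise if `level` mixes level numbers and level names.

    Hypothetical helper: every int counts as a level number and
    everything else as a level name (a simplification).
    """
    is_number = [
        isinstance(lev, int) and not isinstance(lev, bool) for lev in level
    ]
    # A mixture means some entries are numbers but not all of them.
    if any(is_number) and not all(is_number):
        raise ValueError(
            "level should contain all level names or all level "
            "numbers, not a mixture of the two."
        )
    return level
```

Under these assumptions, `check_levels_uniform([0, 1])` and `check_levels_uniform(["a", "b"])` pass through unchanged, while `check_levels_uniform([0, "a"])` raises.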

Comment on lines +948 to +949
# Construct the correct MultiIndex by combining the frame's index and
# stacked columns.
Member

You might be able to go back to your original simpler implementation, now that MultiIndex.append / concat performance has been improved (#53697).
I didn't check whether your custom version here is still faster, though; that PR implemented the improvement I suggested after profiling this stack use case.

Member Author

I would have to guess that constructing a MultiIndex once is faster than constructing a MultiIndex many times and concat'ing them together. In any case, the complexity really arose from getting this to pass our suite of tests. The tests we have for stack/unstack are quite thorough.

Member

Yes, indeed: reusing the levels and repeating or tiling the existing codes should always be at least a bit faster than concatenating.
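
To illustrate the repeat-and-tile idea: when every column is stacked under every row, the codes of the result can be derived directly from the existing codes, with no per-column index to build and concatenate. The following is a plain-Python sketch, not the PR's code; `stacked_codes` is a hypothetical name, and it assumes a single-level row index, a single stacked column level, and no dropped rows.

```python
def stacked_codes(row_codes, col_codes):
    """Return (row codes, column codes) for the stacked result.

    Hypothetical helper: each row code is repeated once per column,
    and the full run of column codes is tiled once per row.
    """
    n_rows, n_cols = len(row_codes), len(col_codes)
    repeated_rows = [code for code in row_codes for _ in range(n_cols)]
    tiled_cols = list(col_codes) * n_rows
    return repeated_rows, tiled_cols


rows, cols = stacked_codes([0, 1], [0, 1, 2])
# rows -> [0, 0, 0, 1, 1, 1]
# cols -> [0, 1, 2, 0, 1, 2]
```

With NumPy arrays the same two steps would be `np.repeat(row_codes, n_cols)` and `np.tile(col_codes, n_rows)`.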

@@ -876,3 +881,110 @@ def _reorder_for_extension_array_stack(
    #  c0r1, c1r1, c2r1, ...]
    idx = np.arange(n_rows * n_columns).reshape(n_columns, n_rows).T.ravel()
    return arr.take(idx)


def stack_v2(frame, level: list[int], dropna: bool = True, sort: bool = True):
Member

The sort keyword is currently ignored in this implementation?
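
As background for the context lines of this hunk, the expression `np.arange(n_rows * n_columns).reshape(n_columns, n_rows).T.ravel()` builds the positions that turn an array laid out column by column into row-by-row order. A plain-Python equivalent is sketched below; the name `interleave_index` is hypothetical and used only for illustration.

```python
def interleave_index(n_rows, n_columns):
    """Positions that reorder a column-by-column array row by row.

    Hypothetical helper: the element for row r and column c sits at
    position c * n_rows + r in the column-by-column layout.
    """
    return [c * n_rows + r for r in range(n_rows) for c in range(n_columns)]


column_major = ["c0r0", "c0r1", "c1r0", "c1r1", "c2r0", "c2r1"]
row_major = [column_major[i] for i in interleave_index(2, 3)]
# row_major -> ["c0r0", "c1r0", "c2r0", "c0r1", "c1r1", "c2r1"]
```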

@rhshadrach
Member Author

Closing in favor of #53921. @jorisvandenbossche - happy to continue any discussion here, or move over to #53921.

rhshadrach closed this on Jun 29, 2023
rhshadrach deleted the poc_stack branch on September 27, 2023, 21:00
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
2 participants