POC: DataFrame.stack not including NA rows #53756

rhshadrach · 2023-06-21T00:25:50Z

Demonstration for #53515

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.


rhshadrach · 2023-06-21T02:44:34Z

@jorisvandenbossche - there are some cases where DataFrame.stack sorts the result; I believe these are bugs. Other than this, the only tricky situation I've encountered is handling duplicates in the column that is being stacked. In this POC, I raise a ValueError instead. See #53761
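To make the duplicate-handling choice concrete, here is a minimal sketch of a check that rejects duplicate labels in the column level being stacked. The helper name `ensure_unique_labels` and the error message are hypothetical and are not taken from the PR; the sketch uses plain Python lists rather than pandas objects.

```python
from collections import Counter


def ensure_unique_labels(labels):
    """Hypothetical helper: raise if any label occurs more than once.

    Illustrative only; the name and the message are assumptions,
    not the code or wording used in the PR.
    """
    counts = Counter(labels)
    duplicates = [label for label, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"cannot stack duplicated column labels: {duplicates}")
    return list(labels)


# Unique labels pass through unchanged.
unique = ensure_unique_labels(["a", "b", "c"])
```

Raising early keeps the result unambiguous: with duplicated labels, several columns would map to the same entry in the stacked index.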


jorisvandenbossche · 2023-06-28T08:49:25Z

pandas/core/frame.py

+        ):
+            raise ValueError(
+                "level should contain all level names or all level "
+                "numbers, not a mixture of the two."


Is this a new restriction, or just moving up from the previous stack_multiple implementation?

(It certainly sounds like a good restriction, though.)

rhshadrach · 2023-06-29

No, this check already exists in the current implementation.
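For readers unfamiliar with the restriction, the following is a minimal sketch of the check under discussion: every entry of `level` must be a column level name, or every entry must be an integer position. The function name `validate_levels` and its return value are hypothetical; only the error message comes from the diff above.

```python
def validate_levels(level, column_names):
    """Hypothetical sketch of the all-names-or-all-numbers check."""
    by_name = all(lev in column_names for lev in level)
    by_number = all(isinstance(lev, int) for lev in level)
    if not (by_name or by_number):
        raise ValueError(
            "level should contain all level names or all level "
            "numbers, not a mixture of the two."
        )
    # Names take precedence when both interpretations are possible.
    return "names" if by_name else "numbers"


names = ["first", "second", "third"]
kind_by_name = validate_levels(["first", "third"], names)
kind_by_number = validate_levels([0, 2], names)
```

Passing a mixture such as `["first", 2]` raises the `ValueError`.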

jorisvandenbossche · 2023-06-28T08:54:57Z

pandas/core/reshape/reshape.py

+    # Construct the correct MultiIndex by combining the frame's index and
+    # stacked columns.


You might be able to go back to your original, simpler implementation now that MultiIndex.append / concat performance has been improved (#53697).
I didn't check whether your custom version here is still faster, but that PR implemented the improvement I suggested after profiling this stack use case.

rhshadrach · 2023-06-29

My guess is that constructing a MultiIndex once is faster than constructing many MultiIndex objects and concatenating them. In any case, the complexity really arose from getting this to pass our test suite. The tests we have for stack/unstack are quite thorough.

jorisvandenbossche · 2023-06-29

Yes, indeed. Just reusing the levels and repeating or tiling the existing codes should always be (at least a bit) faster than concatenating.
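The "repeat or tile the existing codes" idea can be sketched as follows. This is an illustration in plain Python, assuming a row-major result order (all stacked labels for the first row, then the second row, and so on); it is not the PR's implementation, and `stacked_index_codes` is a hypothetical name. With NumPy the same codes could be produced with `np.repeat` and `np.tile`.

```python
def stacked_index_codes(n_rows, n_cols):
    """Hypothetical sketch: integer codes for a stacked two-level index."""
    # Each row code is repeated once per stacked column label.
    row_codes = [r for r in range(n_rows) for _ in range(n_cols)]
    # The column codes are tiled once per row.
    col_codes = [c for _ in range(n_rows) for c in range(n_cols)]
    return row_codes, col_codes


row_levels = ["x", "y"]        # existing index labels, reused as-is
col_levels = ["a", "b", "c"]   # labels of the column level being stacked

row_codes, col_codes = stacked_index_codes(len(row_levels), len(col_levels))

# Decode the codes back into labels to inspect the resulting index.
index_tuples = [
    (row_levels[r], col_levels[c]) for r, c in zip(row_codes, col_codes)
]
```

Because only the integer codes are generated and the level labels are reused, the index is built once instead of being assembled from many smaller pieces.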

jorisvandenbossche · 2023-06-28T08:55:47Z

pandas/core/reshape/reshape.py

@@ -876,3 +881,110 @@ def _reorder_for_extension_array_stack(
    #  c0r1, c1r1, c2r1, ...]
    idx = np.arange(n_rows * n_columns).reshape(n_columns, n_rows).T.ravel()
    return arr.take(idx)
+
+
+def stack_v2(frame, level: list[int], dropna: bool = True, sort: bool = True):


The sort keyword is currently ignored in this implementation?

rhshadrach · 2023-06-29T03:23:51Z

Closing in favor of #53921. @jorisvandenbossche - happy to continue any discussion here, or move over to #53921.

rhshadrach added 10 commits June 12, 2023 18:37

WIP

848081c

WIP

fa46971

Merge branch 'main' of https://github.com/pandas-dev/pandas into poc_…

6cd80e9

…stack

FWV

bcf9eee

Refinements

07d4683

Ban duplicate values in columns when stacking

db02720

Some refactors

c298022

Some refactors

4781906

Some refactors

d4ca34c

POC: DataFrame.stack not including NA rows

a3d5632

rhshadrach added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label Jun 21, 2023

rhshadrach added 5 commits June 23, 2023 17:18

Merge branch 'main' of https://github.com/pandas-dev/pandas into poc_…

c0f3897

…stack

Merge branch 'poc_stack' of https://github.com/rhshadrach/pandas into…

28b6a52

… poc_stack Conflicts: pandas/core/reshape/reshape.py, pandas/tests/frame/test_stack_unstack.py

merge cleanup

5c62b3a

merge cleanup

2901557

cleanup

0f72696

jorisvandenbossche reviewed Jun 28, 2023


rhshadrach mentioned this pull request Jun 29, 2023

DEPR: DataFrame.stack on columns containing duplicate values #53761

Closed

rhshadrach closed this Jun 29, 2023

rhshadrach deleted the poc_stack branch September 27, 2023 21:00


