BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex. #8855

seth-p · 2014-11-19T07:09:27Z

Fixes DataFrame.stack(..., dropna=False) when the columns consist of a "partial" MultiIndex, i.e. one in which the labels don't reference all the levels.
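To make the term "partial" MultiIndex concrete, here is a minimal sketch of how one arises. The level values (`"a"`, `"b"`, `1`, `2`) and names (`X`, `Y`) are made up for illustration; only the `full_mi[:2]` slicing pattern comes from the test in this PR. Note that newer pandas releases call the `labels` attribute `codes`.

```python
import numpy as np
import pandas as pd

# Full MultiIndex: every stored level value is referenced by some column.
full_mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["X", "Y"])

# Slicing keeps the original levels, so "b" survives as an unused level value.
partial_mi = full_mi[:2]
df = pd.DataFrame(np.zeros((5, 2)), columns=partial_mi)

stored_x = list(partial_mi.levels[0])                    # values stored in the level
used_x = list(partial_mi.get_level_values(0).unique())   # values actually referenced

print(stored_x)  # ['a', 'b']
print(used_x)    # ['a']
```

The mismatch between `stored_x` and `used_x` is what makes the index "partial".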

jreback · 2014-11-19T11:18:46Z

pandas/tests/test_frame.py

+        partial_mi = full_mi[:2]
+        df = DataFrame(np.zeros((5, 2)), columns=partial_mi)
+
+        for level in [-1, 0, 1, [0, 1], [1, 0]]:


Maybe add a test that compares against a known result, i.e. one where you construct the resulting frame/series directly rather than using df.stack to create it. No need to do it for all levels, maybe just [1, 0].

Yep, still a work in progress...
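A known-result test of the kind suggested above could look roughly like the following. This is a sketch, not the test that was added to pandas/tests/test_frame.py: the frame, its labels, and the values are invented, and `dropna=False` is left out so the snippet stays runnable on newer pandas releases, where that keyword is being phased out.

```python
import pandas as pd

# Small frame with two-level columns (illustrative values only).
columns = pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["X", "Y"])
df = pd.DataFrame([[0.0, 1.0], [2.0, 3.0]], index=["r0", "r1"], columns=columns)

result = df.stack(level="Y")

# Expected result built by hand, without calling stack().
expected = pd.DataFrame(
    [[0.0], [1.0], [2.0], [3.0]],
    index=pd.MultiIndex.from_product([["r0", "r1"], [1, 2]], names=[None, "Y"]),
    columns=pd.Index(["a"], name="X"),
)

pd.testing.assert_frame_equal(result, expected)
```

Building `expected` independently means the test cannot pass merely because `stack` agrees with itself.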

jreback · 2014-11-30T17:36:21Z

this looks fine

cc @rockg

@seth-p want to do a quick speed test (maybe even add a vbench?)

seth-p · 2014-11-30T18:05:24Z

Afraid I'm not set up to run vbench (on Windows). Can do simple speed tests Monday or Tuesday.

seth-p · 2014-11-30T18:08:07Z

Any comments on my question at #8844 (comment)?

jreback · 2014-11-30T18:23:48Z

You are stacking in the SAME order as presented, yes? So if the columns happen to be sorted (as they are most of the time), the output columns will be in that order; otherwise they will be in the indicated order, yes?

I think that is correct and fine. Granted, you almost always have a sorted MultiIndex, but an unsorted one shouldn't prevent it from working. I would guarantee input order -> output order, and not require sortedness.

seth-p · 2014-11-30T23:05:59Z

Yes, I am keeping them in the same order they appear in the MultiIndex.levels -- which is not necessarily the order in which they appear in the MultiIndex.labels. This reproduces the existing behavior when all levels are used.
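The distinction between the order of `MultiIndex.levels` and the order of appearance can be seen in a small sketch. The tuples below are invented for illustration, and the attribute called `labels` in this thread is named `codes` in newer pandas releases, which is what the snippet uses.

```python
import pandas as pd

# Columns listed with "b" before "a".
mi = pd.MultiIndex.from_tuples([("b", 1), ("a", 2)], names=["X", "Y"])

level_order = list(mi.levels[0])                 # order stored in the level
appearance_order = list(mi.get_level_values(0))  # order in which values appear
codes = mi.codes[0].tolist()                     # positions into levels[0]

print(level_order)       # ['a', 'b']
print(appearance_order)  # ['b', 'a']
print(codes)             # [1, 0]
```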

seth-p · 2014-12-02T04:34:15Z

The proposed change has no measurable effect on timing for a super-long DataFrame (as suggested in #8844 (comment)):
Existing code:

In [12]: df1 = pd.DataFrame(np.arange(500000*10).reshape(500000,10),
                            columns=pd.MultiIndex.from_tuples([(100*x, -y) for x in range(5)
                                                                           for y in range(2)],
                                                              names=['X', 'Y']))

In [13]: %timeit df1.stack()
10 loops, best of 3: 156 ms per loop

In [14]: %timeit df1.stack(level=0)
1 loops, best of 3: 254 ms per loop

In [15]: %timeit df1.stack(level=1)
10 loops, best of 3: 156 ms per loop

In [17]: %timeit df1.stack(level=[0,1])
1 loops, best of 3: 382 ms per loop

With this PR:

In [12]: %timeit df1.stack()
10 loops, best of 3: 156 ms per loop

In [13]: %timeit df1.stack(level=0)
1 loops, best of 3: 252 ms per loop

In [14]: %timeit df1.stack(level=1)
10 loops, best of 3: 156 ms per loop

In [15]: %timeit df1.stack(level=[0,1])
1 loops, best of 3: 381 ms per loop

For a super-wide DataFrame, the new code is a tad slower (roughly 7-14% in the timings below), but I think tolerably so:
Existing code:

In [3]: df = pd.DataFrame(np.arange(5*500000).reshape(5,500000), index=list('abcde'),
                          columns=pd.MultiIndex.from_tuples([(100*x, -y) for x in range(500)
                                                                         for y in range(1000)],
                                                            names=['X', 'Y']))

In [4]: %timeit df.stack()
1 loops, best of 3: 782 ms per loop

In [6]: %timeit df.stack(level=0)
1 loops, best of 3: 1.43 s per loop

In [8]: %timeit df.stack(level=1)
1 loops, best of 3: 758 ms per loop

In [18]: %timeit df.stack(level=[0,1])
1 loops, best of 3: 1.51 s per loop

With this PR:

In [4]: %timeit df.stack()
1 loops, best of 3: 862 ms per loop

In [6]: %timeit df.stack(level=0)
1 loops, best of 3: 1.53 s per loop

In [8]: %timeit df.stack(level=1)
1 loops, best of 3: 867 ms per loop

In [16]: %timeit df.stack(level=[0,1])
1 loops, best of 3: 1.62 s per loop
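Outside IPython, a comparable measurement can be approximated with the standard-library `timeit` module. The following is a sketch only: `time_stack` is a hypothetical helper, and the row count is deliberately far smaller than in the sessions above so that it runs quickly, so the absolute numbers will not match.

```python
import timeit
import warnings

import numpy as np
import pandas as pd

# Newer pandas releases may emit a FutureWarning from stack(); silence it here.
warnings.simplefilter("ignore", FutureWarning)

n_rows = 2000  # assumption: much smaller than the 500000 rows used in the thread
columns = pd.MultiIndex.from_tuples(
    [(100 * x, -y) for x in range(5) for y in range(2)], names=["X", "Y"]
)
df1 = pd.DataFrame(np.arange(n_rows * 10).reshape(n_rows, 10), columns=columns)


def time_stack(frame, level, number=3, repeat=3):
    """Hypothetical helper: best per-call time in seconds for frame.stack(level=level)."""
    timings = timeit.repeat(
        lambda: frame.stack(level=level), number=number, repeat=repeat
    )
    return min(timings) / number


# Same level arguments as in the timing sessions (-1 is the default level).
results = {str(level): time_stack(df1, level) for level in (-1, 0, 1, [0, 1])}
for level, seconds in results.items():
    print(f"stack(level={level}): {seconds * 1e3:.2f} ms per call")
```

Running this once on each branch gives a rough before/after comparison similar in spirit to the `%timeit` output.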


jreback · 2014-12-02T11:14:53Z

thanks!

jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Nov 19, 2014

jreback added this to the 0.15.2 milestone Nov 19, 2014

jreback reviewed Nov 19, 2014

seth-p force-pushed the multiindex_stacking branch 4 times, most recently from 31529a0 to f10ca8a Compare November 20, 2014 00:20

BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex.

c350118

seth-p force-pushed the multiindex_stacking branch from f10ca8a to c350118 Compare November 20, 2014 00:42

seth-p changed the title ~~WIP/BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex.~~ BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex. Nov 20, 2014

seth-p mentioned this pull request Nov 20, 2014

BUG: .stack(dropna=False) looks through views incorrectly for dataframe views with multi-index columns #8844

Closed

jreback added a commit that referenced this pull request Dec 2, 2014

Merge pull request #8855 from seth-p/multiindex_stacking

2063c1f

BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex.

jreback merged commit 2063c1f into pandas-dev:master Dec 2, 2014

seth-p deleted the multiindex_stacking branch December 2, 2014 20:13
