BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex. #8855

Merged
merged 1 commit into from
Dec 2, 2014

Conversation

@seth-p seth-p commented Nov 19, 2014

Closes #8844

Fixes DataFrame.stack(..., dropna=False) when the columns consist of a "partial" MultiIndex, i.e. one in which the labels don't reference all the levels.
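To make "partial" concrete, here is a minimal sketch (not taken from the PR) of a MultiIndex whose codes reference only some of its level values. It assumes a pandas version that exposes `MultiIndex.codes` (the attribute called `labels` at the time of this PR) and `remove_unused_levels()`; the variable names are illustrative only.

```python
import pandas as pd

# A "full" MultiIndex: every value in each level is used by some entry.
full_mi = pd.MultiIndex.from_product([["A", "B", "C"], ["x", "y"]])

# Slicing keeps the original levels, so the result is "partial":
# its codes point at only a subset of the stored level values.
partial_mi = full_mi[:2]

# All values still stored in level 0, used or not.
level0_values = list(partial_mi.levels[0])

# Positions within level 0 that the codes actually reference.
level0_used = sorted(int(c) for c in set(partial_mi.codes[0]))

# Dropping the unused values yields a "full" index again.
trimmed_level0 = list(partial_mi.remove_unused_levels().levels[0])

print(level0_values, level0_used, trimmed_level0)
```

Here level 0 still stores "B" and "C" even though no column uses them, which is the situation the fix targets.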

@jreback added the Bug and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Nov 19, 2014
@jreback added this to the 0.15.2 milestone Nov 19, 2014
partial_mi = full_mi[:2]
df = DataFrame(np.zeros((5, 2)), columns=partial_mi)

for level in [-1, 0, 1, [0, 1], [1, 0]]:
Contributor

Maybe add a test that compares against a known result, i.e. one that is not itself created with df.stack: construct the expected frame/series by hand. No need to do this for every level; maybe just [1, 0].

Contributor Author

Yep, still a work in progress...

@seth-p force-pushed the multiindex_stacking branch 4 times, most recently from 31529a0 to f10ca8a, November 20, 2014 00:20
@seth-p force-pushed the multiindex_stacking branch from f10ca8a to c350118, November 20, 2014 00:42
@seth-p changed the title from "WIP/BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex." to "BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex." Nov 20, 2014
jreback commented Nov 30, 2014

this looks fine

cc @rockg

@seth-p want to do a quick speed test (maybe even add a vbench?)

seth-p commented Nov 30, 2014

Afraid I'm not set up to run vbench (on Windows). I can do simple speed tests Monday or Tuesday.

seth-p commented Nov 30, 2014

Any comments on my question at #8844 (comment)?

jreback commented Nov 30, 2014

You are stacking in the SAME order as presented, yes? So if the MultiIndex happens to be sorted (as it is most of the time), the output columns will be in that order; otherwise they will be in the indicated order, yes?

I think that is correct and fine. Granted, you almost always have a sorted MultiIndex, but an unsorted one shouldn't prevent it from working. I would guarantee that input order -> output order, and not require sortedness.

seth-p commented Nov 30, 2014

Yes, I am keeping them in the same order they appear in the MultiIndex.levels -- which is not necessarily the order in which they appear in the MultiIndex.labels. This reproduces the existing behavior when all levels are used.
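The distinction between level order and appearance order can be seen in a small sketch. This is only an illustration of the two orderings, not the PR's code; it assumes a pandas version where the attribute called `labels` above is exposed as `MultiIndex.codes`.

```python
import pandas as pd

# Columns given in the order "b" then "a".
cols = pd.MultiIndex.from_tuples([("b", 1), ("a", 2)])

# from_tuples stores the level values sorted, independent of input order.
level_order = list(cols.levels[0])

# The order in which values actually appear along the index.
appearance_order = list(cols.get_level_values(0))

# The codes map each position to a slot in the level: "b" is slot 1, "a" is slot 0.
codes = [int(c) for c in cols.codes[0]]

print(level_order, appearance_order, codes)
```

Ordering by `levels` therefore gives "a" before "b" here, even though "b" appears first in the columns.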

seth-p commented Dec 2, 2014

The proposed change has no measurable effect on timings for a super-long DataFrame (as suggested in #8844 (comment)).

Existing code:

In [12]: df1 = pd.DataFrame(np.arange(500000*10).reshape(500000,10),
                            columns=pd.MultiIndex.from_tuples([(100*x, -y) for x in range(5)
                                                                           for y in range(2)],
                                                              names=['X', 'Y']))

In [13]: %timeit df1.stack()
10 loops, best of 3: 156 ms per loop

In [14]: %timeit df1.stack(level=0)
1 loops, best of 3: 254 ms per loop

In [15]: %timeit df1.stack(level=1)
10 loops, best of 3: 156 ms per loop

In [17]: %timeit df1.stack(level=[0,1])
1 loops, best of 3: 382 ms per loop

With this PR:

In [12]: %timeit df1.stack()
10 loops, best of 3: 156 ms per loop

In [13]: %timeit df1.stack(level=0)
1 loops, best of 3: 252 ms per loop

In [14]: %timeit df1.stack(level=1)
10 loops, best of 3: 156 ms per loop

In [15]: %timeit df1.stack(level=[0,1])
1 loops, best of 3: 381 ms per loop

For a super-wide DataFrame, the new code is a tad slower (roughly 7-14% in the runs below), but I think tolerably so.

Existing code:

In [3]: df = pd.DataFrame(np.arange(5*500000).reshape(5,500000), index=list('abcde'),
                          columns=pd.MultiIndex.from_tuples([(100*x, -y) for x in range(500)
                                                                         for y in range(1000)],
                                                            names=['X', 'Y']))

In [4]: %timeit df.stack()
1 loops, best of 3: 782 ms per loop

In [6]: %timeit df.stack(level=0)
1 loops, best of 3: 1.43 s per loop

In [8]: %timeit df.stack(level=1)
1 loops, best of 3: 758 ms per loop

In [18]: %timeit df.stack(level=[0,1])
1 loops, best of 3: 1.51 s per loop

With this PR:

In [4]: %timeit df.stack()
1 loops, best of 3: 862 ms per loop

In [6]: %timeit df.stack(level=0)
1 loops, best of 3: 1.53 s per loop

In [8]: %timeit df.stack(level=1)
1 loops, best of 3: 867 ms per loop

In [16]: %timeit df.stack(level=[0,1])
1 loops, best of 3: 1.62 s per loop

jreback added a commit that referenced this pull request Dec 2, 2014
BUG: DataFrame.stack(..., dropna=False) with partial MultiIndex.
@jreback jreback merged commit 2063c1f into pandas-dev:master Dec 2, 2014
jreback commented Dec 2, 2014

thanks!

@seth-p seth-p deleted the multiindex_stacking branch December 2, 2014 20:13
Successfully merging this pull request may close these issues.

BUG: .stack(dropna=False) looks through views incorrectly for dataframe views with multi-index columns