PERF/API: fast paths for product MultiIndex? #15503

adbull · 2017-02-25T12:52:38Z

Feature Proposal

At the moment, we have a few different methods for storing indexed higher-dimensional arrays:

DataFrame/Series with a product MultiIndex (PMI), which can be slow
Panel, which is fast, but less flexible, and may be deprecated soon (Deprecation of Panel ? #13563)
xarray, which is designed for this, but has a smaller API

For some datasets, I've found the PMI to be the best option, together with occasional workarounds for performance bottlenecks. Operations which are slow for a general MultiIndex, like unstack() or swaplevel().sortlevel(), can be sped up for PMIs (see below).

It would be great if we could do something like this more generally, with fast paths for PMIs. We could maybe have MultiIndex.from_product() return a PMI object, which would upcast to MultiIndex when necessary. We could also have stack() and unstack() create PMI objects where possible, and perhaps add an argument to concat() and set_index() to create PMIs. Slow MultiIndex operations could then have a fast path for PMI objects.

Code Sample

import numpy as np
import pandas as pd

m = 100
n = 1000

levels = np.arange(m)
index = pd.MultiIndex.from_product([levels]*2)
columns = np.arange(n)
values = np.arange(m*m*n).reshape(m*m, n)
df = pd.DataFrame(values, index, columns)

# >>> timeit slow_unstack()
# 1 loop, best of 3: 363 ms per loop
def slow_unstack():
    return df.unstack()

# >>> timeit fast_unstack()
# 10 loops, best of 3: 55 ms per loop
def fast_unstack():
    columns = pd.MultiIndex.from_product([df.columns, levels])
    values = df.values.reshape(m, m, n).swapaxes(1, 2).reshape(m, m*n)
    return pd.DataFrame(values, levels, columns)

# >>> timeit slow_swaplevel_sortlevel()
# 1 loop, best of 3: 213 ms per loop
def slow_swaplevel_sortlevel():
    return df.swaplevel().sortlevel()

# >>> timeit fast_swaplevel_sortlevel()
# 10 loops, best of 3: 38.7 ms per loop
def fast_swaplevel_sortlevel():
    values = df.values.reshape(m, m, n).swapaxes(0, 1).reshape(m*m, n)
    index = df.index.swaplevel().sortlevel()[0]
    return pd.DataFrame(values, index, df.columns)

The text was updated successfully, but these errors were encountered:

jreback · 2017-02-25T19:33:28Z

So this is quite a lot of work to actually create a separate MI type of index to do this. The bottleneck is not indexing anyhow. Its the reshaping.

diff --git a/pandas/core/reshape.py b/pandas/core/reshape.py
index 87cb088..64d89bb 100644
--- a/pandas/core/reshape.py
+++ b/pandas/core/reshape.py
@@ -175,6 +175,7 @@ class _Unstacker(object):
         return DataFrame(values, index=index, columns=columns)
 
     def get_new_values(self):
+
         values = self.values
 
         # place the values
@@ -184,23 +185,26 @@ class _Unstacker(object):
         result_shape = (length, result_width)
 
         # if our mask is all True, then we can use our existing dtype
-        if self.mask.all():
-            dtype = values.dtype
-            new_values = np.empty(result_shape, dtype=dtype)
+        if self.mask.all() and len(values):
+            new_values = self.sorted_values.reshape(result_shape)
+            new_mask = np.ones(result_shape, dtype=bool)
         else:
-            dtype, fill_value = _maybe_promote(values.dtype, self.fill_value)
-            new_values = np.empty(result_shape, dtype=dtype)
-            new_values.fill(fill_value)
-
-        new_mask = np.zeros(result_shape, dtype=bool)
+            new_mask = np.zeros(result_shape, dtype=bool)
+            if self.mask.all():
+                dtype = values.dtype
+                new_values = np.empty(result_shape, dtype=dtype)
+            else:
+                dtype, fill_value = _maybe_promote(values.dtype, self.fill_value)
+                new_values = np.empty(result_shape, dtype=dtype)
+                new_values.fill(fill_value)
 
-        # is there a simpler / faster way of doing this?
-        for i in range(values.shape[1]):
-            chunk = new_values[:, i * width:(i + 1) * width]
-            mask_chunk = new_mask[:, i * width:(i + 1) * width]
+            # is there a simpler / faster way of doing this?
+            for i in range(values.shape[1]):
+                chunk = new_values[:, i * width:(i + 1) * width]
+                mask_chunk = new_mask[:, i * width:(i + 1) * width]
 
-            chunk.flat[self.mask] = self.sorted_values[:, i]
-            mask_chunk.flat[self.mask] = True
+                chunk.flat[self.mask] = self.sorted_values[:, i]
+                mask_chunk.flat[self.mask] = True
 
         return new_values, new_mask

makes this about 10x faster, BUT there are several cases that are failing. You are welcome to have a look at seeing if this can pass the test suite.

jreback · 2017-02-25T19:35:08Z

In [1]: m = 100
   ...: n = 1000
   ...: 
   ...: levels = np.arange(m)
   ...: index = pd.MultiIndex.from_product([levels]*2)
   ...: columns = np.arange(n)
   ...: values = np.arange(m*m*n).reshape(m*m, n)
   ...: df = pd.DataFrame(values, index, columns)
   ...: 

In [2]: %timeit df.unstack()
1 loop, best of 3: 289 ms per loop

# with change
In [2]: %timeit df.unstack()
10 loops, best of 3: 33.8 ms per loop

xref pandas-dev#15503

closes pandas-dev#15503

jreback · 2017-02-26T02:27:06Z

@adbull ok done in #15510

(note that my final version is about 2x slower than before), because I have to do multiple reshapings to get things in the correct order. But still about 4x faster.

you generally cannot simply do a reshaping, and esp directly on .values (well you can but you have to do it by dtype; which we already handle internally), and the ordering is a bit tricky.

closes pandas-dev#15503

adbull · 2017-02-26T11:32:07Z

That's awesome, thanks! I was thinking we'd need a separate type to avoid checking the index each time, but I guess that's not an issue.

With df as above, would it be possible to do the same for df.T.stack()? df.swaplevel().sortlevel()?

jreback · 2017-02-26T11:50:04Z

.stack and .sorting are separate issues

why don't u profile a bit and see where the hotspots are

closes pandas-dev#15503

closes pandas-dev#15503 Author: Jeff Reback <[email protected]> Closes pandas-dev#15510 from jreback/reshape3 and squashes the following commits: ec29226 [Jeff Reback] PERF: faster unstacking

jreback added Performance Memory or execution speed performance Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 25, 2017

jreback added a commit to jreback/pandas that referenced this issue Feb 26, 2017

PERF: faster unstacking

c30a125

xref pandas-dev#15503

jreback added a commit to jreback/pandas that referenced this issue Feb 26, 2017

PERF: faster unstacking

6e96dee

xref pandas-dev#15503

jreback added a commit to jreback/pandas that referenced this issue Feb 26, 2017

PERF: faster unstacking

a5f641a

closes pandas-dev#15503

jreback added this to the 0.20.0 milestone Feb 26, 2017

jreback added a commit to jreback/pandas that referenced this issue Feb 26, 2017

PERF: faster unstacking

7a00ffb

closes pandas-dev#15503

jreback mentioned this issue Feb 26, 2017

PERF: faster unstacking #15510

Closed

jreback added a commit to jreback/pandas that referenced this issue Feb 26, 2017

PERF: faster unstacking

1436e1e

closes pandas-dev#15503

jreback added a commit to jreback/pandas that referenced this issue Feb 26, 2017

PERF: faster unstacking

21958b7

closes pandas-dev#15503

jreback added a commit to jreback/pandas that referenced this issue Feb 26, 2017

PERF: faster unstacking

1bfa04c

closes pandas-dev#15503

jreback added a commit to jreback/pandas that referenced this issue Mar 2, 2017

PERF: faster unstacking

ec29226

closes pandas-dev#15503

jreback closed this as completed in 09360d8 Mar 5, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF/API: fast paths for product MultiIndex? #15503

PERF/API: fast paths for product MultiIndex? #15503

adbull commented Feb 25, 2017 •

edited

Loading

jreback commented Feb 25, 2017

jreback commented Feb 25, 2017

jreback commented Feb 26, 2017

adbull commented Feb 26, 2017

jreback commented Feb 26, 2017

PERF/API: fast paths for product MultiIndex? #15503

PERF/API: fast paths for product MultiIndex? #15503

Comments

adbull commented Feb 25, 2017 • edited Loading

Feature Proposal

Code Sample

jreback commented Feb 25, 2017

jreback commented Feb 25, 2017

jreback commented Feb 26, 2017

adbull commented Feb 26, 2017

jreback commented Feb 26, 2017

adbull commented Feb 25, 2017 •

edited

Loading