Multi-index and CategoricalIndex performance #22044

0x0L · 2018-07-24T21:10:14Z

Hi,

Building a multi-index from a categorical index should be instantaneous: labels are codes and the corresponding level is a CategoricalIndex with the same N categories and codes [0 ... N-1] as intended in

pandas/pandas/core/arrays/categorical.py

Line 2510 in edb71fd

categories = CategoricalIndex(values.categories,

Unfortunately, that call is painfully slow:

import pandas as pd
import numpy as np

pd.__version__
# 0.23.3

x = np.linspace(-1, 1, 1_000_000)
i = pd.Index(x)
j = i.astype('category')

%timeit pd.MultiIndex.from_arrays([j])
# 180 ms ± 912 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
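To make the expected structure concrete, here is a minimal sketch that checks it on a smaller input. It assumes pandas 0.24 or later, where the per-level integer arrays are exposed as `MultiIndex.codes` (0.23 called them `labels`); the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Smaller than the 1_000_000 points in the report, so the check runs quickly.
x = np.linspace(-1, 1, 1_000)
j = pd.Index(x).astype("category")  # a CategoricalIndex

mi = pd.MultiIndex.from_arrays([j])

# The MultiIndex codes should simply be the categorical's own codes ...
codes_match = np.array_equal(np.asarray(mi.codes[0]), j.codes)
# ... and the single level should hold the same values as the categories.
level_matches = list(mi.levels[0]) == list(j.categories)

print(codes_match, level_matches)
```

Nothing in this check requires the values to be re-factorized, which is why the construction could in principle be close to free.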

The line from categorical.py quoted above is the first bottleneck; it takes 130 ms in this example.
We could work around it by replacing that line with something along these lines:

indexer = np.arange(len(values.categories), dtype=values.codes.dtype)
categories = pd.CategoricalIndex([], fastpath=True).set_categories(values.categories)
categories._data._codes = indexer
categories._cleanup()

On my machine, this takes ~2 ms

The remaining 50 ms are spent calling _shallow_copy in MultiIndex._set_levels. That copy may not be that shallow: it looks like it boils down to yet another variation on constructing the CategoricalIndex:

%timeit categories._shallow_copy()
# 46 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.CategoricalIndex(categories, categories.categories, categories.ordered)
# 47.2 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note however that

%timeit pd.CategoricalIndex(categories)
# 3.69 µs ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
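For anyone who wants to reproduce the comparison outside IPython, here is a rough sketch using the standard-library `timeit` module and only public constructors. The helper names `rewrap` and `rebuild` are made up for this example, the input is smaller than in the report, and the timings will differ across machines and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 1_000_000 points used above so the sketch runs quickly.
x = np.linspace(-1, 1, 100_000)
categories = pd.Index(x).astype("category")  # a CategoricalIndex


def rewrap():
    # Pass the existing CategoricalIndex through unchanged.
    return pd.CategoricalIndex(categories)


def rebuild():
    # Same data, but categories and ordered are passed explicitly.
    return pd.CategoricalIndex(
        categories,
        categories=categories.categories,
        ordered=categories.ordered,
    )


# Both calls should describe the same index; only the cost may differ.
same_result = rewrap().equals(rebuild())

# Best-of-3 average time per call, in seconds.
t_rewrap = min(timeit.repeat(rewrap, number=10, repeat=3)) / 10
t_rebuild = min(timeit.repeat(rebuild, number=10, repeat=3)) / 10

print(f"same result: {same_result}")
print(f"rewrap:  {t_rewrap * 1e6:.1f} µs per call")
print(f"rebuild: {t_rebuild * 1e6:.1f} µs per call")
```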

I would have liked to make a PR, but I don't fully understand the code and all that class overloading.

TomAugspurger · 2018-07-24T21:13:34Z

Building a multi-index from a categorical index should be instantaneous

Is this true only when building a MultiIndex with a single level?

0x0L · 2018-07-24T21:15:38Z

Levels are independent so this is true of any CategoricalIndex in a MultiIndex. See

pandas/pandas/core/arrays/categorical.py

Line 2521 in edb71fd

def _factorize_from_iterables(iterables):

which is where the work is done when calling pd.MultiIndex.from_arrays
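A small sketch of that independence, using made-up data: each array passed to `from_arrays` gets its own level, so a categorical array contributes its categories regardless of what the other levels contain.

```python
import pandas as pd

# Illustrative data: one categorical array and one plain integer list.
a = pd.Categorical(["x", "y", "x", "z"])
b = [10, 20, 10, 30]

single = pd.MultiIndex.from_arrays([a])
multi = pd.MultiIndex.from_arrays([a, b], names=["cat", "num"])

# The categorical level holds the same values in both cases.
cat_level_single = list(single.levels[0])
cat_level_multi = list(multi.levels[0])
num_level = list(multi.levels[1])

print(cat_level_single, cat_level_multi, num_level)
```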

shenker · 2018-09-02T02:35:28Z

Just wanted to chime in that this seems to be the main performance bottleneck for me when reading in string index columns from Parquet (using, e.g., pyarrow).

Because they're dictionary-encoded on disk, they're efficiently read in as Categorical columns; calling set_index on this single column (to make a CategoricalIndex) or set_index with a string column and a few integer columns (to make a MultiIndex) takes 1-2min for 250M rows, which is longer than it actually takes to read the data from disk.
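The two `set_index` patterns described here can be sketched on a tiny in-memory frame. This stands in for the Parquet read (no pyarrow involved), and the column names are invented for the example.

```python
import pandas as pd

# Stand-in for a frame read from Parquet with a dictionary-encoded string column.
df = pd.DataFrame(
    {
        "key": pd.Categorical(["a", "b", "a", "c"]),
        "i": [0, 1, 2, 3],
        "v": [1.0, 2.0, 3.0, 4.0],
    }
)

# A single categorical column becomes a CategoricalIndex.
single = df.set_index("key")

# A categorical column plus an integer column becomes a MultiIndex.
multi = df.set_index(["key", "i"])

print(type(single.index).__name__, type(multi.index).__name__)
```

On 250M rows, both calls go through the index construction paths discussed in this issue, which is where the reported minutes would be spent.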

0x0L mentioned this issue Jul 24, 2018

MultiIndex.levels become duplicated and fail indexing when using fastparquet dask/fastparquet#349

Closed

gfyoung added the Performance (memory or execution speed performance) and MultiIndex labels Jul 27, 2018

0x0L mentioned this issue Jun 7, 2019

PERF: building MultiIndex with categorical levels #26721

Merged

jreback added this to the 0.25.0 milestone Jun 8, 2019

jreback closed this as completed in #26721 Jun 8, 2019
