
Multi-index and CategoricalIndex performance #22044


Closed
0x0L opened this issue Jul 24, 2018 · 3 comments · Fixed by #26721
Labels
MultiIndex Performance Memory or execution speed performance

@0x0L
Contributor

0x0L commented Jul 24, 2018

Hi,

Building a multi-index from a categorical index should be instantaneous: the labels are the codes, and the corresponding level is a CategoricalIndex with the same N categories and codes [0 ... N-1], as intended in

categories = CategoricalIndex(values.categories,

Unfortunately, that call is painfully slow:

import pandas as pd
import numpy as np

pd.__version__
# 0.23.3

x = np.linspace(-1, 1, 1_000_000)
i = pd.Index(x)
j = i.astype('category')

%timeit pd.MultiIndex.from_arrays([j])
# 180 ms ± 912 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The line above from categorical.py is the first bottleneck. It takes 130 ms in this example.
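For anyone reproducing this outside IPython, here is a minimal sketch of the same measurement using the standard-library `timeit` module. The helper name `time_from_arrays` is made up for illustration, and it reports the best of a few runs rather than the mean ± std. dev. that `%timeit` prints, so the numbers are not directly comparable to the ones quoted in this thread.

```python
import timeit

import numpy as np
import pandas as pd


def time_from_arrays(n, repeat=3):
    """Return the best wall-clock time (seconds) of MultiIndex.from_arrays([j])."""
    x = np.linspace(-1, 1, n)
    j = pd.Index(x).astype("category")  # CategoricalIndex with n distinct categories
    timings = timeit.repeat(
        lambda: pd.MultiIndex.from_arrays([j]), number=1, repeat=repeat
    )
    return min(timings)


# n = 1_000_000 matches the example above; a smaller n keeps this quick to run.
best = time_from_arrays(100_000)
print(f"pandas {pd.__version__}: {best * 1e3:.2f} ms")
```

Results will depend on the pandas version and the machine.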
We could work around it and replace that line with something along these lines:

indexer = np.arange(len(values.categories), dtype=values.codes.dtype)
categories = pd.CategoricalIndex([], fastpath=True).set_categories(values.categories)
categories._data._codes = indexer
categories._cleanup()

On my machine, this takes ~2 ms
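The snippet above relies on private attributes (`_data._codes`, `_cleanup`, `fastpath`). As a rough sketch of the same idea using only public API, the level can be built with `pd.Categorical.from_codes` and identity codes `0 .. N-1`. The helper name `identity_level` is hypothetical, this is not necessarily what pandas does internally, and it has not been benchmarked here.

```python
import numpy as np
import pandas as pd


def identity_level(values):
    """Hypothetical helper: a level whose codes are 0..N-1 over values.categories."""
    n = len(values.categories)
    cat = pd.Categorical.from_codes(
        np.arange(n), categories=values.categories, ordered=values.ordered
    )
    return pd.CategoricalIndex(cat)


# Small stand-in for the categorical being indexed; "c" is an unused category.
values = pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"])
level = identity_level(values)

# The labels of the MultiIndex level would then simply be values.codes.
labels = values.codes
```

The point is that no element of `values` itself has to be inspected: the level depends only on the categories, and the labels are the existing codes.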

The remaining 50 milliseconds are spent calling _shallow_copy in MultiIndex._set_levels. That copy may not be that shallow: it looks like this boils down to yet another variation on the CategoricalIndex constructor:

%timeit categories._shallow_copy()
# 46 ms ± 763 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit pd.CategoricalIndex(categories, categories.categories, categories.ordered)
# 47.2 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note, however, that

%timeit pd.CategoricalIndex(categories)
# 3.69 µs ± 2.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
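A plain-script sketch of the two public constructor calls compared above, again with `timeit` instead of `%timeit`. The helper `best_ms` is made up for illustration, `_shallow_copy` is left out because it is private, and whether the gap still shows up depends on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def best_ms(fn, repeat=3):
    """Best-of-`repeat` wall-clock time of fn(), in milliseconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat)) * 1e3


# Smaller than the 1_000_000 used in the issue, to keep the run short.
categories = pd.Index(np.linspace(-1, 1, 100_000)).astype("category")

# Constructor with only the data argument.
plain = pd.CategoricalIndex(categories)
# Constructor that also restates the categories and orderedness.
explicit = pd.CategoricalIndex(
    categories, categories=categories.categories, ordered=categories.ordered
)

t_plain = best_ms(lambda: pd.CategoricalIndex(categories))
t_explicit = best_ms(
    lambda: pd.CategoricalIndex(
        categories, categories=categories.categories, ordered=categories.ordered
    )
)
print(f"plain: {t_plain:.3f} ms, explicit: {t_explicit:.3f} ms")
```

Both calls should produce the same index, so any difference in timing is pure overhead.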

I would have liked to make a PR, but I don't fully understand the code and all that class overloading.

@TomAugspurger
Contributor

Building a multi-index from a categorical index should be instantaneous

Is this true only when building a MultiIndex with a single level?

@0x0L
Contributor Author

0x0L commented Jul 24, 2018

Levels are independent so this is true of any CategoricalIndex in a MultiIndex. See

def _factorize_from_iterables(iterables):
which is where the work is done when calling pd.MultiIndex.from_arrays
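To illustrate why levels are independent, here is a simplified sketch of per-array factorization. It is not the pandas implementation: `factorize_level` is a hypothetical helper, and the non-categorical branch uses the public `pd.factorize` as a stand-in.

```python
import numpy as np
import pandas as pd


def factorize_level(values):
    """Hypothetical helper: return (codes, level) for one array of a MultiIndex."""
    if isinstance(values, (pd.Categorical, pd.CategoricalIndex)):
        # Categorical input: the codes already exist and the level is the categories.
        return np.asarray(values.codes), values.categories
    # Anything else has to be factorized from scratch.
    codes, uniques = pd.factorize(np.asarray(values), sort=True)
    return codes, pd.Index(uniques)


arrays = [
    pd.Categorical(["b", "a", "b"], categories=["a", "b", "c"]),
    np.array([10, 20, 10]),
]

# Each array is processed on its own, so one level never affects another.
results = [factorize_level(a) for a in arrays]
cat_codes, cat_level = results[0]
int_codes, int_level = results[1]
```

In this sketch the categorical branch does no work proportional to the number of rows, regardless of how many other levels there are.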

@gfyoung gfyoung added Performance Memory or execution speed performance MultiIndex labels Jul 27, 2018
@shenker

shenker commented Sep 2, 2018

Just wanted to chime in that this seems to be the main performance bottleneck for me when reading in string index columns from Parquet (using, e.g., pyarrow).

Because they're dictionary-encoded on disk, they're efficiently read in as Categorical columns. Calling set_index on this single column (to make a CategoricalIndex), or set_index with a string column and a few integer columns (to make a MultiIndex), takes 1-2 min for 250M rows, which is longer than it takes to read the data from disk.
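A small synthetic stand-in for that scenario, in case it helps with reproduction. The column names and sizes are invented, and the frame is built in memory rather than read from Parquet, so this only exercises the two `set_index` calls described above.

```python
import numpy as np
import pandas as pd

n = 1_000  # assumption: scale this up to observe the slowdown described above

df = pd.DataFrame(
    {
        # Stand-in for a dictionary-encoded string column read as Categorical.
        "key": pd.Categorical.from_codes(
            np.arange(n) % 10, categories=[f"k{i}" for i in range(10)]
        ),
        "a": np.arange(n),
        "b": np.arange(n) % 7,
    }
)

# Single categorical column -> CategoricalIndex.
single = df.set_index("key")

# Categorical column plus integer columns -> MultiIndex.
multi = df.set_index(["key", "a", "b"])
```

Timing these two calls at increasing `n` should show whether a given pandas version is still affected.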

5 participants