
PERF: potential room for optimizing concatenation of MultiIndex (MultiIndex.append) #53574


Closed
jorisvandenbossche opened this issue Jun 9, 2023 · 0 comments · Fixed by #53697
Labels: MultiIndex, Performance (memory or execution speed)

Comments

@jorisvandenbossche
Member

While discussing #53515, I noticed that specifically the concatenation of the index shows up in an example implementation of stack(), in this case concatenating MultiIndexes. That made me wonder why this operation is relatively slow, and whether there are ways to speed it up. What follows is a short report of this exploration.

Test case: combining two MultiIndex objects:

import pandas as pd

idx1 = pd.MultiIndex.from_product([range(1000), range(100), ['a']])
idx2 = pd.MultiIndex.from_product([range(1000), range(100), ['b']])

result = idx1.append(idx2)

This operation essentially "densifies" the levels of the MultiIndex (get_level_values), appends them, and then re-encodes those dense values into codes/levels with MultiIndex.from_arrays.
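Roughly, that existing path can be sketched as follows (a simplified illustration, not the actual pandas internals; `append_via_densify` is a hypothetical name):

```python
import pandas as pd

def append_via_densify(idx1, idx2):
    # Materialize each level as a full-length array, append, then let
    # from_arrays re-factorize the dense values back into codes/levels.
    arrays = [
        idx1.get_level_values(i).append(idx2.get_level_values(i))
        for i in range(idx1.nlevels)
    ]
    return pd.MultiIndex.from_arrays(arrays, names=idx1.names)

idx1 = pd.MultiIndex.from_product([range(10), ["a"]])
idx2 = pd.MultiIndex.from_product([range(10), ["b"]])
assert append_via_densify(idx1, idx2).equals(idx1.append(idx2))
```

The cost comes from materializing and re-factorizing those full-length arrays, even though the codes/levels representation was already available.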
Below is a small toy implementation of concatenating two MultiIndex objects (with the same number of levels) that never fully densifies the level values, but only "recodes" the existing codes against the combined levels:

import numpy as np
import pandas as pd
from pandas.core.arrays.categorical import recode_for_categories

def concat_multi_index(idx1, idx2):
    codes = []
    levels = []
    for level in range(idx1.nlevels):
        # Combined level values for both indexes
        level_values = idx1.levels[level].union(idx2.levels[level])
        # Remap each index's codes onto the combined level values
        codes1 = recode_for_categories(idx1.codes[level], idx1.levels[level], level_values, copy=False)
        codes2 = recode_for_categories(idx2.codes[level], idx2.levels[level], level_values, copy=False)
        codes.append(np.concatenate([codes1, codes2]))
        levels.append(level_values)
    return pd.MultiIndex(codes=codes, levels=levels, names=idx1.names, verify_integrity=False)

This gives the same result, and a significant (20x) speed-up for this case:

In [51]: res1 = idx1.append(idx2)

In [52]: res2 = concat_multi_index(idx1, idx2)

In [53]: res1.equals(res2)
Out[53]: True

In [54]: %timeit idx1.append(idx2)
10.7 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [55]: %timeit concat_multi_index(idx1, idx2)
466 µs ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(run with current main branch)

A bunch of caveats:

  • This is not a generic implementation (it currently assumes matching levels, and handles only two inputs, while append can take a list), so it would need more work to generalize, which probably also adds some extra overhead. I also didn't test it, so there might be various corner cases I am overlooking here (a good next step would be to run our test suite with this).
  • I only benchmarked this very specific case. It would be good to test it with other MultiIndexes with different characteristics (number of unique level values, amount of overlap in level values between the indices, etc.) to verify whether it is faster in general, or only for the data I was testing with.
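As a rough sketch of one possible generalization, the toy function could accept a list of MultiIndexes instead of exactly two (still untested, still assuming all inputs have the same number of levels; `concat_multi_indexes` is a hypothetical name):

```python
import numpy as np
import pandas as pd
from pandas.core.arrays.categorical import recode_for_categories

def concat_multi_indexes(indexes):
    # Assumes all inputs are MultiIndexes with the same number of levels.
    first = indexes[0]
    codes, levels = [], []
    for i in range(first.nlevels):
        # Union of this level's values across all inputs
        combined = first.levels[i]
        for idx in indexes[1:]:
            combined = combined.union(idx.levels[i])
        # Remap every input's codes onto the combined level values
        recoded = [
            recode_for_categories(idx.codes[i], idx.levels[i], combined, copy=False)
            for idx in indexes
        ]
        codes.append(np.concatenate(recoded))
        levels.append(combined)
    return pd.MultiIndex(codes=codes, levels=levels, names=first.names, verify_integrity=False)
```

This keeps the same recoding idea but pays one union and one concatenate per level regardless of the number of inputs; mismatched names and differing level counts would still need explicit handling.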

I currently don't have time to further dig in and make this into a PR myself, so if someone else is interested, feel free to take it up!
