
PERF: potential room for optimizing concatenation of MultiIndex (MultiIndex.append) #53574


Closed
jorisvandenbossche opened this issue Jun 9, 2023 · 0 comments · Fixed by #53697
Labels: MultiIndex, Performance (memory or execution speed)

Comments

@jorisvandenbossche
Member

While discussing #53515, I noticed that specifically the concatenation of the index shows up in an example implementation of stack(), in this case concatenating MultiIndexes. That made me wonder why this operation is relatively slow, and whether there are ways to speed it up. What follows is a short report of this exploration.

Test case: combining two MultiIndex objects:

import pandas as pd

idx1 = pd.MultiIndex.from_product([range(1000), range(100), ['a']])
idx2 = pd.MultiIndex.from_product([range(1000), range(100), ['b']])

result = idx1.append(idx2)

This operation essentially "densifies" the levels of the MultiIndex (get_level_values), appends them, and then re-encodes those dense values into codes/levels with MultiIndex.from_arrays.
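Roughly, that existing path can be sketched as follows (a simplified illustration, not the actual pandas internals; `append_via_densify` is a hypothetical name):

```python
import pandas as pd

def append_via_densify(idx1, idx2):
    # Materialize each level as a full-length array, append, then let
    # from_arrays re-factorize the dense values back into codes/levels.
    arrays = [
        idx1.get_level_values(i).append(idx2.get_level_values(i))
        for i in range(idx1.nlevels)
    ]
    return pd.MultiIndex.from_arrays(arrays, names=idx1.names)

idx1 = pd.MultiIndex.from_product([range(10), ["a"]])
idx2 = pd.MultiIndex.from_product([range(10), ["b"]])
assert append_via_densify(idx1, idx2).equals(idx1.append(idx2))
```

The cost comes from materializing and re-factorizing those full-length arrays, even though the codes/levels representation was already available.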
Below is a small toy implementation of concatenating two MultiIndex objects (with the same number of levels) that never fully densifies the level values, but only "recodes" the existing codes against the combined levels:

import numpy as np
import pandas as pd
from pandas.core.arrays.categorical import recode_for_categories

def concat_multi_index(idx1, idx2):
    codes = []
    levels = []
    for level in range(idx1.nlevels):
        # Combined level values for both indexes
        level_values = idx1.levels[level].union(idx2.levels[level])
        # Remap each index's codes onto the combined level values
        codes1 = recode_for_categories(idx1.codes[level], idx1.levels[level], level_values, copy=False)
        codes2 = recode_for_categories(idx2.codes[level], idx2.levels[level], level_values, copy=False)
        codes.append(np.concatenate([codes1, codes2]))
        levels.append(level_values)
    return pd.MultiIndex(codes=codes, levels=levels, names=idx1.names, verify_integrity=False)

This gives the same result, and a significant (20x) speed-up for this case:

In [51]: res1 = idx1.append(idx2)

In [52]: res2 = concat_multi_index(idx1, idx2)

In [53]: res1.equals(res2)
Out[53]: True

In [54]: %timeit idx1.append(idx2)
10.7 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [55]: %timeit concat_multi_index(idx1, idx2)
466 µs ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

(run with current main branch)

A bunch of caveats:

  • This is not a generic implementation (it currently assumes matching levels, and handles only two inputs, while append can take a list), so it would need more work to generalize, which probably also adds some extra overhead. I also didn't test it, so there might be various corner cases I am overlooking here (a good next step would be to run our test suite with this).
  • I only benchmarked this very specific case. It would be good to test it with other MultiIndexes with different characteristics (number of unique level values, amount of overlap in level values between the indices, etc.) to verify whether it is faster in general, or only for the data I was testing with.
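As a rough sketch of one possible generalization, the toy function could accept a list of MultiIndexes instead of exactly two (still untested, still assuming all inputs have the same number of levels; `concat_multi_indexes` is a hypothetical name):

```python
import numpy as np
import pandas as pd
from pandas.core.arrays.categorical import recode_for_categories

def concat_multi_indexes(indexes):
    # Assumes all inputs are MultiIndexes with the same number of levels.
    first = indexes[0]
    codes, levels = [], []
    for i in range(first.nlevels):
        # Union of this level's values across all inputs
        combined = first.levels[i]
        for idx in indexes[1:]:
            combined = combined.union(idx.levels[i])
        # Remap every input's codes onto the combined level values
        recoded = [
            recode_for_categories(idx.codes[i], idx.levels[i], combined, copy=False)
            for idx in indexes
        ]
        codes.append(np.concatenate(recoded))
        levels.append(combined)
    return pd.MultiIndex(codes=codes, levels=levels, names=first.names, verify_integrity=False)
```

This keeps the same recoding idea but pays one union and one concatenate per level regardless of the number of inputs; mismatched names and differing level counts would still need explicit handling.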

I currently don't have time to further dig in and make this into a PR myself, so if someone else is interested, feel free to take it up!
