PERF: concatenation of MultiIndexed objects (MultiIndex.append) #53697

Merged
merged 4 commits into from
Jun 20, 2023

Conversation

lukemanley
Member

cc: @jorisvandenbossche - nice find/suggestion.

Perf improvement for MultiIndex.append (and pd.concat for objects with MultiIndexes):

       before           after         ratio
       <main>          <mi-append>
-        67.3±2ms       5.09±0.1ms     0.08  multiindex_object.Append.time_append('datetime64[ns]')
-        68.5±1ms       5.10±0.1ms     0.07  multiindex_object.Append.time_append('int64')
-         147±5ms       5.40±0.3ms     0.04  multiindex_object.Append.time_append('string')
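For readers unfamiliar with the two code paths named above, here is a minimal sketch of what gets faster. The index contents and the names `left`, `right`, `s1`, and `s2` are illustrative assumptions, not taken from the benchmark.

```python
import pandas as pd

# Two small MultiIndexes with partially overlapping level values.
left = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["num", "letter"])
right = pd.MultiIndex.from_product([[3], ["a", "c"]], names=["num", "letter"])

# Direct path: MultiIndex.append returns a new MultiIndex with the entries
# of `right` placed after those of `left`.
combined = left.append(right)

# Indirect path: concatenating objects that carry a MultiIndex has to build
# the same combined index for the result.
s1 = pd.Series(range(len(left)), index=left)
s2 = pd.Series(range(len(right)), index=right)
result = pd.concat([s1, s2])
```

Both paths should produce the same index, so `result.index.equals(combined)` is a quick sanity check when comparing versions.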

@lukemanley lukemanley added Performance Memory or execution speed performance MultiIndex labels Jun 16, 2023
@lukemanley lukemanley added this to the 2.1 milestone Jun 16, 2023
Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for picking this up! And it seems that all tests pass, so that's nice ;)

Do you think it would be worth testing this with more MIs? (not necessarily in the asvs, can also be just here in the PR) Or are we confident this should be faster (or at least not a big slowdown) for most cases?

Comment on lines 2160 to 2163
mi.codes[i],
mi.levels[i],
level_values,
copy=False,
Member

Suggested change
- mi.codes[i],
- mi.levels[i],
- level_values,
- copy=False,
+ mi.codes[i], mi.levels[i], level_values, copy=False

(styling nitpick, this can fit on one line I think)

Member Author

updated, thanks

@jorisvandenbossche
Member

Do you think it would be worth testing this with more MIs? (not necessarily in the asvs, can also be just here in the PR) Or are we confident this should be faster (or at least not a big slowdown) for most cases?

I did a quick test with completely unique values in the MI levels (so no repetition even within a level):

import string
import numpy as np
import pandas as pd

N = 100_000
idx1 = pd.MultiIndex.from_arrays([range(N), np.random.choice(list(string.ascii_letters), size=(N, 5)).astype(object).sum(axis=1)])
idx2 = pd.MultiIndex.from_arrays([range(N, N*2), np.random.choice(list(string.ascii_letters), size=(N, 5)).astype(object).sum(axis=1)])
%timeit idx1.append(idx2)

and even here the new algo is faster (the difference is just smaller, "only" around 2x).
My gut feeling says that a case with all unique level values is the "worst" case for the recoding-based implementation, which suggests we can assume it is faster in general. After all, the current implementation, which combines the (unique) level values, also has to encode those values afterwards, so recoding shouldn't be worse here.
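To make the "recoding-based" idea concrete, here is a rough sketch using only public pandas APIs. It is not the code merged in this PR: `append_by_recoding` is a hypothetical helper, and the choice of `Index.union` plus `Index.get_indexer` is an assumption made for illustration. The point is that only the level values (typically far fewer than the rows) are combined, while the per-row integer codes are remapped.

```python
import numpy as np
import pandas as pd


def append_by_recoding(left: pd.MultiIndex, right: pd.MultiIndex) -> pd.MultiIndex:
    """Hypothetical helper: concatenate two MultiIndexes by remapping codes.

    Assumes both inputs have the same number of levels and non-empty levels.
    """
    new_levels, new_codes = [], []
    for i in range(left.nlevels):
        # Combine the level values of both sides into one unique level.
        combined = left.levels[i].union(right.levels[i])
        parts = []
        for mi in (left, right):
            # Position of each old level value inside the combined level.
            mapping = combined.get_indexer(mi.levels[i])
            codes = np.asarray(mi.codes[i])
            # Remap the codes; -1 marks a missing value and stays -1.
            parts.append(np.where(codes == -1, -1, mapping[codes]))
        new_levels.append(combined)
        new_codes.append(np.concatenate(parts))
    return pd.MultiIndex(levels=new_levels, codes=new_codes, names=left.names)


left = pd.MultiIndex.from_arrays([[1, 2, 2], ["a", "b", "c"]], names=["x", "y"])
right = pd.MultiIndex.from_arrays([[2, 3], ["c", "d"]], names=["x", "y"])
result = append_by_recoding(left, right)
```

When every level value is unique, as in the test above, the combined levels are as long as the index itself, which is why that case is a plausible stress test for this approach.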

@mroeschke mroeschke merged commit f989e1b into pandas-dev:main Jun 20, 2023
@mroeschke
Member

Very nice, thanks @lukemanley

@lukemanley lukemanley deleted the mi-append branch June 22, 2023 21:59
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
@phofl
Member

phofl commented Jul 12, 2023

@lukemanley could you add a test like in the linked dask issue?

Successfully merging this pull request may close these issues.

PERF: potential room for optimizing concatenation of MultiIndex (MultiIndex.append)
4 participants