Skip to content

PERF: reindex with MultiIndex #46235

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 5, 2022

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley commented Mar 4, 2022

Avoid unnecessary (and potentially expensive) call to MultiIndex._values. This only matters when MultiIndex._values has not been cached. Existing asv has been updated to test cached and non-cached cases.

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_arrays(
  [np.arange(10**6)] * 2
)
mi2 = pd.MultiIndex.from_arrays(
  [np.arange(10**7)] * 2
)

df = pd.DataFrame({"A": 1.0}, index=mi)

%timeit x = df.reindex(mi2.copy())
1.84 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)   <- main
685 ms ± 22.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <- PR

@lukemanley lukemanley added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance labels Mar 4, 2022
@@ -3922,14 +3922,15 @@ def _get_indexer(
elif method == "nearest":
indexer = self._get_nearest_indexer(target, limit, tolerance)
else:
tgt_values = target._get_engine_target()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@jreback jreback added this to the 1.5 milestone Mar 5, 2022
@jreback jreback merged commit cefafd4 into pandas-dev:main Mar 5, 2022
@jreback
Copy link
Contributor

jreback commented Mar 5, 2022

very nice @lukemanley

@lukemanley lukemanley deleted the reindex-with-multiindex branch March 5, 2022 12:59
@twoertwein
Copy link
Member

twoertwein commented Mar 5, 2022

I think this PR made pyright unhappy, but I honestly don't see how this PR affected _mixins.py:

pandas/core/arrays/_mixins.py:185:26 - warning: Instance methods should take a "self" parameter (reportSelfClsParameterName)

If the solution is to ignore this warning, you can add
# pyright: reportSelfClsParameterName = false
at the top of the file.

edit:
or rename cls to self:

def _from_factorized(cls, values, original):

@jreback
Copy link
Contributor

jreback commented Mar 5, 2022

i think was a prior PR

@twoertwein
Copy link
Member

i think was a prior PR

Yes, you are right: #46214 If no one beats me, I will create a PR to rename cls to self

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants