Skip to content

PERF: IndexEngine.get_indexer_non_unique #55816

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 6, 2023

Conversation

lukemanley
Copy link
Member

@lukemanley lukemanley commented Nov 3, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/v2.2.0.rst file if fixing a bug or adding a new feature.

maybe(?) closes #15364

When resizing result array, grow by factor of 2 rather than fixed amount.

import pandas as pd
import numpy as np

idx = pd.Index(np.ones(1_000_000))
target = pd.Index([1])

%timeit idx.get_indexer_for(target)

# 968 ms ± 35.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     -> main
# 73.1 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR
import pandas as pd
import numpy as np

idx = pd.Index(np.arange(1_000_000).tolist() + [0])
target = idx[:-1]

%timeit idx.get_indexer_for(target)

2.17 s ± 22.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  -> main
1.18 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  -> PR

@lukemanley lukemanley added Performance Memory or execution speed performance Index Related to the Index class or subclasses labels Nov 3, 2023
@lukemanley lukemanley added this to the 2.2 milestone Nov 3, 2023
@lukemanley lukemanley requested a review from WillAyd as a code owner November 3, 2023 22:16
@mroeschke
Copy link
Member

How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?

@lukemanley
Copy link
Member Author

How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?

I tried 10000 + 2^n starting with n=13 (8192) which produced the following timings for the two examples above:

79.7 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.34 s ± 71.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pretty close to the timings in this PR. Let me know if you prefer to use this.

@mroeschke
Copy link
Member

mroeschke commented Nov 4, 2023

If the resizing is the expensive operation here, I think this would be generically more performant for larger cases too so yes

@lukemanley
Copy link
Member Author

lukemanley commented Nov 4, 2023

If I understand correctly, I think your suggestion is very close to the current PR as they both end up growing by a factor of 2:

In [1]: import pandas as pd

In [2]: pd.DataFrame(
   ...:     {
   ...:         "main": [(n+1) * 10_000 for n in range(10)],
   ...:         "PR": [10_000 * 2**n for n in range(10)],
   ...:         "proposed": [10_000] + [10_000 + 2**(13+n) for n in range(9)]
   ...:     }
   ...: )
Out[2]: 
     main       PR  proposed
0   10000    10000     10000
1   20000    20000     18192
2   30000    40000     26384
3   40000    80000     42768
4   50000   160000     75536
5   60000   320000    141072
6   70000   640000    272144
7   80000  1280000    534288
8   90000  2560000   1058576
9  100000  5120000   2107152

Am I understanding your suggestion correctly?

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. I don't think the heuristic for this will make too much of a difference. See also https://stackoverflow.com/questions/3190146/is-it-better-to-allocate-memory-in-the-power-of-two

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Yeah what you have is sufficient in this case (and less complex than what I proposed)

@mroeschke mroeschke merged commit 5d82d8b into pandas-dev:main Nov 6, 2023
@mroeschke
Copy link
Member

Thanks @lukemanley

@lukemanley lukemanley deleted the get-indexer-non-unique branch November 16, 2023 12:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Huge performance hit in get_indexer_non_unique
3 participants