PERF: IndexEngine.get_indexer_non_unique #55816

lukemanley · 2023-11-03T22:16:12Z

Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/v2.2.0.rst file if fixing a bug or adding a new feature.

maybe(?) closes #15364

When resizing the result array, grow it by a factor of 2 rather than by a fixed amount.

import pandas as pd
import numpy as np

idx = pd.Index(np.ones(1_000_000))
target = pd.Index([1])

%timeit idx.get_indexer_for(target)

# 968 ms ± 35.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)     -> main
# 73.1 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  -> PR

import pandas as pd
import numpy as np

idx = pd.Index(np.arange(1_000_000).tolist() + [0])
target = idx[:-1]

%timeit idx.get_indexer_for(target)

# 2.17 s ± 22.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  -> main
# 1.18 s ± 27.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  -> PR
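To illustrate why the growth policy matters, here is a minimal sketch that counts how many resizes happen, and how many elements get copied, while a buffer grows to hold 1,000,000 results (the size of the first example above). It is not the pandas Cython code: `simulate` and its parameters are hypothetical names, and it assumes every resize copies the whole buffer, which is a worst case.

```python
def simulate(n_items, initial, grow):
    """Return (resizes, elements_copied) needed to reach capacity >= n_items.

    Assumption: each resize allocates a new buffer and copies the full
    old one, so the copy cost of a resize equals the old capacity.
    """
    capacity = initial
    resizes = 0
    copied = 0
    while capacity < n_items:
        copied += capacity          # old buffer is full when we resize
        capacity = grow(capacity)   # apply the growth policy
        resizes += 1
    return resizes, copied


# Fixed-step growth (+10_000 per resize) versus doubling.
fixed = simulate(1_000_000, 10_000, lambda c: c + 10_000)
doubling = simulate(1_000_000, 10_000, lambda c: c * 2)

print(fixed)     # (99, 49500000)
print(doubling)  # (7, 1270000)
```

Under these assumptions the fixed-step policy copies roughly 39 times as many elements as doubling, which is in line with the direction of the timings above, although the real speedup also depends on everything else the lookup does.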

mroeschke · 2023-11-04T00:31:48Z

How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?

lukemanley · 2023-11-04T02:09:53Z

> How would performance be if scaling followed 10000 + 2^n where 2^n was about 10000 on the first iteration?

I tried 10000 + 2^n starting with n=13 (8192), which produced the following timings for the two examples above:

79.7 ms ± 3.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.34 s ± 71.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pretty close to the timings in this PR. Let me know if you prefer to use this.

mroeschke · 2023-11-04T22:15:49Z

If the resizing is the expensive operation here, I think this would be generally more performant for larger cases too, so yes.

lukemanley · 2023-11-04T23:34:05Z

If I understand correctly, I think your suggestion is very close to the current PR as they both end up growing by a factor of 2:

In [1]: import pandas as pd

In [2]: pd.DataFrame(
   ...:     {
   ...:         "main": [(n+1) * 10_000 for n in range(10)],
   ...:         "PR": [10_000 * 2**n for n in range(10)],
   ...:         "proposed": [10_000] + [10_000 + 2**(13+n) for n in range(9)]
   ...:     }
   ...: )
Out[2]: 
     main       PR  proposed
0   10000    10000     10000
1   20000    20000     18192
2   30000    40000     26384
3   40000    80000     42768
4   50000   160000     75536
5   60000   320000    141072
6   70000   640000    272144
7   80000  1280000    534288
8   90000  2560000   1058576
9  100000  5120000   2107152

Am I understanding your suggestion correctly?
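As a rough check of how close the schedules are, the sketch below uses the same three formulas as the table above to find how many resizes each needs before the capacity covers 1,000,000 results. The function names are hypothetical and chosen only for this illustration.

```python
from itertools import count


def capacity_main(n):
    # Fixed step: 10_000, 20_000, 30_000, ...
    return (n + 1) * 10_000


def capacity_pr(n):
    # Doubling: 10_000, 20_000, 40_000, ...
    return 10_000 * 2**n


def capacity_proposed(n):
    # 10_000 first, then 10_000 + 2^13, 10_000 + 2^14, ...
    return 10_000 if n == 0 else 10_000 + 2 ** (12 + n)


def resizes_needed(capacity_at, n_results):
    """Return (number of resizes, final capacity) to hold n_results."""
    for step in count():
        if capacity_at(step) >= n_results:
            return step, capacity_at(step)


for name, fn in [("main", capacity_main),
                 ("PR", capacity_pr),
                 ("proposed", capacity_proposed)]:
    print(name, resizes_needed(fn, 1_000_000))
# main (99, 1000000)
# PR (7, 1280000)
# proposed (8, 1058576)
```

Both geometric schedules reach the target in a handful of resizes, compared with 99 for the fixed step. They differ mainly in how much they over-allocate at the end.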

WillAyd

lgtm. I don't think the heuristic for this will make too much of a difference. See also https://stackoverflow.com/questions/3190146/is-it-better-to-allocate-memory-in-the-power-of-two

mroeschke

Ah, I see. Yeah, what you have is sufficient in this case (and less complex than what I proposed).

mroeschke · 2023-11-06T17:27:26Z

Thanks @lukemanley

resize array by factor of 2

578dc3a

lukemanley added the Performance (memory or execution speed performance) and Index (related to the Index class or subclasses) labels Nov 3, 2023

lukemanley added this to the 2.2 milestone Nov 3, 2023

lukemanley requested a review from WillAyd as a code owner November 3, 2023 22:16

whatsnew

5eeac80

WillAyd approved these changes Nov 6, 2023


mroeschke approved these changes Nov 6, 2023


mroeschke merged commit 5d82d8b into pandas-dev:main Nov 6, 2023

lukemanley deleted the get-indexer-non-unique branch November 16, 2023 12:57
