
PERF: Do not init cache in RangeIndex.take #53397


Merged (9 commits) on Jul 18, 2023

Conversation

GianlucaFicarelli
Contributor

@GianlucaFicarelli GianlucaFicarelli commented May 26, 2023

Improve performance when passing an array to RangeIndex.take, or to DataFrame.loc or DataFrame.iloc when the DataFrame uses a RangeIndex.
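For readers who want the idea behind the title in isolation, here is a minimal pure-Python sketch. It is a toy model, not the pandas implementation: `ToyRangeIndex`, `take_materialized`, `take_arithmetic`, and `cache_is_populated` are hypothetical names, and `functools.cached_property` only stands in for whatever caching pandas uses internally. The point is that values of a range can be computed from `start` and `step`, so a take does not need the materialized array.

```python
from functools import cached_property


class ToyRangeIndex:
    """Toy stand-in for a range-backed index (hypothetical, not pandas code)."""

    def __init__(self, start, stop, step=1):
        self._range = range(start, stop, step)

    @cached_property
    def _data(self):
        # Materializes every value; the cost grows with the length of the range.
        return list(self._range)

    def take_materialized(self, indices):
        # Reading self._data fills the cache on the first call.
        return [self._data[i] for i in indices]

    def take_arithmetic(self, indices):
        # Computes each value as start + i * step and never reads self._data.
        n = len(self._range)
        start, step = self._range.start, self._range.step
        out = []
        for i in indices:
            if not -n <= i < n:
                raise IndexError(f"index {i} is out of bounds for size {n}")
            if i < 0:
                i += n  # negative positions count from the end
            out.append(start + i * step)
        return out

    def cache_is_populated(self):
        # cached_property stores its result in the instance __dict__.
        return "_data" in self.__dict__


idx = ToyRangeIndex(0, 1_000_000, 2)
fast = idx.take_arithmetic([3, -1])
cache_after_fast = idx.cache_is_populated()
slow = idx.take_materialized([3, -1])
cache_after_slow = idx.cache_is_populated()
print(fast, cache_after_fast, slow, cache_after_slow)
```

In this toy both methods return the same values, but only the materializing one leaves a full copy of the range cached on the object.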

Performance test:

# Run in IPython or Jupyter: %time and %timeit are IPython magics.
import numpy as np
import pandas as pd

# Baseline: cost of constructing the index alone.
print("index creation")
%timeit pd.RangeIndex(100_000_000)

# A fresh index is built on every loop, so each take() starts with an empty cache.
indices = [90_000_000]
print(f"{len(indices)=}, index creation + first execution")
%timeit pd.RangeIndex(100_000_000).take(indices)

# One position: first call on a new index, then repeated calls on the same index.
print()
ind = pd.RangeIndex(100_000_000)
indices = np.array([90_000_000])
print(f"{len(indices)=}, first execution")
%time ind.take(indices)
print(f"{len(indices)=}, other executions")
%timeit ind.take(indices)

# One million random positions, seeded for reproducibility.
print()
ind = pd.RangeIndex(100_000_000)
rng = np.random.default_rng(0)
indices = rng.integers(0, 100_000_000, 1_000_000)
print(f"{len(indices)=}, first execution")
%time ind.take(indices)
print(f"{len(indices)=}, other executions")
%timeit ind.take(indices)

Output (pandas 2.0.1):

index creation
3.58 µs ± 9.11 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
len(indices)=1, index creation + first execution
204 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

len(indices)=1, first execution
CPU times: user 54.9 ms, sys: 147 ms, total: 202 ms
Wall time: 202 ms
len(indices)=1, other executions
3.79 µs ± 21.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

len(indices)=1000000, first execution
CPU times: user 80.4 ms, sys: 155 ms, total: 236 ms
Wall time: 236 ms
len(indices)=1000000, other executions
28.2 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Output (this PR):

index creation
3.54 µs ± 15.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
len(indices)=1, index creation + first execution
14.6 µs ± 53.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

len(indices)=1, first execution
CPU times: user 38 µs, sys: 0 ns, total: 38 µs
Wall time: 41.5 µs
len(indices)=1, other executions
7.96 µs ± 16.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

len(indices)=1000000, first execution
CPU times: user 2.92 ms, sys: 1.99 ms, total: 4.91 ms
Wall time: 4.93 ms
len(indices)=1000000, other executions
2.17 ms ± 277 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
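
To make the two outputs easier to compare, the following short script computes before/after ratios from the timings reported above. The numbers are copied from those outputs; the labels and the `timings` and `ratios` names are only illustrative. A ratio above 1 means this PR is faster in that measurement, and a ratio below 1 means it is slower.

```python
# (pandas 2.0.1, this PR) timings in seconds, copied from the outputs above.
timings = {
    "1 index, first execution (wall time)": (202e-3, 41.5e-6),
    "1 index, other executions": (3.79e-6, 7.96e-6),
    "1,000,000 indices, first execution (wall time)": (236e-3, 4.93e-3),
    "1,000,000 indices, other executions": (28.2e-3, 2.17e-3),
}

# Ratio of the old timing to the new timing for each measurement.
ratios = {label: before / after for label, (before, after) in timings.items()}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
```

In these measurements the first call on a new index improves by roughly 4,900x for a single position and about 48x for one million positions, and repeated calls with one million positions improve by about 13x. Repeated calls with a single position go from 3.79 µs to 7.96 µs, a ratio of about 0.48x, so that warm case is somewhat slower with this PR.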

@GianlucaFicarelli GianlucaFicarelli force-pushed the range_index_take branch 2 times, most recently from cb5ba30 to 21dd222 Compare May 26, 2023 10:29
@mroeschke mroeschke requested a review from jbrockmendel May 26, 2023 17:10
@mroeschke mroeschke added the Indexing and Performance labels May 26, 2023
@GianlucaFicarelli GianlucaFicarelli force-pushed the range_index_take branch 4 times, most recently from 0b4497d to 964a190 Compare May 30, 2023 15:47
@github-actions
Contributor

github-actions bot commented Jul 1, 2023

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 1, 2023
@GianlucaFicarelli GianlucaFicarelli force-pushed the range_index_take branch 3 times, most recently from bbe2658 to 155e572 Compare July 3, 2023 13:19
@GianlucaFicarelli GianlucaFicarelli force-pushed the range_index_take branch 2 times, most recently from 7ff5a04 to 1def1ee Compare July 12, 2023 15:31
@jbrockmendel
Member

Couple questions, otherwise LGTM

@GianlucaFicarelli
Contributor Author

> Couple questions, otherwise LGTM

Thanks, I've answered all the pending questions.

@jbrockmendel jbrockmendel merged commit fd43d4b into pandas-dev:main Jul 18, 2023
@jbrockmendel
Member

thanks @GianlucaFicarelli

@mroeschke mroeschke added this to the 2.1 milestone Jul 18, 2023
@GianlucaFicarelli GianlucaFicarelli deleted the range_index_take branch July 19, 2023 07:38
Labels
Indexing, Performance, Stale
Development

Successfully merging this pull request may close these issues.

PERF: RangeIndex cache is written when calling df.loc[array]
3 participants