PERF: Avoid materializing entire IntervalIndex when using cut #27668

Closed
jschendel opened this issue Jul 31, 2019 · 0 comments · Fixed by #27669
Labels: Interval (Interval data type), Performance (Memory or execution speed performance)
Milestone: 1.0

jschendel (Member) commented Jul 31, 2019

When using cut with an IntervalIndex for bins, the result is first materialized as an IntervalIndex and then converted to a Categorical:

if isinstance(bins, IntervalIndex):
    # we have a fast-path here
    ids = bins.get_indexer(x)
    result = algos.take_nd(bins, ids)
    result = Categorical(result, categories=bins, ordered=True)
    return result, bins

It seems like it'd be faster and use less memory to bypass the intermediate IntervalIndex built by take_nd and instead construct the Categorical directly from ids via Categorical.from_codes.
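
A minimal sketch of the idea, using only public pandas APIs rather than the internal code path inside cut. The sample data and the variable names `direct` and `reference` are illustrative assumptions, not part of the actual patch. The positions returned by get_indexer already have the shape of categorical codes (with -1 marking values that fall in no bin), so they can be passed to Categorical.from_codes as-is:

```python
import numpy as np
import pandas as pd

# Five right-closed bins: (0, 1], (1, 2], ..., (4, 5]
bins = pd.interval_range(0, 5)
x = np.array([0.5, 1.5, 4.5, 7.0])  # 7.0 lies outside every bin

# Position of the containing interval for each value; -1 means "no bin".
ids = bins.get_indexer(x)

# Build the Categorical straight from the positions, without first
# materializing an IntervalIndex of per-value intervals.
direct = pd.Categorical.from_codes(ids, categories=bins, ordered=True)

# Sanity check against what pd.cut itself returns for the same input.
reference = pd.cut(x, bins)
assert (direct.codes == reference.codes).all()
assert direct.categories.equals(reference.categories)
```

Values outside all bins keep the code -1, which Categorical treats as missing, so no separate fill step should be needed in this sketch.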

Some ad hoc measurements on master:

In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
7.69 s ± 43.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 278.39 MiB, increment: 130.76 MiB

And the same measurements with the Categorical.from_codes fix:

In [3]: ii = pd.interval_range(0, 20)

In [4]: values = np.linspace(0, 20, 100).repeat(10**4)

In [5]: %timeit pd.cut(values, ii)
1.02 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %memit pd.cut(values, ii)
peak memory: 145.81 MiB, increment: 15.98 MiB
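
The %timeit and %memit magics above only work inside IPython (the latter via memory_profiler). As a rough plain-Python approximation of the timing comparison, the two construction strategies can be isolated and timed with the standard timeit module. This is a sketch under several assumptions: it uses the public IntervalIndex.take in place of the internal algos.take_nd, a much smaller input than the 10**6 values above, and values chosen strictly inside the bins so that no -1 positions need filling. The function names `via_take` and `via_codes` are hypothetical.

```python
import timeit

import numpy as np
import pandas as pd

bins = pd.interval_range(0, 20)
# 10**4 values, all strictly inside (0, 20], so get_indexer never returns -1.
values = np.linspace(0.5, 19.5, 100).repeat(100)
ids = bins.get_indexer(values)
assert (ids >= 0).all()


def via_take():
    # Materialize one interval per value, then let Categorical
    # recompute the codes from those intervals.
    materialized = bins.take(ids)
    return pd.Categorical(materialized, categories=bins, ordered=True)


def via_codes():
    # Reuse the positions directly as categorical codes.
    return pd.Categorical.from_codes(ids, categories=bins, ordered=True)


# Both strategies should yield the same codes.
assert (via_take().codes == via_codes().codes).all()

t_take = min(timeit.repeat(via_take, number=1, repeat=3))
t_codes = min(timeit.repeat(via_codes, number=1, repeat=3))
print(f"take + Categorical: {t_take:.4f} s")
print(f"from_codes:         {t_codes:.4f} s")
```

Absolute numbers will differ from the measurements above, since the input size, pandas version, and hardware all differ.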

jschendel added the Performance and Interval labels Jul 31, 2019
jschendel added this to the 1.0 milestone Jul 31, 2019