PERF: specialized 1D take version #40246

Merged

Conversation

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Mar 5, 2021

A part that I took out of #39692, doing it separately here.

This gives a bit of code duplication (but all large parts of the function were already factored out into shared helper functions before, like _take_preprocess_indexer_and_fill_value), but having a 1D version gives less overhead compared to take_nd.

Using the same benchmark case (unstack ASV case) as in #39692, this gives:

import numpy as np
import pandas as pd

arrays = [np.arange(100).repeat(100), np.roll(np.tile(np.arange(100), 100), 25)]
index = pd.MultiIndex.from_arrays(arrays)
df = pd.DataFrame(np.random.randn(10000, 4), index=index)
df_am = df._as_manager("array")

In [3]: %timeit df_am.unstack()
3.31 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- master
2.95 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <-- PR

It's a small difference (around 10% improvement), but I repeated the timings above multiple times and it was quite consistently a bit faster.
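
To make the idea concrete, here is a minimal sketch of what a 1D-only take with fill handling can look like. It is not the code from this PR: `take_1d_sketch` is a hypothetical name, it uses plain NumPy only, and it assumes the convention that -1 in the indexer marks a missing position.

```python
import numpy as np


def take_1d_sketch(arr, indexer, fill_value=np.nan, allow_fill=True):
    """Hypothetical 1D take: -1 in `indexer` is treated as 'missing'."""
    indexer = np.asarray(indexer, dtype=np.intp)
    if not allow_fill:
        # Plain positional take; -1 keeps NumPy's meaning (last element).
        return arr.take(indexer)

    mask = indexer == -1
    if not mask.any():
        # Nothing to fill, so no dtype promotion is needed.
        return arr.take(indexer)

    # Promote the dtype so the result can hold fill_value (e.g. int -> float).
    dtype = np.result_type(arr.dtype, np.asarray(fill_value).dtype)
    out = arr.astype(dtype).take(indexer)  # masked slots hold arr[-1] for now
    out[mask] = fill_value                 # overwrite them with the fill value
    return out


arr = np.array([10, 20, 30])
print(take_1d_sketch(arr, [2, -1, 0]))
```

Because the input is known to be 1D, there is no axis argument and no per-axis mask handling, which is the kind of bookkeeping a specialized version can skip.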

@jorisvandenbossche jorisvandenbossche added the Algos and Performance labels Mar 5, 2021
@jreback jreback added this to the 1.3 milestone Mar 5, 2021
@jreback
Contributor

jreback commented Mar 5, 2021

> This gives a bit of code duplication (but all large parts of the function were already factored out into shared helper functions before, like _take_preprocess_indexer_and_fill_value), but having a 1D version gives less overhead compared to take_nd.

can you factor even more, this ultimately should be used in the other functions.

@jorisvandenbossche
Member Author

I don't think more is easily possible. The existing take_nd has several places with additional checks related to 2D, and I don't want to add those back or share them in a common helper: the whole purpose of this specialized 1D version is to not have to care about 2D handling (and not incur its overhead).
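
As a rough illustration of the extra work an axis-aware version carries, the sketch below handles both 1D and 2D input. It is an assumption-laden toy, not pandas' take_nd: `take_nd_sketch` is a hypothetical name, only NumPy is used, and -1 is assumed to mark missing positions.

```python
import numpy as np


def take_nd_sketch(arr, indexer, axis=0, fill_value=np.nan):
    """Hypothetical axis-aware take for 1D or 2D arrays (-1 means missing)."""
    # Checks that exist only because the input may be 2D.
    if arr.ndim not in (1, 2):
        raise ValueError("only 1D and 2D arrays are handled in this sketch")
    if not 0 <= axis < arr.ndim:
        raise ValueError("axis out of bounds")

    indexer = np.asarray(indexer, dtype=np.intp)
    mask = indexer == -1
    if not mask.any():
        return arr.take(indexer, axis=axis)

    dtype = np.result_type(arr.dtype, np.asarray(fill_value).dtype)
    out = arr.astype(dtype).take(indexer, axis=axis)

    # The mask has to be applied along the requested axis, so build a
    # selector such as (mask,), (mask, :) or (:, mask).
    selector = [slice(None)] * arr.ndim
    selector[axis] = mask
    out[tuple(selector)] = fill_value
    return out


arr2d = np.arange(6).reshape(3, 2)
print(take_nd_sketch(arr2d, [1, -1], axis=0))
print(take_nd_sketch(arr2d, [-1, 0], axis=1))
```

The dimension check, the axis check and the selector construction are all unnecessary when the caller already knows the array is 1D.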

@jorisvandenbossche
Member Author

> this ultimately should be used in the other functions.

It is used in other functions, e.g. in unstack (added here). Now that the other PR is merged, it can also be used in reindex.
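
For readers unfamiliar with why reindex is a natural consumer of a 1D take, the sketch below reproduces a Series reindex by hand using only public API: compute an indexer for the new labels, take the values, and fill the missing slots. This shows the general pattern only; it is an assumption that it resembles the internal code path, and it does not call the function added in this PR.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
new_index = pd.Index(["c", "x", "a"])

# Position of each new label in the old index; -1 where the label is absent.
indexer = s.index.get_indexer(new_index)

# 1D take on the underlying values, then overwrite the missing positions.
values = s.to_numpy().take(indexer)  # -1 temporarily picks the last element
values[indexer == -1] = np.nan

manual = pd.Series(values, index=new_index)
expected = s.reindex(new_index)
print(manual.equals(expected))
```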

indexer: np.ndarray,
fill_value=None,
allow_fill: bool = True,
):
Member

-> ArrayLike?

@jreback
Contributor

jreback commented Mar 5, 2021

pandas/core/internals/array_manager.py:997: error: Name 'take_nd' is not defined [name-defined]

@jorisvandenbossche
Member Author

Remaining failure also happens on master.

@jorisvandenbossche jorisvandenbossche merged commit 85dd837 into pandas-dev:master Mar 5, 2021
@jorisvandenbossche jorisvandenbossche deleted the am-perf-take-1d branch March 5, 2021 21:34
@jbrockmendel
Member

@jorisvandenbossche pls resist the temptation to self-merge. i'll sleep better.

(no complaints about the PRs you've merged recently, just as you've commented elsewhere about premature merging, better to have good habits and stay on the safe side)

can this new function be used anywhere else, e.g. in Series/Index?

@jorisvandenbossche
Member Author

To be clear, I only do that for PRs that are trivial and/or have been reviewed and are approved or have no active discussion (and thus not for e.g. PRs that impact the API). Of course that's always a bit subjective, so I have no problem with being called out in case you think I made a wrong judgement call. And we are using git, so it's always easy to revert something if needed.

> can this new function be used anywhere else, e.g. in Series/Index?

Maybe? Each individual case would need to be investigated to see whether the constraints are met (e.g. that the input is already validated). I am currently not planning to do such an investigation. In general, unless it's within a performance bottleneck, I would personally mostly limit the use of this function to the internals.
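
The kind of up-front validation a public entry point would typically need before handing off to an unchecked fast path can be sketched as follows. The exact constraints of the function in this PR are not spelled out in this thread, so the checks below (1D integer indexer, values in the range -1 to n-1) are assumptions, and `validate_indexer_sketch` is a hypothetical helper.

```python
import numpy as np


def validate_indexer_sketch(indexer, n):
    """Hypothetical check that `indexer` is safe for a take on length `n`."""
    indexer = np.asarray(indexer)
    if indexer.ndim != 1:
        raise ValueError("indexer must be 1-dimensional")
    if indexer.size == 0:
        return np.empty(0, dtype=np.intp)
    if not np.issubdtype(indexer.dtype, np.integer):
        raise TypeError("indexer must contain integers")
    # -1 is allowed as the 'missing' marker; anything else must be in bounds.
    if indexer.min() < -1 or indexer.max() >= n:
        raise IndexError("indexer values must be in the range [-1, n)")
    return indexer.astype(np.intp, copy=False)


print(validate_indexer_sketch([0, -1, 2], 3))
```

Internal callers that construct the indexer themselves can skip such checks, which is one reason a fast path may be safer to keep internal.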

@jbrockmendel
Member

> in case you think I made a wrong judgement call

All the cases so far, I think, have been benign.

> have been reviewed and are approved

In practice I doubt it will make a difference, but I will be moderately more comfortable if you ping me/jeff. On the other hand if we're not reasonably responsive (timezones notwithstanding), do let me know.

Really the failure mode that I'm most worried about is something slipping through the cracks of my inbox.

Labels
Algos, Performance

3 participants