
PERF: don't sort data twice in groupby apply when not using libreduction fast_apply #40176

Merged

jorisvandenbossche
Member

See #40171 (comment) for context. I noticed that we were calling splitter._get_sorted_data() twice when using the non-fast_apply fallback.

Using the benchmark case from groupby.Apply.time_scalar_function_single/multi_col (as in #40171 (comment)), but with bigger data (10 ** 6 rows instead of 10 ** 4):

import numpy as np
from pandas import DataFrame

N = 10 ** 6
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)
# _as_manager is a private pandas API; "array" requests the ArrayManager backend
df_am = df._as_manager("array")

I get

In [2]: %timeit df_am.groupby("key").apply(lambda x: 1)
252 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master
166 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR
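
The double call to splitter._get_sorted_data() noted above can be avoided by computing the sorted data once and reusing it. What follows is only a minimal sketch of that idea, not the actual pandas change: the Splitter class, its sort_calls counter, the sorted_data property, and the apply helper are all hypothetical stand-ins written in plain Python.

```python
from functools import cached_property
from itertools import groupby


class Splitter:
    """Hypothetical stand-in for a groupby splitter (not the pandas class)."""

    def __init__(self, labels, values):
        self.labels = labels
        self.values = values
        self.sort_calls = 0  # how many times the expensive sort has run

    def _get_sorted_data(self):
        # Expensive step: order rows so that each group is contiguous.
        self.sort_calls += 1
        return sorted(zip(self.labels, self.values), key=lambda pair: pair[0])

    @cached_property
    def sorted_data(self):
        # Runs _get_sorted_data() on first access only; later accesses reuse it.
        return self._get_sorted_data()

    def __iter__(self):
        # Yield (label, group_values) from the cached sorted rows.
        for label, rows in groupby(self.sorted_data, key=lambda pair: pair[0]):
            yield label, [value for _, value in rows]


def apply(splitter, func):
    # A fast path would inspect the sorted data first ...
    _ = splitter.sorted_data
    # ... and the Python fallback then iterates over the same cached data.
    return {label: func(group) for label, group in splitter}


splitter = Splitter(labels=[2, 1, 2, 1], values=[10, 20, 30, 40])
result = apply(splitter, sum)
print(result, splitter.sort_calls)  # the sort runs once, not twice
```

If __iter__ called self._get_sorted_data() directly instead of going through the cached property, sort_calls would end up at 2 for the same input, which is the kind of duplicated work this PR targets.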

@jorisvandenbossche added the Groupby and Performance (Memory or execution speed performance) labels Mar 2, 2021
@jreback added this to the 1.3 milestone Mar 2, 2021
Contributor

jreback commented Mar 2, 2021

cool. can you add a whatsnew note about perf :->

@jreback merged commit 9fdb8f6 into pandas-dev:master Mar 2, 2021
Contributor

jreback commented Mar 2, 2021

thanks @jorisvandenbossche (prob with future PRs can just add the issue numbers onto this perf for group note, but up 2 you)

@jorisvandenbossche deleted the perf-groupby-apply-pyfallback branch March 2, 2021 21:42