PERF: don't sort data twice in groupby apply when not using libreduction fast_apply #40176

jorisvandenbossche · 2021-03-02T19:34:57Z

See #40171 (comment) for context. I noticed that we were calling splitter._get_sorted_data() twice when using the non-fast_apply fallback.

Using the benchmark case from groupby.Apply.time_scalar_function_single/multi_col (like in #40171 (comment)), but then with bigger data (10 ** 6 instead of 10 ** 4):

import numpy as np
from pandas import DataFrame

N = 10 ** 6
labels = np.random.randint(0, 2000, size=N)
labels2 = np.random.randint(0, 3, size=N)
df = DataFrame(
    {
        "key": labels,
        "key2": labels2,
        "value1": np.random.randn(N),
        "value2": ["foo", "bar", "baz", "qux"] * (N // 4),
    }
)
df_am = df._as_manager("array")

I get

In [2]: %timeit df_am.groupby("key").apply(lambda x: 1)
252 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  <-- master
166 ms ± 5.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR
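
To illustrate the idea behind the change, here is a minimal sketch of "sort once, reuse the result". It is not the pandas implementation: `CountingSplitter`, `sorted_data`, `groups`, and `apply_scalar` are hypothetical names made up for this example, and the counter exists only to show that a second request for the sorted data does not trigger a second sort.

```python
import numpy as np
import pandas as pd


class CountingSplitter:
    """Hypothetical stand-in for a groupby splitter (not pandas internals)."""

    def __init__(self, df, key):
        self.df = df
        self.key = key
        self.sort_calls = 0   # how many times the data was actually sorted
        self._sorted = None   # cache for the sorted frame

    def _sort(self):
        self.sort_calls += 1
        order = np.argsort(self.df[self.key].to_numpy(), kind="stable")
        return self.df.iloc[order]

    def sorted_data(self):
        # Sort on first use only; later callers get the cached frame.
        if self._sorted is None:
            self._sorted = self._sort()
        return self._sorted

    def groups(self):
        # Assumes a non-empty frame.
        data = self.sorted_data()
        keys = data[self.key].to_numpy()
        # Positions where the (sorted) key changes mark group boundaries.
        starts = np.flatnonzero(np.r_[True, keys[1:] != keys[:-1]])
        ends = np.r_[starts[1:], len(keys)]
        for start, end in zip(starts, ends):
            yield keys[start], data.iloc[start:end]


def apply_scalar(df, key, func):
    splitter = CountingSplitter(df, key)
    # A first code path asks for the sorted data (e.g. an attempted fast path).
    splitter.sorted_data()
    # The fallback iterates over groups and reuses the cached sorted data.
    result = {k: func(group) for k, group in splitter.groups()}
    return pd.Series(result), splitter.sort_calls
```

For example, `apply_scalar(df, "key", lambda x: 1)` returns the per-group results together with the number of sorts performed, which stays at one regardless of how many callers request the sorted data.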

jreback · 2021-03-02T19:36:04Z

cool. can you add a whatsnew note about perf :->

jreback · 2021-03-02T21:34:34Z

thanks @jorisvandenbossche (prob with future PRs can just add the issue numbers onto this perf for group note, but up 2 you)

PERF: don't sort data twice in groupby apply when not using libreduction fast_apply

3a6b273

jorisvandenbossche added the Groupby and Performance (memory or execution speed performance) labels Mar 2, 2021

jreback added this to the 1.3 milestone Mar 2, 2021

jorisvandenbossche added 2 commits March 2, 2021 20:47

expand existing benchmark to cover case with larger groups

6bd9c9d

add whatsnew

523cf5f

jreback merged commit 9fdb8f6 into pandas-dev:master Mar 2, 2021

jorisvandenbossche deleted the perf-groupby-apply-pyfallback branch March 2, 2021 21:42
