PERF: groupby iteration #51109

jbrockmendel · 2023-02-01T17:23:10Z

closes #xxxx (Replace xxxx with the GitHub issue number)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Lookups on ._constructor and finalize calls add up when iterating over many groups. This does a pre-iteration check to see if we can skip those lookups/calls.

Using the example from #46505

import string
import random

import pandas as pd
import numpy as np
from time import time

rows = 500000
temp = np.random.randint(10**6, size=(rows, 3))
symbols = string.ascii_uppercase + string.digits
string_col = [''.join(random.choices(symbols, k=16)) for _ in range(rows)]
res = np.concatenate((temp, np.array([string_col]).T), axis=1)

df = pd.DataFrame(res)

def _format(x):
    vals = x.values
    if len(vals) > 2:
        return '-'.join(map(str, vals[:-1]))
    return np.nan


gb = df.groupby(3, sort=False)
func = {"col1": (2, _format), "col2": (1, _format)}
%prun -s cumtime gb.agg(**func)

On main this takes 31.655s of which 21.520 are in _chop and 8.357 are in __finalize__. In this PR this takes 20.328s of which 11.237 are in _chop_fast. (still slower than 1.3.5 which was 6.553s

There's a reasonable argument to be made that having separate paths for subclasses is not ideal, which I'll leave to @jorisvandenbossche to make if he's interested.

WillAyd · 2023-02-04T01:17:06Z

Without knowing the context of the regression I'm -.05 to doing this - I think it gets a little wonky over time to have overridden constructors like this scattered through the codebase

jorisvandenbossche · 2023-03-04T13:27:32Z

Technically, this certainly looks fine (in theory, we might want to also checks flags in addition to attrs, i.e. that it has the default values). It's indeed mostly a trade-off about whether this performance improvement is worth the added complexity (personally I don't think this is that complex).

But for the timings, can you do those again with actual %timeit instead of %prun. Running with the profiler adds a generic overhead, and so can magnify differences. If #51784 turns out to be realistic, it might also be useful to see what this PR still gives in improvement on top of that.

jbrockmendel · 2023-03-04T19:58:57Z

can you do those again with actual %timeit instead of %prun

In [2]: %timeit gb.agg(**func)
17.2 s ± 969 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- main
9.93 s ± 268 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- PR

we might want to also checks flags

Good point, will update.

Also can improve the Series case a little bit by pinning _name instead of passing name=ser.name, analogous to #51784.

jorisvandenbossche · 2023-03-05T07:54:34Z

Also can improve the Series case a little bit by pinning _name instead of passing name=ser.name, analogous to #51784.

I don't think that's needed, the fastpath=True should already give the same effect IIRC.

jbrockmendel · 2023-03-24T20:29:13Z

#51784 improved the performance on main on the order of 25%. Updated timings:

In [3]: %timeit gb.agg(**func)
13.5 s ± 472 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- main
10.3 s ± 334 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <- PR

Makes this much less compelling.

jbrockmendel · 2023-04-11T14:31:25Z

No traction here, closing.

PERF: groupby iteration

db3786e

mroeschke added Groupby Performance Memory or execution speed performance labels Feb 1, 2023

jbrockmendel added 2 commits February 1, 2023 13:42

mypy fixup

65d7b5a

Merge branch 'main' into perf-dont-need-finalize

7f77afe

This was referenced Mar 4, 2023

PERF: groupby.agg() regression #46505

Open

PERF: use "_name" instead of "name" as Series metadata to avoid validation #51784

Merged

jbrockmendel mentioned this pull request Mar 16, 2023

PERF: Don't copy if already sorted in DataSplitter #52035

Closed

5 tasks

jorisvandenbossche mentioned this pull request Mar 27, 2023

DEPR: __finalize__ and _metadata #51280

Open

jbrockmendel closed this Apr 11, 2023

jbrockmendel deleted the perf-dont-need-finalize branch April 11, 2023 14:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: groupby iteration #51109

PERF: groupby iteration #51109

jbrockmendel commented Feb 1, 2023

WillAyd commented Feb 4, 2023

jorisvandenbossche commented Mar 4, 2023

jbrockmendel commented Mar 4, 2023

jorisvandenbossche commented Mar 5, 2023

jbrockmendel commented Mar 24, 2023

jbrockmendel commented Apr 11, 2023

PERF: groupby iteration #51109

PERF: groupby iteration #51109

Conversation

jbrockmendel commented Feb 1, 2023

WillAyd commented Feb 4, 2023

jorisvandenbossche commented Mar 4, 2023

jbrockmendel commented Mar 4, 2023

jorisvandenbossche commented Mar 5, 2023

jbrockmendel commented Mar 24, 2023

jbrockmendel commented Apr 11, 2023