PERF: reduce overhead in groupby _cython_operation #40317


Merged

Conversation

jorisvandenbossche (Member)

BaseGrouper._cython_operation does a lot of dtype checking etc., which adds significant overhead relative to the actual aggregation function (for small data). This is a first set of small changes to cut down that overhead.

Using the GroupManyLabels benchmark we have

import numpy as np
from pandas import DataFrame

ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = DataFrame(data)
df_am = df._as_manager('array')

gives

In [2]: %timeit df_am.groupby(labels).sum()
48.9 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
39.1 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR

@jorisvandenbossche added the Groupby and Performance labels Mar 9, 2021
-    if isna_compat(arr, fill_value):
-        arr.fill(fill_value)
+    if arr.dtype.kind not in ("u", "i", "b"):
+        arr.fill(np.nan)
jorisvandenbossche (Member, Author)

maybe_fill is only used in _cython_operation, and thus is always called with an ndarray and with np.nan as the fill value, so I simplified the function based on those assumptions.
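Based on the diff above, the simplified helper reduces to a few lines (a minimal sketch of the idea, not the exact pandas source):

```python
import numpy as np

def maybe_fill(arr: np.ndarray) -> np.ndarray:
    # Unsigned int, signed int, and bool arrays cannot hold NaN,
    # so only fill the remaining dtypes (float, complex, object, ...).
    if arr.dtype.kind not in ("u", "i", "b"):
        arr.fill(np.nan)
    return arr
```
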

        # we use iNaT for the missing value on ints
        # so pre-convert to guard this condition
        if (values == iNaT).any():
            values = ensure_float64(values)
        else:
            values = ensure_int_or_float(values)
    elif is_numeric and not is_complex_dtype(values):
        values = ensure_float64(ensure_float(values))
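For context, iNaT is pandas' integer NaT sentinel (the minimum int64 value); the pre-conversion above keeps it from being aggregated as real data. A small illustration, with iNaT defined locally since the real constant lives in pandas internals:

```python
import numpy as np

# pandas' iNaT sentinel is the minimum int64 value
iNaT = np.iinfo(np.int64).min

values = np.array([1, 2, iNaT], dtype=np.int64)

# If the sentinel appears, cast to float64 so it can become NaN
# instead of being summed as a huge negative number.
if (values == iNaT).any():
    values = values.astype(np.float64)
    values[values == iNaT] = np.nan
```
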
jorisvandenbossche (Member, Author)

ensure_float was previously needed to ensure that a nullable float EA was converted to a float ndarray. But EAs now already take a different code path above.
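What ensure_float effectively did for a nullable float EA can be reproduced with the public API (a sketch using Series.to_numpy rather than the internal helper):

```python
import numpy as np
import pandas as pd

# A nullable-float extension array holding a missing value
ser = pd.Series([1.0, 2.5, None], dtype="Float64")

# Convert the EA to a plain float64 ndarray, mapping pd.NA to np.nan
arr = ser.to_numpy(dtype="float64", na_value=np.nan)
```
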

@jbrockmendel (Member)

can you also report how the non-AM perf is affected?

@jorisvandenbossche (Member, Author)

> can you also report how the non-AM perf is affected?

For this benchmark, it's not really affected. The overhead I am optimizing here is only around 1% for the BlockManager case. It only becomes a noticeable bottleneck when those checks are done 1000x instead of 1x (the benchmark has 1000 columns in a single block).

I assumed that for a small Series, the overhead might also be noticeable for non-AM. But trying this, I don't really see much difference, because for a single column the dominant cost is the actual factorization (which for 1000 columns is also only done once). Using the code from above:

In [2]: %timeit df[0].groupby(labels).sum()
384 µs ± 4.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- master
374 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- PR
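The factorization that dominates the single-column timing can be illustrated with the public pd.factorize (the groupby machinery uses internal equivalents; this is just a sketch):

```python
import numpy as np
import pandas as pd

labels = np.random.randint(0, 100, size=1000)

# Map each label to an integer code and collect the unique labels;
# groupby performs this step once per call, regardless of ncols.
codes, uniques = pd.factorize(labels)
```
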

@jorisvandenbossche (Member, Author)

(some failing tests related to (unsupported) operations on a categorical column that I still need to address)

@jorisvandenbossche (Member, Author)

This is all passing now

@@ -556,12 +556,12 @@ def infer_fill_value(val):
     return np.nan


-def maybe_fill(arr, fill_value=np.nan):
+def maybe_fill(arr: np.ndarray):
jbrockmendel (Member)

-> np.ndarray? (or just remove the return statement i guess)

    dtype = values.dtype
    if is_numeric:
        # never an invalid op for those dtypes, so return early as fastpath
        return

    if is_categorical_dtype(dtype) or is_sparse(dtype):
jbrockmendel (Member)

if we're micro-optimizing, these can be replaced with isinstance(dtype, FooDtype) checks
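The suggested micro-optimization replaces the general dtype-check helpers with direct isinstance checks, which skip the helpers' input normalization. A sketch for the CategoricalDtype case (the SparseDtype check would be analogous):

```python
import pandas as pd
from pandas import CategoricalDtype

ser = pd.Series(["a", "b", "a"], dtype="category")

# A direct isinstance check on the dtype object avoids the overhead of
# the more general is_categorical_dtype helper, which also accepts
# arrays, Series, and dtype strings like "category".
is_cat = isinstance(ser.dtype, CategoricalDtype)
```
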

jorisvandenbossche (Member, Author)

I was planning to do a follow-up PR with several other dtype-check optimizations, so I will include it there.

@jbrockmendel (Member)

LGTM ex one requested annotation, cc @jreback

@jreback jreback added this to the 1.3 milestone Mar 12, 2021
@jreback (Contributor) left a comment

lgtm, surprised this has any speedup...

once you can fix the annotation go ahead and merge

@jorisvandenbossche jorisvandenbossche merged commit f111175 into pandas-dev:master Mar 15, 2021
@jorisvandenbossche jorisvandenbossche deleted the am-perf-groupby-ops branch March 15, 2021 15:14
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Groupby, Performance (Memory or execution speed performance)
3 participants