PERF: reduce overhead in groupby _cython_operation #40317


Merged

Conversation

jorisvandenbossche (Member)

BaseGrouper._cython_operation does a lot of dtype checking etc., which adds significant overhead relative to the actual aggregation function (for small data). This is a first set of small changes to cut down that overhead.

Using the GroupManyLabels benchmark we have

import numpy as np
from pandas import DataFrame

ncols = 1000
N = 1000
data = np.random.randn(N, ncols)
labels = np.random.randint(0, 100, size=N)
df = DataFrame(data)
df_am = df._as_manager('array')

gives

In [2]: %timeit df_am.groupby(labels).sum()
48.9 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- master
39.1 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <-- PR

@jorisvandenbossche added the Groupby and Performance labels Mar 9, 2021
-    if isna_compat(arr, fill_value):
-        arr.fill(fill_value)
+    if arr.dtype.kind not in ("u", "i", "b"):
+        arr.fill(np.nan)
jorisvandenbossche (Member, Author)

maybe_fill is only used in _cython_operation, and thus is always called with an ndarray and with np.nan as the fill value, so I simplified the function based on those assumptions.
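Based on the diff above, the simplified helper reduces to a few lines (a minimal sketch of the idea, not the exact pandas source):

```python
import numpy as np

def maybe_fill(arr: np.ndarray) -> np.ndarray:
    # Unsigned int, signed int, and bool arrays cannot hold NaN,
    # so only fill the remaining dtypes (float, complex, object, ...).
    if arr.dtype.kind not in ("u", "i", "b"):
        arr.fill(np.nan)
    return arr
```
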

        # we use iNaT for the missing value on ints
        # so pre-convert to guard this condition
        if (values == iNaT).any():
            values = ensure_float64(values)
        else:
            values = ensure_int_or_float(values)
    elif is_numeric and not is_complex_dtype(values):
        values = ensure_float64(ensure_float(values))
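For context, iNaT is pandas' integer NaT sentinel (the minimum int64 value); the pre-conversion above keeps it from being aggregated as real data. A small illustration, with iNaT defined locally since the real constant lives in pandas internals:

```python
import numpy as np

# pandas' iNaT sentinel is the minimum int64 value
iNaT = np.iinfo(np.int64).min

values = np.array([1, 2, iNaT], dtype=np.int64)

# If the sentinel appears, cast to float64 so it can become NaN
# instead of being summed as a huge negative number.
if (values == iNaT).any():
    values = values.astype(np.float64)
    values[values == iNaT] = np.nan
```
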
jorisvandenbossche (Member, Author)

ensure_float was previously needed to ensure that a nullable float EA was converted to a float ndarray. But EAs now already take a different code path above.
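What ensure_float effectively did for a nullable float EA can be reproduced with the public API (a sketch using Series.to_numpy rather than the internal helper):

```python
import numpy as np
import pandas as pd

# A nullable-float extension array holding a missing value
ser = pd.Series([1.0, 2.5, None], dtype="Float64")

# Convert the EA to a plain float64 ndarray, mapping pd.NA to np.nan
arr = ser.to_numpy(dtype="float64", na_value=np.nan)
```
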

@jbrockmendel (Member)

can you also report how the non-AM perf is affected?

@jorisvandenbossche (Member, Author)

> can you also report how the non-AM perf is affected?

For this benchmark, it's not really affected. The overhead I am optimizing here is only around 1% for the BlockManager case. It only becomes a noticeable bottleneck when those checks are done 1000x instead of 1x (the benchmark has 1000 columns in a single block).

I assumed that for a small Series, the overhead might also be noticeable for non-AM. But trying this, I don't really see much difference, because for a single column the dominant cost is the actual factorization (which for 1000 columns is also only done once). Using the code from above:

In [2]: %timeit df[0].groupby(labels).sum()
384 µs ± 4.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- master
374 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <-- PR
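The factorization that dominates the single-column timing can be illustrated with the public pd.factorize (the groupby machinery uses internal equivalents; this is just a sketch):

```python
import numpy as np
import pandas as pd

labels = np.random.randint(0, 100, size=1000)

# Map each label to an integer code and collect the unique labels;
# groupby performs this step once per call, regardless of ncols.
codes, uniques = pd.factorize(labels)
```
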

@jorisvandenbossche (Member, Author)

(some failing tests related to (unsupported) operations on a categorical column that I still need to address)

@jorisvandenbossche (Member, Author)

This is all passing now

@@ -556,12 +556,12 @@ def infer_fill_value(val):
     return np.nan


-def maybe_fill(arr, fill_value=np.nan):
+def maybe_fill(arr: np.ndarray):
jbrockmendel (Member)

-> np.ndarray? (or just remove the return statement i guess)

    dtype = values.dtype
    if is_numeric:
        # never an invalid op for those dtypes, so return early as fastpath
        return

    if is_categorical_dtype(dtype) or is_sparse(dtype):
jbrockmendel (Member)

if we're micro-optimizing, these can be replaced with isinstance(dtype, FooDtype) checks
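The suggested micro-optimization replaces the general dtype-check helpers with direct isinstance checks, which skip the helpers' input normalization. A sketch for the CategoricalDtype case (the SparseDtype check would be analogous):

```python
import pandas as pd
from pandas import CategoricalDtype

ser = pd.Series(["a", "b", "a"], dtype="category")

# A direct isinstance check on the dtype object avoids the overhead of
# the more general is_categorical_dtype helper, which also accepts
# arrays, Series, and dtype strings like "category".
is_cat = isinstance(ser.dtype, CategoricalDtype)
```
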

jorisvandenbossche (Member, Author)

I was planning to do a follow-up PR with several other dtype-check optimizations, so I will include it there.

@jbrockmendel (Member)

LGTM ex one requested annotation, cc @jreback

@jreback jreback added this to the 1.3 milestone Mar 12, 2021
@jreback (Contributor) left a comment

lgtm, surprised this has any speedup...

once you can fix the annotation go ahead and merge

@jorisvandenbossche jorisvandenbossche merged commit f111175 into pandas-dev:master Mar 15, 2021
@jorisvandenbossche jorisvandenbossche deleted the am-perf-groupby-ops branch March 15, 2021 15:14
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Groupby, Performance (Memory or execution speed performance)
3 participants