CLN/PERF: no need for kahan for int group_cumsum #41874

mzeitlin11 · 2021-06-08T16:23:04Z

Surprised at the lack of impact here: this doesn't noticeably affect the existing benchmarks.

Targeting the Cython algo specifically shows an improvement (though smaller than I'd expect given the removed operations):

import numpy as np
import pandas._libs.groupby as libgroupby

N = 4_000_000
vals = np.random.randint(0, 10, (N, 5), dtype=np.int64)
result = np.empty_like(vals)

%timeit libgroupby.group_cumsum(result, vals, np.ones(N, dtype="int"), 1, False)
# 28.9 ms ± 805 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # this pr
# 37.3 ms ± 206 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   # master

mzeitlin11 · 2021-06-08T16:23:49Z

pandas/_libs/groupby.pyx

@@ -253,18 +253,16 @@ def group_cumsum(numeric[:, ::1] out,
                        t = accum[lab, j] + y
                        compensation[lab, j] = t - accum[lab, j] - y
                        accum[lab, j] = t
-                        out[i, j] = accum[lab, j]
+                        out[i, j] = t


Doubt this affects the compiled result, but we may as well not depend on a smart compiler to avoid this extra indexing step.

jreback · 2021-06-08T22:18:04Z

pandas/_libs/groupby.pyx

-                    y = val - compensation[lab, j]
-                    t = accum[lab, j] + y
-                    compensation[lab, j] = t - accum[lab, j] - y
+                    t = val + accum[lab, j]


umm this is affecting all dtypes. do we not have tests for this for small floats?

This is inside the else branch of the check "if numeric == float32_t or numeric == float64_t:", so only non-float dtypes should end up here.

ahh ok that was not clear from the diff

can you add a comment to that effect (maybe just on the float32/64 branch, e.g. using Kahan summation)

Have added a comment
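To make the point of this thread concrete, here is a minimal pure-Python sketch of the two update rules discussed above. It is not the pandas kernel: the function name group_cumsum_sketch, the use_kahan flag, and the flat-list inputs are assumptions made for illustration, and only the three compensated lines and the plain "t = val + accum[lab]" line mirror the diff. In exact integer arithmetic the compensation term t - accum - y is always zero, so the Kahan steps do no useful work for ints, while for floats they recover low-order bits that plain addition drops.

```python
def group_cumsum_sketch(values, labels, ngroups, use_kahan):
    """Running sum per group over a flat list (illustrative only)."""
    zero = type(values[0])(0)  # 0 for ints, 0.0 for floats
    accum = [zero] * ngroups
    compensation = [zero] * ngroups
    out = []
    for val, lab in zip(values, labels):
        if use_kahan:
            # Compensated update: the same three steps as the removed lines.
            y = val - compensation[lab]
            t = accum[lab] + y
            compensation[lab] = t - accum[lab] - y
        else:
            # Plain update, as in the new non-float branch.
            t = val + accum[lab]
        accum[lab] = t
        out.append(t)
    return out


# Integers: both update rules produce identical running sums.
int_vals = [3, 1, 4, 1, 5, 9, 2, 6]
int_labs = [0, 1, 0, 1, 0, 1, 0, 1]
int_plain = group_cumsum_sketch(int_vals, int_labs, 2, use_kahan=False)
int_kahan = group_cumsum_sketch(int_vals, int_labs, 2, use_kahan=True)

# Floats: tiny addends after a large one vanish without compensation.
float_vals = [1.0] + [1e-16] * 1000
float_labs = [0] * len(float_vals)
float_plain = group_cumsum_sketch(float_vals, float_labs, 1, use_kahan=False)
float_kahan = group_cumsum_sketch(float_vals, float_labs, 1, use_kahan=True)
```

Comparing int_plain with int_kahan, and float_plain[-1] with float_kahan[-1], shows why the compensation is kept only on the float32/float64 branch.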

jreback · 2021-06-09T00:28:35Z

hmm seemingly unrelated failures. maybe on master? cc @jbrockmendel

jreback · 2021-06-09T12:19:28Z

thanks @mzeitlin11

mzeitlin11 added 5 commits June 8, 2021 11:25

PERF: group_cumsum ints/datetimelike

e2fb9ac

WIP

246d829

PERF/CLN: no need for kahan for int group_cumsum

6c1b9ca

Add benchmark

8b8a832

Change benchmark name

06e98ab

mzeitlin11 commented Jun 8, 2021


mzeitlin11 added the Algos, Clean, Groupby, and Performance labels Jun 8, 2021

jreback requested changes Jun 8, 2021


mzeitlin11 added 2 commits June 8, 2021 19:39

Merge remote-tracking branch 'upstream/master' into perf/grp_cumsum_int

a24a7f4

Add kahan comment

208e6ed

jreback added this to the 1.3 milestone Jun 9, 2021

Merge remote-tracking branch 'upstream/master' into perf/grp_cumsum_int

6996b57

jreback approved these changes Jun 9, 2021


jreback merged commit 3ca84fc into pandas-dev:master Jun 9, 2021

mzeitlin11 deleted the perf/grp_cumsum_int branch June 9, 2021 12:48

JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021

CLN/PERF: no need for kahan for int group_cumsum (pandas-dev#41874)

37b5c3f
