Skip to content

CLN/PERF: no need for kahan for int group_cumsum #41874

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 9, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions asv_bench/benchmarks/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -507,11 +507,11 @@ def time_frame_agg(self, dtype, method):
self.df.groupby("key").agg(method)


class CumminMax:
class Cumulative:
param_names = ["dtype", "method"]
params = [
["float64", "int64", "Float64", "Int64"],
["cummin", "cummax"],
["cummin", "cummax", "cumsum"],
]

def setup(self, dtype, method):
Expand Down
10 changes: 5 additions & 5 deletions pandas/_libs/groupby.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -247,24 +247,24 @@ def group_cumsum(numeric[:, ::1] out,
for j in range(K):
val = values[i, j]

# For floats, use Kahan summation to reduce floating-point
# error (https://en.wikipedia.org/wiki/Kahan_summation_algorithm)
if numeric == float32_t or numeric == float64_t:
if val == val:
y = val - compensation[lab, j]
t = accum[lab, j] + y
compensation[lab, j] = t - accum[lab, j] - y
accum[lab, j] = t
out[i, j] = accum[lab, j]
out[i, j] = t
Copy link
Member Author

@mzeitlin11 mzeitlin11 Jun 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doubt this affects compiled result, but may as well to not depend on a smart compiler avoiding this extra indexing step

else:
out[i, j] = NaN
if not skipna:
accum[lab, j] = NaN
break
else:
y = val - compensation[lab, j]
t = accum[lab, j] + y
compensation[lab, j] = t - accum[lab, j] - y
t = val + accum[lab, j]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm this is affecting all dtypes. do we not have tests for this for small floats?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is inside the else block from an if statement if numeric == float32_t or numeric == float64_t: so only non-floats should end up here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh ok that was not clear from the difff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment to that effect (maybe just on the float32/64 branch, e.g. using Kahan summation)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added a comment

accum[lab, j] = t
out[i, j] = accum[lab, j]
out[i, j] = t


@cython.boundscheck(False)
Expand Down