Fix Regression when using sum/cumsum on Groupby objects #44526
Conversation
With how INF and NEGINF are defined here,

pandas/pandas/_libs/missing.pyx, lines 35 to 37 in 3aeba78:

```python
cdef:
    float64_t INF = <float64_t>np.inf
    float64_t NEGINF = -INF
```
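To illustrate what these constants buy you: comparing a value against `INF`/`NEGINF` flags exactly the infinities, matching an explicit `isinf` check, while NaN (which compares unequal to everything) is left alone. A minimal pure-Python sketch (the real code is Cython; this is only illustrative):

```python
import math
import numpy as np

# Module-level constants mirroring the Cython definitions above
INF = np.inf
NEGINF = -INF

for v in [1.5, -2.0, np.inf, -np.inf, np.nan]:
    # Equality against the two sentinels is equivalent to isinf;
    # NaN compares unequal to both, so it is not flagged
    by_compare = (v == INF) or (v == NEGINF)
    assert by_compare == math.isinf(v)
```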
Performance was reduced by introducing Kahan summation. I don't think we can reduce it further. We have to find a way around the GIL.

Got around the GIL. Now it's still the additional checks for
Usually a build error when you see that. In this case:
```diff
@@ -51,7 +51,14 @@ from pandas._libs.missing cimport checknull
 cdef int64_t NPY_NAT = get_nat()
 _int64_max = np.iinfo(np.int64).max

 cdef float64_t NaN = <float64_t>np.NaN
+cdef:
+    float32_t MINfloat32 = np.NINF
```
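For context on the diff above: a sentinel like `MINfloat32` set to negative infinity is the natural identity value when accumulating a running maximum, which is what such constants are typically used for in the grouped reductions. A hedged Python sketch (the `running_max` helper is made up for illustration; `np.NINF` from the diff is the same value as `-np.inf` and was removed in NumPy 2.0):

```python
import numpy as np

# -inf as the identity element for a max-reduction: every real
# float32 compares greater than it, and an empty input keeps it
MINfloat32 = np.float32(-np.inf)

def running_max(values):
    # Illustrative helper, not pandas code
    best = MINfloat32
    for v in values:
        if v > best:
            best = v
    return best

assert running_max(np.float32([3.0, -1.0, 7.5])) == np.float32(7.5)
assert running_max([]) == MINfloat32  # empty group stays at the sentinel
```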
we have some common definitions of these elsewhere in a pxd
FYI these are being added independently here: #44522
What do the benchmarks look like here? Since we already have the regression #39622 from using Kahan summation (and this adds decent complexity), I would like to propose returning to the previous simple summation for 1.3.5, which would fix both that regression and this one. Maybe we could expose Kahan summation as an optional keyword instead (which a user can simply avoid if their data has infs?).
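For reference, the regression mentioned (#39622) comes from how compensated summation interacts with infinities. A minimal pure-Python sketch of Kahan summation (the pandas version is Cython; this is only illustrative):

```python
import math

def kahan_sum(values):
    """Compensated (Kahan) summation: carries a correction term so
    low-order bits lost to rounding are re-added on later steps."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

# The compensation recovers accuracy that plain summation loses:
assert sum([0.1] * 10) != 1.0          # plain sum drifts
assert kahan_sum([0.1] * 10) == 1.0    # compensated sum is exact here

# But with an infinity the correction term becomes inf - inf = nan,
# which then poisons the result -- the shape of regression #39622:
assert sum([1.0, math.inf, 2.0]) == math.inf
assert math.isnan(kahan_sum([1.0, math.inf, 2.0]))
```

This is why the PR needs the extra inf checks that cost performance: the compensation step must be skipped (or the result patched up) whenever an infinity enters the accumulator.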
Here are the performance measures:

```python
import numpy as np
import pandas as pd

arr = np.ones(1000000)
s = pd.Series(np.arange(len(arr)))
s = s / 10

%timeit -r7 -n1000 s.groupby(arr).sum()
# on this branch
# 15.4 ms ± 172 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# on 1.3.4
# 15.4 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

idx = pd.date_range(start="1/1/2000", end="1/1/2001", freq="T")
s = pd.Series(np.random.randn(len(idx)), index=idx)

%timeit -r7 -n1000 s.resample("1D").mean()
# on this branch
# 4.67 ms ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# on 1.3.4
# 4.61 ms ± 35.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```

So there is a performance regression of 2-3%.
Could you please run the asvs?
After running the asv benchmarks, this is a notable performance hit:

```python
from asv_bench.benchmarks.groupby import *

self = GroupByCythonAgg()
self.setup("float64", "sum")

%timeit -r 30 -n 100 self.time_frame_agg("float64", "sum")
# this PR
# 29.7 ms ± 257 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
# master
# 20.1 ms ± 279 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
```
Needs a release note. Adding the 1.3.5 milestone to match the linked issue. Thoughts on backporting these changes?
Move this off 1.3.x; it's too late to change anything.
Force-pushed from 27ef25b to 6f6da9a
Thanks @jbrockmendel for pointing me to the isinf function from numpy. This also improves the performance:

```python
In [1]: from asv_bench.benchmarks.groupby import *

In [2]: self = GroupByCythonAgg()

In [3]: self.setup("float64", "sum")

In [4]: %timeit -r 30 -n 100 self.time_frame_agg("float64", "sum")
# This PR
# 25.3 ms ± 377 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
# master
# 19.8 ms ± 355 µs per loop (mean ± std. dev. of 30 runs, 100 loops each)
```
This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

Thanks for the PR, but it appears to have gone stale. If interested in continuing, please merge the main branch and we can reopen.
NOTE: This will reduce performance significantly. Hopefully someone points me to a better check for `nan` which can be done with the GIL released.
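On that last point: one GIL-free candidate is the IEEE 754 self-comparison trick, since NaN is the only floating-point value that compares unequal to itself; in Cython this compiles down to a plain C comparison with no Python API involved. A Python sketch of the idea (the `is_nan` helper name is only for illustration):

```python
import math

def is_nan(x: float) -> bool:
    # IEEE 754: NaN is the only float unequal to itself, so this
    # needs no library call (and, in C/Cython, no GIL)
    return x != x

assert is_nan(float("nan"))
assert not is_nan(0.0)
assert not is_nan(1.5)
assert not is_nan(math.inf)
assert not is_nan(-math.inf)
```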