PERF: slowdown in groupby/resample mean() method #39622
Can confirm that this was caused by #38934. But I don't think there is much we can do if we want to keep using Kahan summation.
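For context, Kahan (compensated) summation carries a running error term that captures the low-order bits lost by each addition and feeds them back into the next one. A minimal pure-Python sketch (an illustration only, not the pandas Cython implementation):

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: keep a running error term and
    fold it back into the next addition (illustrative sketch only)."""
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation            # apply the previously lost low-order bits
        t = total + y                   # this addition may lose low-order bits of y
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total
```

The extra operations per element are the cost being weighed against accuracy in this thread.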
@phofl in the new code, there is now
That last line could in principle be
This should be the cumsum part? The mean part updates the sum directly.
Ah, yes, I was just looking at the beginning of the diff of the PR; that's indeed in the cumsum code.
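As a rough illustration of the kind of per-group compensated update being discussed (a sketch only; `sums`, `compensation`, and `labels` are placeholder names, not necessarily the variables used in the pandas Cython code):

```python
import numpy as np

def grouped_kahan_sum(values, labels, ngroups):
    """Kahan summation with one running sum and one compensation term
    per group (illustrative sketch, not the pandas Cython code)."""
    sums = np.zeros(ngroups)
    compensation = np.zeros(ngroups)
    for val, lab in zip(values, labels):
        y = val - compensation[lab]
        t = sums[lab] + y
        compensation[lab] = (t - sums[lab]) - y
        sums[lab] = t
    return sums
```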
I was googling a bit about the stability of sum in numpy, and saw the mention of pairwise summation, which is said to be almost as good as Kahan summation but with lower computational cost (https://en.wikipedia.org/wiki/Pairwise_summation). It's also what NumPy's sum uses.

Now, for a plain sum that seems quite straightforward, but for a grouped sum I suppose that would be difficult to use? (since we don't have the numbers we want to sum in one contiguous array that can easily be split up)

(to be clear, I am no expert in numerical code and stability, I am just wondering about the trade-offs here, because a 50% slowdown is not nothing for probably one of the most used groupby methods)
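For reference, pairwise summation splits the array recursively and sums the halves, which keeps the rounding error growing only logarithmically while costing roughly the same as a naive loop. A simplified sketch (the block size here is arbitrary):

```python
def pairwise_sum(values, block=128):
    """Recursive pairwise summation (simplified sketch). Real
    implementations fall back to a plain loop below some block size."""
    n = len(values)
    if n <= block:
        total = 0.0
        for x in values:
            total += x
        return total
    mid = n // 2
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)
```

As noted above, this needs the values it sums in one contiguous slice per group, which a grouped sum generally does not have.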
Agreed that pairwise would be the best of both worlds, but I don't think that could be implemented efficiently because of the issues you mention above. I'd guess that pandas users care more about 50% performance in a common method like `mean` than about the improved floating point accuracy.

It would be pretty straightforward with the existing groupby machinery to pass some new argument for accurate fp summation, but it would probably end up inconsistent with other parts of our API where we don't allow that. NumPy doesn't even always use pairwise summation - we could also just force users to pass an
Changing milestone to 1.3.5.
For other readers, see also #44526 (comment).
@jorisvandenbossche I guess the only way to restore the default performance of 1.2.5 is to make the Kahan summation optional. So we should either block 1.3.5 on this, move it to 1.4 and add it as an enhancement, or close it as no action?
Close as no action.
See https://pandas.pydata.org/speed/pandas/#timeseries.ResampleSeries.time_resample?python=3.8&Cython=0.29.21&p-index='datetime'&p-freq='1D'&p-method='mean'&commits=812c3012-71a4cb69
It's from a period when there were no benchmark runs, so there is no clear indication of which commit (range) is responsible.
Reproducer:
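A reproducer along the lines of the linked `timeseries.ResampleSeries` benchmark (the series size and frequency here are guesses, not necessarily the original setup):

```python
import numpy as np
import pandas as pd

# Roughly mirrors the asv timeseries.ResampleSeries parameters
# (index='datetime', freq='1D', method='mean'); sizes are a guess.
idx = pd.date_range("2000-01-01", periods=100_000, freq="min")
ts = pd.Series(np.random.randn(len(idx)), index=idx)

ts.resample("1D").mean()  # time this (e.g. with %timeit) on 1.2.x vs master
```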
Last release:
On master:
So around 50% slowdown.
And it seems somewhat specific to `mean` (e.g. I don't see a similar slowdown for `max`).