
PERF: slowdown in groupby/resample mean() method #39622

Closed
jorisvandenbossche opened this issue Feb 6, 2021 · 11 comments
Labels: Performance (memory or execution speed), Regression (functionality that worked in a prior pandas version), Resample (resample method)

Comments

@jorisvandenbossche
Member

See https://pandas.pydata.org/speed/pandas/#timeseries.ResampleSeries.time_resample?python=3.8&Cython=0.29.21&p-index='datetime'&p-freq='1D'&p-method='mean'&commits=812c3012-71a4cb69

It's from a period in which there were no benchmark runs, so there is no clear indication of which commit (range) is responsible.

Reproducer:

import numpy as np
import pandas as pd

idx = pd.date_range(start="1/1/2000", end="1/1/2001", freq="T")
s = pd.Series(np.random.randn(len(idx)), index=idx)
%timeit s.resample("1D").mean()

Last release:

In [2]: %timeit s.resample("1D").mean()
4.45 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: pd.__version__
Out[3]: '1.2.1'

on master:

In [2]: %timeit s.resample("1D").mean()
6.33 ms ± 430 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So around a 50% slowdown.
And it seems somewhat specific to mean (e.g. I don't see a similar slowdown for max).

@jorisvandenbossche jorisvandenbossche added the Performance, Regression and Resample labels Feb 6, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 6, 2021
@jorisvandenbossche
Member Author

cc @phofl: a recent change to grouped mean is #38934. Now, since it was a fix to improve numerical stability, some performance degradation might be expected (but it might still be worth checking whether there is something to improve).

@phofl
Member

phofl commented Feb 6, 2021

Can confirm that this was caused by #38934. But I don't think there is much we can do if we want to keep using Kahan summation.

8b2ebdb

%timeit s.resample("1D").mean()
3.84 ms ± 74.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

c2c5eba

%timeit s.resample("1D").mean()
5.37 ms ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
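For readers unfamiliar with the technique: #38934 switched the grouped-mean accumulation to Kahan (compensated) summation. A minimal standalone Python sketch (not the actual pandas Cython code) shows both the extra arithmetic per element and the accuracy it buys:

```python
def kahan_sum(values):
    """Kahan (compensated) summation: a running compensation `c`
    recovers the low-order bits lost when adding small values to a
    large running total."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c           # apply the pending correction
        t = total + y       # low-order bits of y may be lost here...
        c = (t - total) - y  # ...but are recovered into c
        total = t
    return total

# Adding 1.0 to 1e16 is below float64 resolution, so a naive sum
# loses every increment; the compensated sum keeps them all.
vals = [1e16] + [1.0] * 1000
print(sum(vals) - 1e16)        # naive: 0.0
print(kahan_sum(vals) - 1e16)  # compensated: 1000.0
```

Each element now costs four floating-point operations instead of one, which is consistent with the slowdown measured above.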

@jorisvandenbossche
Member Author

@phofl in the new code, there is now

                        accum[lab, j] = t
                        out[i, j] = accum[lab, j]

That last line could in principle be out[i, j] = t? That would be one array lookup fewer, which might gain something (although I suppose the improvement would be hardly measurable compared to the additional arithmetic the PR added).

@phofl
Member

phofl commented Feb 6, 2021

That should be the cumsum part? The mean part updates the sum directly.

@jorisvandenbossche
Member Author

Ah, yes, I was just looking at the beginning of the PR's diff; that is indeed in the cumsum code.

@jorisvandenbossche
Member Author

I was googling a bit about the stability of sum in numpy, and saw the mention of pairwise summation, which is said to be almost as good as Kahan summation, but with lower computational cost (https://en.wikipedia.org/wiki/Pairwise_summation). It's also what np.sum uses in most cases.
See the code in numpy here: https://github.com/numpy/numpy/blob/ce82028409c1147a6df62d8f7437e0a9262ee2b7/numpy/core/src/umath/loops_utils.h.src#L73-L138

Now, for a plain sum that seems quite straightforward, but for a grouped sum I suppose it would be difficult to use? (since we don't have the numbers we want to sum in one contiguous array that can easily be split up)

(to be clear, I am no expert in numerical code and stability, I am just wondering about the trade-offs here, because a 50% slowdown is not nothing for probably one of the most used groupby methods)
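As a rough illustration of the trade-off being discussed, here is a minimal pure-Python sketch of pairwise summation (not NumPy's actual implementation, which unrolls a much larger base-case block for speed):

```python
import math

def pairwise_sum(values, lo=0, hi=None):
    """Pairwise (cascade) summation: recursively split the range in
    half and sum the halves, so rounding error grows O(log n) rather
    than O(n) as with a naive left-to-right sum."""
    if hi is None:
        hi = len(values)
    n = hi - lo
    if n <= 8:  # small base case; NumPy uses a larger unrolled block
        total = 0.0
        for i in range(lo, hi):
            total += values[i]
        return total
    mid = lo + n // 2
    return pairwise_sum(values, lo, mid) + pairwise_sum(values, mid, hi)

# math.fsum is exactly rounded, so it serves as the reference value.
vals = [0.1] * 10_000
print(abs(pairwise_sum(vals) - math.fsum(vals)))  # tiny error
```

The recursion needs the whole range up front, which is exactly why it does not map onto a grouped sum where group members are scattered through the array.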

@jorisvandenbossche jorisvandenbossche changed the title PERF: slowdown in Series.resample().mean() PERF: slowdown in groupby/resample mean() method Feb 6, 2021
@mzeitlin11
Member

Agreed that pairwise would be the best of both worlds, but I don't think that could be implemented efficiently because of the issues you mention above. I'd guess that pandas users care more about a 50% performance hit in a common method like GroupBy.sum than about floating point error - maybe we can make Kahan summation opt-in?

Would be pretty straightforward with existing groupby machinery to pass some new argument for accurate fp summation, but would probably end up inconsistent with other parts of our API where we don't allow that. NumPy doesn't even always use pairwise summation - we could also just force users to pass an agg function which will compute the sum with numerical stability if that's their priority.
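The last suggestion is already possible today without any API change. As a sketch (using math.fsum as the numerically stable aggregation; this is an illustration of the opt-in idea, not a recommendation):

```python
import math

import numpy as np
import pandas as pd

idx = pd.date_range(start="1/1/2000", end="1/1/2001", freq="T")
s = pd.Series(np.random.randn(len(idx)), index=idx)

# Opt-in accuracy: aggregate with an exactly rounded sum (Shewchuk's
# algorithm via math.fsum), trading the fast Cython path for a
# Python-level loop over groups.
stable_mean = s.resample("1D").apply(lambda x: math.fsum(x) / len(x))
fast_mean = s.resample("1D").mean()
print(np.allclose(stable_mean, fast_mean))
```

The apply path is far slower than the built-in mean, which is the usual argument for keeping the fast path as the default.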

@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins
Member

(quoting @mzeitlin11's comment above)

for other readers, see also #44526 (comment)

@simonjayhawkins
Member

@jorisvandenbossche I guess the only way to restore the default performance of 1.2.5 is to make the Kahan summation optional. So we should either block 1.3.5 on this, move it to 1.4 and add it as an enhancement, or close as no action?

@jreback
Contributor

jreback commented Nov 27, 2021

close as no action
-1 on adding api or changing

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.5, No action Nov 27, 2021