
PERF: slowdown in groupby/resample mean() method #39622

Closed
jorisvandenbossche opened this issue Feb 6, 2021 · 11 comments
Labels: Performance (memory or execution speed), Regression (functionality that worked in a prior pandas version), Resample (resample method)

Comments

@jorisvandenbossche
Member

See https://pandas.pydata.org/speed/pandas/#timeseries.ResampleSeries.time_resample?python=3.8&Cython=0.29.21&p-index='datetime'&p-freq='1D'&p-method='mean'&commits=812c3012-71a4cb69

It's from a period in which there were no benchmark runs, so there is no clear indication of which commit (range) is responsible.

Reproducer:

import numpy as np
import pandas as pd

idx = pd.date_range(start="1/1/2000", end="1/1/2001", freq="T")
s = pd.Series(np.random.randn(len(idx)), index=idx)
%timeit s.resample("1D").mean()

Last release:

In [2]: %timeit s.resample("1D").mean()
4.45 ms ± 507 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [3]: pd.__version__
Out[3]: '1.2.1'

on master:

In [2]: %timeit s.resample("1D").mean()
6.33 ms ± 430 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So around a 50% slowdown.
And it seems somewhat specific to mean (e.g. I don't see a similar slowdown for max).

@jorisvandenbossche jorisvandenbossche added the Performance, Regression and Resample labels Feb 6, 2021
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Feb 6, 2021
@jorisvandenbossche
Member Author

cc @phofl: a recent change to grouped mean is #38934. Now, since it was a fix to improve numerical stability, some performance degradation might be expected (but it might still be worth checking whether there is something to improve).

@phofl
Member

phofl commented Feb 6, 2021

Can confirm that this was caused by #38934. But I don't think there is much we can do if we want to keep using Kahan summation.

8b2ebdb

%timeit s.resample("1D").mean()
3.84 ms ± 74.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

c2c5eba

%timeit s.resample("1D").mean()
5.37 ms ± 40.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
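For readers unfamiliar with the technique: #38934 switched the grouped-mean accumulation to Kahan (compensated) summation. A minimal standalone Python sketch (not the actual pandas Cython code) shows both the extra arithmetic per element and the accuracy it buys:

```python
def kahan_sum(values):
    """Kahan (compensated) summation: a running compensation `c`
    recovers the low-order bits lost when adding small values to a
    large running total."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c           # apply the pending correction
        t = total + y       # low-order bits of y may be lost here...
        c = (t - total) - y  # ...but are recovered into c
        total = t
    return total

# Adding 1.0 to 1e16 is below float64 resolution, so a naive sum
# loses every increment; the compensated sum keeps them all.
vals = [1e16] + [1.0] * 1000
print(sum(vals) - 1e16)        # naive: 0.0
print(kahan_sum(vals) - 1e16)  # compensated: 1000.0
```

Each element now costs four floating-point operations instead of one, which is consistent with the slowdown measured above.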

@jorisvandenbossche
Member Author

@phofl in the new code, there is now

                        accum[lab, j] = t
                        out[i, j] = accum[lab, j]

That last line could in principle be out[i, j] = t? That would be one array lookup fewer, which might gain something (although I suppose the improvement would be hardly measurable compared to the additional arithmetic the PR added).

@phofl
Member

phofl commented Feb 6, 2021

That should be the cumsum part? The mean part updates the sum directly.

@jorisvandenbossche
Member Author

Ah, yes, I was just looking at the beginning of the PR's diff; that is indeed in the cumsum code.

@jorisvandenbossche
Member Author

I was googling a bit about the stability of sum in numpy, and saw the mention of pairwise summation, which is said to be almost as good as Kahan summation, but with lower computational cost (https://en.wikipedia.org/wiki/Pairwise_summation). It's also what np.sum uses in most cases.
See the code in numpy here: https://github.com/numpy/numpy/blob/ce82028409c1147a6df62d8f7437e0a9262ee2b7/numpy/core/src/umath/loops_utils.h.src#L73-L138

Now, for a plain sum that seems quite straightforward, but for a grouped sum I suppose it would be difficult to use? (since we don't have the numbers we want to sum in one contiguous array that can easily be split up)

(to be clear, I am no expert in numerical code and stability, I am just wondering about the trade-offs here, because a 50% slowdown is not nothing for probably one of the most used groupby methods)
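As a rough illustration of the trade-off being discussed, here is a minimal pure-Python sketch of pairwise summation (not NumPy's actual implementation, which unrolls a much larger base-case block for speed):

```python
import math

def pairwise_sum(values, lo=0, hi=None):
    """Pairwise (cascade) summation: recursively split the range in
    half and sum the halves, so rounding error grows O(log n) rather
    than O(n) as with a naive left-to-right sum."""
    if hi is None:
        hi = len(values)
    n = hi - lo
    if n <= 8:  # small base case; NumPy uses a larger unrolled block
        total = 0.0
        for i in range(lo, hi):
            total += values[i]
        return total
    mid = lo + n // 2
    return pairwise_sum(values, lo, mid) + pairwise_sum(values, mid, hi)

# math.fsum is exactly rounded, so it serves as the reference value.
vals = [0.1] * 10_000
print(abs(pairwise_sum(vals) - math.fsum(vals)))  # tiny error
```

The recursion needs the whole range up front, which is exactly why it does not map onto a grouped sum where group members are scattered through the array.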

@jorisvandenbossche jorisvandenbossche changed the title PERF: slowdown in Series.resample().mean() PERF: slowdown in groupby/resample mean() method Feb 6, 2021
@mzeitlin11
Member

Agreed that pairwise would be the best of both worlds, but I don't think that could be implemented efficiently because of the issues you mention above. I'd guess that pandas users care more about a 50% performance hit in a common method like GroupBy.sum than about floating point error - maybe we can make Kahan summation opt-in?

Would be pretty straightforward with existing groupby machinery to pass some new argument for accurate fp summation, but would probably end up inconsistent with other parts of our API where we don't allow that. NumPy doesn't even always use pairwise summation - we could also just force users to pass an agg function which will compute the sum with numerical stability if that's their priority.
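The last suggestion is already possible today without any API change. As a sketch (using math.fsum as the numerically stable aggregation; this is an illustration of the opt-in idea, not a recommendation):

```python
import math

import numpy as np
import pandas as pd

idx = pd.date_range(start="1/1/2000", end="1/1/2001", freq="T")
s = pd.Series(np.random.randn(len(idx)), index=idx)

# Opt-in accuracy: aggregate with an exactly rounded sum (Shewchuk's
# algorithm via math.fsum), trading the fast Cython path for a
# Python-level loop over groups.
stable_mean = s.resample("1D").apply(lambda x: math.fsum(x) / len(x))
fast_mean = s.resample("1D").mean()
print(np.allclose(stable_mean, fast_mean))
```

The apply path is far slower than the built-in mean, which is the usual argument for keeping the fast path as the default.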

@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins
Member

(quoting @mzeitlin11's comment above)

for other readers, see also #44526 (comment)

@simonjayhawkins
Member

@jorisvandenbossche I guess the only way to restore the default performance of 1.2.5 is to make the Kahan summation optional. So we should either block 1.3.5 on this, move it to 1.4 and add it as an enhancement, or close as no action?

@jreback
Contributor

jreback commented Nov 27, 2021

close as no action
-1 on adding api or changing

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.5, No action Nov 27, 2021