PERF: nanops #43311


Merged · 2 commits · Aug 31, 2021

Conversation

jbrockmendel (Member)

from asv_bench.benchmarks.stat_ops import *
self = FrameOps()
self.setup("skew", "float", 0)

%timeit self.time_op("skew", "float", 0)
3.22 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master
2.06 ms ± 71.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- PR

%timeit self.time_op("skew", "int", 0)
3.81 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master
2.3 ms ± 79.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- PR

%timeit self.time_op("kurt", "float", 0)
2.79 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master
2.2 ms ± 85.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <- PR

%timeit self.time_op("kurt", "int", 0)
3.29 ms ± 131 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master
2.35 ms ± 69.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- PR

%timeit self.time_op("prod", "float", 0)
2.9 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master
2.23 ms ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- PR

%timeit self.time_op("sum", "float", 0)
2.9 ms ± 194 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- master
2.14 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <- PR
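For context, a minimal sketch of what a skipna reduction like nansum does conceptually. This is illustrative only; the actual pandas nanops code paths being optimized here handle dtypes, min_count, bottleneck dispatch, and more:

```python
import numpy as np

def nansum_sketch(values, axis=0, skipna=True):
    """Illustrative mask-based NaN-aware sum; NOT the real pandas
    nanops.nansum, just the core idea of a skipna reduction."""
    if skipna:
        mask = np.isnan(values)
        # Replace NaNs with the additive identity, then reduce normally.
        return np.where(mask, 0.0, values).sum(axis=axis)
    return values.sum(axis=axis)

values = np.array([[1.0, np.nan], [2.0, 3.0]])
print(nansum_sketch(values, axis=0))  # same result as np.nansum(values, axis=0)
```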

@alimcmaster1 alimcmaster1 added this to the 1.4 milestone Aug 31, 2021
@alimcmaster1 alimcmaster1 added the Performance Memory or execution speed performance label Aug 31, 2021
@alimcmaster1 (Member)

LGTM. One question: how come we only apply this to certain nanops? Why not nanstd, for example?

@jbrockmendel (Member, Author)

Question how come we only apply this to certain nanops

These are the only ones that showed a big difference in the ASV benchmarks.

@jreback jreback merged commit 1633e32 into pandas-dev:master Aug 31, 2021
@jbrockmendel jbrockmendel deleted the perf-reductions branch August 31, 2021 18:32
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
@jorisvandenbossche (Member)

This seems to give a slowdown in the nan* methods when axis 1 is small compared to axis 0:

values = np.random.randn(1000000, 4)

In [9]: %timeit pd.core.nanops.nansum(values, axis=1, skipna=True)
47.9 ms ± 554 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- pandas 1.3
18.4 s ± 808 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # <-- master

This is a gigantic difference, which I noticed in an ArrayManager case. But it also gives a slowdown for the BlockManager with a relatively wide DataFrame:

values = np.random.randn(1000, 4)
df = pd.DataFrame(values).copy()

In [12]: %timeit df.sum(axis=1)
138 µs ± 21.9 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- pandas 1.3
1.83 ms ± 80.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master

Since the performance improvements in the examples in the top post were only small compared to the slowdowns shown above, I would either 1) revert the optimization, or 2) add a threshold on the shape (e.g. only take this custom path if values.shape[1] < (values.shape[0] / 10)).
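Option 2 could be sketched as a small guard in front of the custom path. The helper name is hypothetical, and the exact condition and threshold would need to be settled by benchmarking:

```python
import numpy as np

def use_custom_path(values, axis):
    """Hypothetical shape heuristic for option 2: only take the custom
    axis-1 reduction path when the suggested condition holds."""
    if values.ndim != 2 or axis != 1:
        return False
    return values.shape[1] < values.shape[0] / 10  # threshold from the suggestion above

print(use_custom_path(np.empty((1000, 4)), axis=1))   # True
print(use_custom_path(np.empty((10, 1000)), axis=1))  # False
```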

@jbrockmendel (Member, Author)

add some threshold for the shape

Seems reasonable.

@jorisvandenbossche (Member) commented Nov 21, 2021

Some very rough comparisons:

values = np.random.randn(10000, 100)

In [5]: %timeit pd.core.nanops.nansum(values, axis=1, skipna=True)
2.05 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- pandas 1.3
174 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # <-- master

values = np.random.randn(1000, 1000)

In [7]: %timeit pd.core.nanops.nansum(values, axis=1, skipna=True)
2.02 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- pandas 1.3
17 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

values = np.random.randn(100, 10000)

In [9]: %timeit pd.core.nanops.nansum(values, axis=1, skipna=True)
1.87 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- pandas 1.3
2.4 ms ± 38.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  # <-- master

values = np.random.randn(10, 100000)

In [11]: %timeit pd.core.nanops.nansum(values, axis=1, skipna=True)
2.15 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- pandas 1.3
1.23 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master

Based on this, something like shape[1] / shape[0] > 1000 might be a good criterion.
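Checked against the shapes measured above, that criterion selects only the last case, which is the only one where master was faster. This is a sketch; the cutoff that actually lands may differ:

```python
def wide_enough(shape, cutoff=1000):
    # Hypothetical criterion from the comment above: take the new
    # axis-1 path only for arrays much wider than they are tall.
    nrows, ncols = shape
    return ncols / nrows > cutoff

for shape in [(10_000, 100), (1_000, 1_000), (100, 10_000), (10, 100_000)]:
    print(shape, wide_enough(shape))
# Only (10, 100000) passes, matching the one timing where master won.
```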

@jorisvandenbossche (Member)

Added that in #44566
