ASV: docs, runtime, stability, build acceleration, PR integration? #23412


Open
h-vetinari opened this issue Oct 29, 2018 · 5 comments

@h-vetinari
Contributor

h-vetinari commented Oct 29, 2018

Currently, https://pandas.pydata.org/pandas-docs/stable/contributing.html says:

Running the full test suite can take up to one hour and use up to 3GB of RAM.

On a standard laptop with 8GB RAM and 4 cores, this was more like 6.5h last night.

I recently updated the ASV code (as recommended by contributing.html) with
pip install git+https://github.com/spacetelescope/asv
and it seems that in v0.4, ASV runs each commit/benchmark in two rounds, effectively doubling the runtime? (It may well be that I don't understand exactly what the rounds are supposed to do, but ASV ran much faster before.)

Quite a lot of time is also spent on the environment builds, and I was wondering whether it would be possible to reuse the logic from python setup.py build_ext --inplace -j 4 to only cythonize the modules whose code has changed (probably more of an asv issue).

Furthermore, the runs are annoyingly noisy. For example, after running asv continuous -f 1.1 upstream/master HEAD overnight, with nothing else running on the machine (all other applications closed), I got something like this:

       before           after         ratio
     [360e7271]       [19c7c1f8]
     <master>         <unique_inverse_cython>
+        1.25±0ms         93.8±0ms    75.00  frame_ctor.FromRecords.time_frame_from_records_generator(None)
+        1.41±0ms       6.25±0.6ms     4.44  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
+        14.1±0ms         62.5±8ms     4.44  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+     1.88±0.08ms       5.21±0.4ms     2.78  reindex.DropDuplicates.time_frame_drop_dups_int(True)
+        22.7±2μs         62.5±0μs     2.75  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+      4.62±0.3ms         12.5±0ms     2.70  index_object.Indexing.time_get_loc_non_unique('Float')
+      1.41±0.2ms       3.47±0.9ms     2.47  index_object.Indexing.time_get_loc_non_unique_sorted('Int')
+        273±20μs          625±0μs     2.29  indexing.NumericSeriesIndexing.time_ix_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+       703±200μs         1.56±0ms     2.22  inference.NumericInferOps.time_subtract(<class 'numpy.uint16'>)
+        938±60μs         1.59±0ms     1.70  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        703±80μs         1.17±0ms     1.67  frame_methods.Iteration.time_iteritems_cached
+      1.09±0.1ms       1.63±0.1ms     1.50  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+     1.09±0.08ms       1.62±0.1ms     1.48  indexing.NumericSeriesIndexing.time_loc_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
+         938±0μs       1.35±0.1ms     1.44  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+     1.09±0.08ms         1.56±0ms     1.43  inference.NumericInferOps.time_subtract(<class 'numpy.int16'>)
+         938±0μs       1.30±0.1ms     1.39  indexing.NumericSeriesIndexing.time_ix_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
+         141±0μs         194±20μs     1.38  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+        729±40μs          938±0μs     1.29  indexing.NumericSeriesIndexing.time_loc_list_like(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
[...]

However, I didn't trust the results because there were equally strong divergences in the other direction.

Upon rerunning asv continuous -f 1.1 upstream/master HEAD -b "^(re)?index", all of those divergences vanished and were replaced by the following (among other divergences):

       before           after         ratio
     [360e7271]       [19c7c1f8]
     <master>         <unique_inverse_cython>
+      3.12±0.2ms         15.6±2ms     5.00  index_object.Indexing.time_get_loc_non_unique_sorted('Float')
+        20.3±1μs         93.8±0μs     4.62  indexing.NonNumericSeriesIndexing.time_getitem_scalar('datetime', 'nonunique_monotonic_inc')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

This points to a larger issue: letting people run the ASVs on their private machines is not the most rigorous approach, as it is prone to bias (or even manipulation) and exposed to whatever else happens to be running on the machine at the time.

Finally, many of the divergences are not shown at all if asv's internals deem the results too noisy. This is a general point to keep in mind, because IMO it can mask real regressions simply because the runs are noisy. I've opened airspeed-velocity/asv#752 for that.

Summing up, I think that:

  • the asv section in the docs should be updated (at least concerning the estimated runtime)
  • maybe consider pinning an ASV version?
  • disable rounds if they are deemed unnecessary (or not worth the runtime trade-off)
  • find ways to reduce build times if possible (esp. for exploratory runs with -b "some_regex")
  • have an ASV job executed on a worker (e.g. azure) that isn't triggered by default, but can be started by a core dev for PRs that need it. This should greatly improve the stability of the results (a very controlled environment with little background noise), and is also much more transparent.
@WillAyd
Member

WillAyd commented Oct 29, 2018

Do you have a way of figuring out how long individual benchmarks are running? IIRC there is a setup_cache method which may make sense to use in place of the normal setup if it yields the same info yet saves on unnecessary setup time for some benchmarks.

@WillAyd added the labels Benchmark (Performance (ASV) benchmarks) and Performance (Memory or execution speed performance) on Oct 29, 2018
@jbrockmendel
Member

@h-vetinari what you've written matches my experience.

@WillAyd my understanding (grain of salt) of setup_cache is that it is helpful when the setup method is expensive, but it still involves I/O + unpickling, so it doesn't save much in cases with fairly cheap setup (which I think most of ours are, but I haven't measured).
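
Roughly how I understand the mechanics, with made-up benchmarks (same grain of salt applies):

    import numpy as np
    import pandas as pd

    # the usual pattern: setup() builds the data again in every process/round
    class PlainSetup:
        def setup(self):
            self.df = pd.DataFrame(np.random.randn(100000, 10))

        def time_copy(self):
            self.df.copy()

    # with setup_cache: the data is built once, pickled to disk, and handed
    # to the time_* methods as their first argument
    class CachedSetup:
        def setup_cache(self):
            return pd.DataFrame(np.random.randn(100000, 10))

        def time_copy(self, df):
            df.copy()

So the expensive construction happens only once, but each benchmark process still pays the load/unpickle cost, which is the I/O + unpickling overhead I mean above.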

Most of our benchmarks are both fairly fast and non-stateful, i.e. if there were an option to only run setup once (like setup_class in tests) instead of before each call of time_foo, we would cut a lot of runtime. (... I think)

Another intuition I have no evidence for: asv makes a ton of subprocess.Popen calls, with the child process writing results to a file (JSON, IIRC) and the parent process reading them back. This may be related to the fact that asv bends over backwards to ensure major backward-compatibility; not totally sure on this one. I think a lot of that subprocess overhead is not necessary for our use cases.

For the noise, the idea I've pitched to the asv folks is to interleave the benchmarks instead of running all-benchmarks-for-commit-A followed by all-benchmarks-for-commit-B. IIRC they said that would require a pretty serious reworking of asv to implement.

One bright spot: there has been some discussion within asv (I haven't checked on it in a while, no idea how it went) about an option to aggregate results across runs. So if I run asv continuous ... K times, it should eventually accumulate a large enough sample size to give meaningful results.

@pv
Contributor

pv commented Nov 1, 2018

@jbrockmendel: the interleaved benchmarks were already implemented in 0.3, ditto for the option to aggregate results.

@pv
Contributor

pv commented Nov 1, 2018

On laptops I'd recommend at least pinning the CPU frequency/performance settings to get stable results, see https://asv.readthedocs.io/en/stable/tuning.html#tuning-machines-for-benchmarking (and on Linux, disabling thermald, which also tunes the frequencies on the fly).

How much collecting statistics across multiple runs helps depends on how large the CPU frequency fluctuations are. The default setting for asv is to collect statistics over k=2 reruns of the whole suite (and of course multiple samples for each single benchmark, but those are strongly correlated). Suppose, for example, that each benchmark has probability p of being run at the "slow" CPU frequency and is otherwise run at the "fast" frequency. Pandas has N=2618 benchmarks (counting parameterized ones separately); suppose the suite is interleaved k times. Then Pr[every benchmark gets at least one sample at the "fast" CPU frequency] = (1 - p^k)^N. For k=2, N=2618 and e.g. p=0.05 this is ~0; for k=5 (asv run -a processes=5) it is >0.99. Reality may be worse, as the CPU state is probably correlated between benchmarks.
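
The same numbers in code, under the (optimistic) assumption that the benchmarks are independent:

    # probability that *every* benchmark gets at least one sample at the "fast" CPU frequency
    N = 2618   # pandas benchmarks, counting parameter combinations
    p = 0.05   # chance that a single run happens at the "slow" frequency
    for k in (2, 5):   # number of interleaved reruns of the suite
        print(k, (1 - p**k) ** N)
    # k=2 -> ~0.0014 (essentially zero); k=5 -> ~0.9992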

So the choice is: either the benchmark run takes a long time (i.e. not asv's default settings), or the system is configured to produce stable timings.

On Linux, the forkserver mode pushes per-benchmark overhead to <~50 ms, so most of the time is spent running the user code; on Windows the overhead is bigger, as import pandas is quite slow. The default settings choose repeat counts such that roughly 1 second in total is spent on each timing benchmark.

Pandas has 2618 benchmarks (counting parameter combinations), so in theory a full run should take about 45 min (2618 × ~1 s ≈ 44 min), maybe double that on Windows. This doesn't count time taken by setup methods (pandas probably has several slow ones), nor benchmarks whose runtime is itself several seconds.

Finding slow setup() routines: https://gist.github.com/pv/9f51381ec8a488c34920749d0f00c43e

Pandas also seems to set sample_time = 0.2 for all benchmarks; that's twice the default, so the bare benchmarking part takes twice as long. I'm not really convinced that increasing this option from the default improves the accuracy of the results.
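
For reference, sample_time is a per-benchmark attribute, so the override looks roughly like this (illustrative class, not the actual pandas code):

    class SomeTimingBenchmark:
        # approximate time to spend per timing sample; pandas appears to set 0.2 suite-wide
        sample_time = 0.2

        def time_something(self):
            pass

Dropping the attribute (or lowering it) should shrink the bare benchmarking time correspondingly.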

@WillAyd
Member

WillAyd commented Nov 7, 2018

Here's the output of @pv's script, if anyone was curious:

asv_setup_times.txt
