| 1 | +# Library optimizations and benchmarking |
| 2 | + |
| 3 | +Recommended reading: [The Rust performance book](https://nnethercote.github.io/perf-book/title-page.html) |
| 4 | + |
| 5 | +## What to optimize |
| 6 | + |
| 7 | +Prefer optimizing code that accounts for a significant share of the runtime of real-world programs. |
| 8 | +E.g. it's more beneficial to speed up `[T]::sort` than it is to shave off a small allocation in `Command::spawn`, |
| 9 | +because the latter is dominated by its syscall cost. |
| 10 | + |
| 11 | +Issues about slow library code are labeled [I-slow T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-slow+label%3AT-libs) |
| 12 | +and those about code size are labeled [I-heavy T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-heavy+label%3AT-libs). |
| 13 | + |
| 14 | +## Vectorization |
| 15 | + |
| 16 | +Currently explicit SIMD features can't be used in `alloc` or `core`: both crates are compiled with each target's |
| 17 | +baseline feature set, and runtime feature detection is only available in `std`. |
| 18 | + |
| 19 | +In these crates, vectorization can only be achieved by shaping code so that the compiler backend's auto-vectorization passes can understand it. |
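As a rough illustration of "shaping code" for the auto-vectorizer, the sketch below sums a slice in fixed-size chunks with independent accumulators and handles the tail separately. The function name and the `LANES` constant are hypothetical and not taken from the standard library. Whether the backend actually vectorizes the loop depends on the target and compiler version, so the generated assembly should still be checked.

```rust
// Hypothetical lane count; a real choice depends on the target's vector width.
const LANES: usize = 8;

/// Wrapping sum of a slice, written so that the hot loop has a fixed trip
/// count per chunk, independent accumulators and no early exits.
pub fn sum_wrapping(data: &[u32]) -> u32 {
    let mut acc = [0u32; LANES];
    let mut chunks = data.chunks_exact(LANES);

    // Hot loop: every chunk has exactly LANES elements, so each accumulator
    // lane only ever depends on its own previous value.
    for chunk in &mut chunks {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a = a.wrapping_add(x);
        }
    }

    // Combine the lanes, then process the leftover elements one by one.
    let mut total = acc.iter().fold(0u32, |t, &a| t.wrapping_add(a));
    for &x in chunks.remainder() {
        total = total.wrapping_add(x);
    }
    total
}

fn main() {
    let data: Vec<u32> = (1..=100).collect();
    println!("{}", sum_wrapping(&data));
}
```

The split into a chunked main loop and a scalar remainder loop is one common shape; iterator chains over slices can work just as well when they do not hide bounds checks or branches in the loop body.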
| 20 | + |
| 21 | +## rustc-perf |
| 22 | + |
| 23 | +For parts of the standard library that are heavily used by rustc itself it can be convenient to use |
| 24 | +[the benchmark server](https://github.com/rust-lang/rustc-perf/tree/master/collector#benchmarking). |
| 25 | + |
| 26 | +Since it measures only the compile time of crates, not their runtime performance, it can't be used to benchmark features |
| 27 | +that aren't used by the compiler, e.g. floating point code, linked lists, mpsc channels, etc. |
| 28 | +For those, explicit benchmarks must be written or extracted from real-world code. |
| 29 | + |
| 30 | +## Built-in Microbenchmarks |
| 31 | + |
| 32 | +The built-in benchmarks use [cargo bench](https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html) |
| 33 | +and can be found in the `benches` directory for `core` and `alloc` and in `test` modules in `std`. |
| 34 | + |
| 35 | +The benchmarks are automatically run in a loop by `Bencher::iter` to average the runtime over many loop iterations. |
| 36 | +For CPU-bound microbenchmarks, the runtime of a single iteration should be in the range of nano- to microseconds. |
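To make the averaging idea concrete, here is a minimal sketch of a timing loop that runs on stable Rust. It is a simplified stand-in and not the implementation of `Bencher::iter`; the helper `average_runtime` and the fixed iteration count are assumptions made for illustration.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `f` `iters` times and returns the average
/// wall-time per iteration.
fn average_runtime<T>(iters: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iters > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iters {
        // black_box discourages the optimizer from discarding the result.
        black_box(f());
    }
    start.elapsed() / iters
}

fn main() {
    let data: Vec<u64> = (0..1_000).collect();
    // black_box on the input discourages hoisting the sum out of the loop.
    let avg = average_runtime(10_000, || black_box(&data).iter().sum::<u64>());
    println!("average per iteration: {avg:?}");
}
```

Because a single iteration here takes far less time than the timer overhead and system noise, only the average over many iterations is meaningful, which is the reason the harness loops in the first place.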
| 37 | + |
| 38 | +A specific benchmark can be invoked without recompiling rustc |
| 39 | +via `./x bench library/<lib> --stage 0 --test-args <benchmark name>`. |
| 40 | + |
| 41 | +`cargo bench` measures wall-time. This is often good enough, but small changes such as saving a few instructions |
| 42 | +in a bigger function can get drowned out by system noise. In such cases the following changes can make runs more |
| 43 | +reproducible: |
| 44 | + |
| 45 | +* disable incremental builds in `config.toml` |
| 46 | +* build std and the benchmarks with `RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"` |
| 47 | +* ensure the system is as idle as possible |
| 48 | +* [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html) |
| 49 | +* [pin](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core |
| 50 | +* [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting), |
| 51 | + especially on thermal-limited systems such as laptops |
| 52 | + |
| 53 | +## Standalone tests |
| 54 | + |
| 55 | +If `x` or the cargo benchmark harness gets in the way, it can be useful to extract the benchmark into a separate crate, |
| 56 | +e.g. to run it under `perf stat` or cachegrind. |
| 57 | + |
| 58 | +Build and link the [stage1](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain) |
| 59 | +compiler as a rustup toolchain and then use that to build the standalone benchmark with a modified standard library. |
| 60 | + |
| 61 | +[Currently](https://github.com/rust-lang/rust/issues/101691) there is no convenient way to invoke a stage0 toolchain with |
| 62 | +a modified standard library. To avoid the compiler rebuild it can be useful to not only extract the benchmark but also |
| 63 | +the code under test into a separate crate. |
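A standalone crate of this kind can be as small as a single `main.rs`. The sketch below is only an illustration: `count_ascii_whitespace` is a hypothetical placeholder for whatever code was copied out of the library, and the input and iteration count are arbitrary choices.

```rust
use std::hint::black_box;

/// Hypothetical copy of the code under test, extracted from the library so
/// that it can be edited without rebuilding the toolchain.
#[inline(never)]
fn count_ascii_whitespace(bytes: &[u8]) -> usize {
    bytes.iter().filter(|b| b.is_ascii_whitespace()).count()
}

fn main() {
    // Deterministic input and a fixed iteration count keep the executed
    // work identical from run to run.
    let input: Vec<u8> = b"lorem ipsum dolor sit amet "
        .iter()
        .copied()
        .cycle()
        .take(1 << 16)
        .collect();

    let mut checksum = 0usize;
    for _ in 0..1_000 {
        // black_box discourages the optimizer from hoisting the call out of
        // the loop or specializing it for the known input.
        checksum = checksum.wrapping_add(count_ascii_whitespace(black_box(&input)));
    }

    // Printing the checksum keeps the result observable.
    println!("{checksum}");
}
```

A fixed amount of work, rather than a time-based loop, makes instruction counts from tools such as `perf stat` or cachegrind easier to compare between the baseline and the modified version.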
| 64 | + |
| 65 | +## Running under perf-record |
| 66 | + |
| 67 | +If extracting the code into a separate crate is impractical, one can first build the benchmark, then run it |
| 68 | +under `perf record`, and finally drill down to the benchmark kernel with `perf report`. |
| 69 | + |
| 70 | +```terminal,ignore |
| 71 | +# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations |
| 72 | +$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2" |
| 73 | + |
| 74 | +# build benchmark without running it |
| 75 | +$ ./x bench --stage 0 library/core/ --test-args skipallbenches |
| 76 | + |
| 77 | +# run the benchmark under perf |
| 78 | +$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name> |
| 79 | +$ perf report |
| 80 | +``` |
| 81 | + |
| 82 | +Renaming `perf.data` keeps it from being overwritten by subsequent runs, so that it can later be compared |
| 83 | +with `perf diff` against runs with a modified library. |
| 84 | + |
| 85 | +## Comparing assembly |
| 86 | + |
| 87 | +While `perf report` shows the assembly of the benchmark code, it can sometimes be difficult to get a good overview of what |
| 88 | +changed, especially when multiple benchmarks were affected. As an alternative, one can extract and diff the assembly |
| 89 | +directly from the benchmark suite. |
| 90 | + |
| 91 | +```terminal,ignore |
| 92 | +# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations |
| 93 | +$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2" |
| 94 | + |
| 95 | +# build benchmark libs |
| 96 | +$ ./x bench --stage 0 library/core/ --test-args skipallbenches |
| 97 | + |
| 98 | +# this should print something like the following |
| 99 | +Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a) |
| 100 | + |
| 101 | +# get the assembly for all the benchmarks |
| 102 | +$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \ |
| 103 | + build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \ |
| 104 | + | rustfilt > baseline.asm |
| 105 | + |
| 106 | +# switch to the branch with the changes |
| 107 | +$ git switch feature-branch |
| 108 | + |
| 109 | +# repeat the procedure above |
| 110 | +$ ./x bench ... |
| 111 | +$ objdump ... > changes.asm |
| 112 | + |
| 113 | +# compare output |
| 114 | +$ kdiff3 baseline.asm changes.asm |
| 115 | +``` |
| 116 | + |
| 117 | +This can also be applied to standalone benchmarks. |