|
| 1 | +# Library optimizations and benchmarking |
| 2 | + |
| 3 | +Recommended reading: [The Rust performance book](https://nnethercote.github.io/perf-book/title-page.html) |
| 4 | + |
| 5 | +## What to optimize |
| 6 | + |
| 7 | +It's preferred to optimize code that shows up as significant in real-world code. |
| 8 | +E.g. it's more beneficial to speed up `[T]::sort` than it is to shave off a small allocation in `Command::spawn` |
| 9 | +because the latter is dominated by its syscall cost. |
| 10 | + |
| 11 | +Issues about slow library code are labeled as [I-slow T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-slow+label%3AT-libs) |
| 12 | +and those about code size as [I-heavy T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-heavy+label%3AT-libs). |
| 13 | + |
| 14 | +## Vectorization |
| 15 | + |
| 16 | +Currently only baseline target features (e.g. SSE2 on x86_64-unknown-linux-gnu) can be used in core and alloc because |
| 17 | +runtime feature-detection is only available in std. |
| 18 | +Where possible the preferred way to achieve vectorization is by shaping code in a way that the compiler |
| 19 | +backend's auto-vectorization passes can understand. This benefits user crates compiled with additional target features |
| 20 | +when they instantiate generic library functions, e.g. iterators. |
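As an illustration of shaping code for the auto-vectorizer, here is a minimal sketch that is not taken from the standard library. The function names and the lane count of 8 are assumptions chosen for the example. A sequential `f32` sum has to preserve the order of additions, which generally keeps the backend from vectorizing it. Independent accumulators over fixed-size chunks make that reordering explicit in the source. Whether vector instructions are actually emitted depends on the target, the enabled features and the optimization level, so the generated assembly should still be checked.

```rust
/// Straightforward reduction: every addition depends on the previous one.
pub fn sum_sequential(data: &[f32]) -> f32 {
    data.iter().sum()
}

/// Same reduction with independent per-lane accumulators (hypothetical helper).
/// The additions are associated differently, so for general inputs the
/// rounding can differ from `sum_sequential`.
pub fn sum_chunked(data: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed width, not tuned for any target
    let mut acc = [0.0f32; LANES];

    let chunks = data.chunks_exact(LANES);
    let remainder = chunks.remainder();

    // Each lane only depends on its own accumulator.
    for chunk in chunks {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }

    // Combine the lanes, then handle the tail that did not fill a chunk.
    let mut total: f32 = acc.iter().sum();
    for &x in remainder {
        total += x;
    }
    total
}
```

Because the chunked version changes the order of floating-point additions, it is only a drop-in replacement where that difference in rounding is acceptable.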
| 21 | + |
| 22 | +## rustc-perf |
| 23 | + |
| 24 | +For parts of the standard library that are heavily used by rustc itself it can be convenient to use |
| 25 | +[the benchmark server](https://github.com/rust-lang/rustc-perf/tree/master/collector#benchmarking). |
| 26 | + |
| 27 | +Since it only measures compile-time but not runtime performance of crates, it can't be used to benchmark features |
| 28 | +that aren't used by the compiler, e.g. floating point code, linked lists, mpsc channels, etc. |
| 29 | +For those, explicit benchmarks must be written or extracted from real-world code. |
| 30 | + |
| 31 | +## Built-in Microbenchmarks |
| 32 | + |
| 33 | +The built-in benchmarks use [cargo bench](https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html) |
| 34 | +and can be found in the `benches` directory for `core` and `alloc` and in `test` modules in `std`. |
| 35 | + |
| 36 | +The benchmarks are automatically run in a loop by `Bencher::iter` to average the runtime over many loop-iterations. |
| 37 | +For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds. |
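To make the idea of averaging over loop iterations concrete, here is a simplified sketch on stable Rust. It is not the implementation of `Bencher::iter`, and the name `average_runtime` and its signature are assumptions made for this example. The real harness is more elaborate.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Calls `f` `iterations` times and returns the average wall time per call.
/// Hypothetical stand-in for a benchmark harness loop.
pub fn average_runtime<T>(iterations: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from discarding the unused result.
        black_box(f());
    }
    start.elapsed() / iterations
}
```

A single timer around the whole loop amortizes the cost of reading the clock, which is why per-iteration runtimes in the nano- to microsecond range can still be measured.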
| 38 | + |
| 39 | +A specific benchmark can be invoked without recompiling rustc |
| 40 | +via `./x bench library/<lib> --stage 0 --test-args <benchmark name>`. |
| 41 | + |
| 42 | +`cargo bench` measures wall-time. This often is good enough, but small changes such as saving a few instructions |
| 43 | +in a bigger function can get drowned out by system noise. In such cases the following changes can make runs more |
| 44 | +reproducible: |
| 45 | + |
| 46 | +* disable incremental builds in `config.toml` |
| 47 | +* build std and the benchmarks with `RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"` |
| 48 | +* ensure the system is as idle as possible |
| 49 | +* [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html) |
| 50 | +* [pin](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core |
| 51 | +* change the CPU [scaling governor](https://wiki.archlinux.org/title/CPU_frequency_scaling#Scaling_governors) |
| 52 | + to a fixed-frequency one (`performance` or `powersave`) |
| 53 | +* [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting), |
| 54 | + especially on thermal-limited systems such as laptops |
| 55 | + |
| 56 | +## Standalone tests |
| 57 | + |
| 58 | +If `x` or the cargo benchmark harness gets in the way, it can be useful to extract the benchmark into a separate crate, |
| 59 | +e.g. to run it under `perf stat` or cachegrind. |
| 60 | + |
| 61 | +Build the standard library and link [stage0-sysroot](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain) |
| 62 | +as a rustup toolchain, then use that toolchain to build the standalone benchmark against the modified standard library. |
| 63 | + |
| 64 | +If the std rebuild times are too long for fast iteration it can be useful to not only extract the benchmark but also |
| 65 | +the code under test into a separate crate. |
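A rough sketch of what such an extracted crate could contain is shown here. The `kernel` function stands in for the code under test, and both it and `run` are hypothetical names. A fixed iteration count without any timing logic is assumed, so that the measurement is left to the external tool.

```rust
use std::hint::black_box;

/// Hypothetical code under test, copied out of the library for fast iteration.
pub fn kernel(data: &mut [u32]) {
    data.sort_unstable();
}

/// Runs the kernel a fixed number of times on the same input and returns a
/// checksum so the work cannot be optimized away.
pub fn run(iterations: u64) -> u64 {
    // Reverse-sorted input; any representative data could be used instead.
    let input: Vec<u32> = (0..1024).rev().collect();
    let mut checksum = 0u64;
    for _ in 0..iterations {
        let mut data = black_box(input.clone());
        kernel(&mut data);
        // After sorting, the last element is the maximum (1023).
        checksum += u64::from(*black_box(&data).last().unwrap());
    }
    checksum
}
```

The `main` function of the standalone crate would call `run` with a constant iteration count and print the checksum. The resulting binary can then be started directly under `perf stat` or cachegrind.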
| 66 | + |
| 67 | +## Running under perf-record |
| 68 | + |
| 69 | +If extracting the code into a separate crate is impractical one can first build the benchmark and then run it again |
| 70 | +under `perf record` and then drill down to the benchmark kernel with `perf report`. |
| 71 | + |
| 72 | +```terminal,ignore |
| 73 | +# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations |
| 74 | +$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2" |
| 75 | + |
| 76 | +# build benchmark without running it |
| 77 | +$ ./x bench --stage 0 library/core/ --test-args skipallbenches |
| 78 | + |
| 79 | +# run the benchmark under perf |
| 80 | +$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name> |
| 81 | +$ perf report |
| 82 | +``` |
| 83 | + |
| 84 | +Renaming `perf.data` keeps it from getting overwritten by subsequent runs, so it can later be compared with `perf diff` |
| 85 | +against runs with a modified library. |
| 86 | + |
| 87 | +## Comparing assembly |
| 88 | + |
| 89 | +While `perf report` shows assembly of the benchmark code it can sometimes be difficult to get a good overview of what |
| 90 | +changed, especially when multiple benchmarks were affected. As an alternative one can extract and diff the assembly |
| 91 | +directly from the benchmark suite. |
| 92 | + |
| 93 | +```terminal,ignore |
| 94 | +# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations |
| 95 | +$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2" |
| 96 | + |
| 97 | +# build benchmark libs |
| 98 | +$ ./x bench --stage 0 library/core/ --test-args skipallbenches |
| 99 | + |
| 100 | +# this should print something like the following |
| 101 | +Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a) |
| 102 | + |
| 103 | +# get the assembly for all the benchmarks |
| 104 | +$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \ |
| 105 | + build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \ |
| 106 | + | rustfilt > baseline.asm |
| 107 | + |
| 108 | +# switch to the branch with the changes |
| 109 | +$ git switch feature-branch |
| 110 | + |
| 111 | +# repeat the procedure above |
| 112 | +$ ./x bench ... |
| 113 | +$ objdump ... > changes.asm |
| 114 | + |
| 115 | +# compare output |
| 116 | +$ kdiff3 baseline.asm changes.asm |
| 117 | +``` |
| 118 | + |
| 119 | +This can also be applied to standalone benchmarks. |