From fbb1d076c739d8c7b43fe171259cc2bf3c33d4b6 Mon Sep 17 00:00:00 2001
From: The 8472
Date: Mon, 3 Oct 2022 13:56:33 +0200
Subject: [PATCH 1/2] add a page on optimizations and profiling

---
 src/SUMMARY.md                       |   1 +
 src/development/perf-benchmarking.md | 117 +++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)
 create mode 100644 src/development/perf-benchmarking.md

diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 5b0860d..955f7fa 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -12,6 +12,7 @@
 ---

 - [Building and debugging libraries](./development/building-and-debugging.md)
+- [Performance optimizations and benchmarking](./development/perf-benchmarking.md)

 ---

diff --git a/src/development/perf-benchmarking.md b/src/development/perf-benchmarking.md
new file mode 100644
index 0000000..c232019
--- /dev/null
+++ b/src/development/perf-benchmarking.md
@@ -0,0 +1,117 @@
+# Library optimizations and benchmarking
+
+Recommended reading: [The Rust performance book](https://nnethercote.github.io/perf-book/title-page.html)
+
+## What to optimize
+
+It's preferred to optimize code that shows up as significant in real-world code.
+E.g. it's more beneficial to speed up `[T]::sort` than it is to shave off a small allocation in `Command::spawn`
+because the latter is dominated by its syscall cost.
+
+Issues about slow library code are labeled as [I-slow T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-slow+label%3AT-libs)
+and those about code size as [I-heavy T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-heavy+label%3AT-libs)
+
+## Vectorization
+
+Currently explicit SIMD features can't be used in alloc or core because runtime feature-detection is only available in std
+and they are compiled with each target's baseline feature set.
+
+Vectorization can only be achieved by shaping code in a way that the compiler backend's auto-vectorization passes can understand.
+
+## rustc-perf
+
+For parts of the standard library that are heavily used by rustc itself it can be convenient to use
+[the benchmark server](https://github.com/rust-lang/rustc-perf/tree/master/collector#benchmarking).
+
+Since it only measures compile-time but not runtime performance of crates it can't be used to benchmark for features
+that aren't used by the compiler, e.g. floating point code, linked lists, mpsc channels, etc.
+For those, explicit benchmarks must be written or extracted from real-world code.
+
+## Built-in Microbenchmarks
+
+The built-in benchmarks use [cargo bench](https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html)
+and can be found in the `benches` directory for `core` and `alloc` and in `test` modules in `std`.
+
+The benchmarks are automatically run in a loop by `Bencher::iter` to average the runtime over many loop-iterations.
+For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds.
+
+A specific benchmark can be invoked without recompiling rustc
+via `./x bench library/ --stage 0 --test-args `.
+
+`cargo bench` measures wall-time. This often is good enough, but small changes such as saving a few instructions
+in a bigger function can get drowned out by system noise. In such cases the following changes can make runs more
+reproducible:
+
+* disable incremental builds in `config.toml`
+* build std and the benchmarks with `RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"`
+* ensure the system is as idle as possible
+* [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html)
+* [pinning](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core
+* [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting),
+  especially on thermal-limited systems such as laptops
+
+## Standalone tests
+
+If `x` or the cargo benchmark harness get in the way it can be useful to extract the benchmark into a separate crate,
+e.g. to run it under `perf stat` or cachegrind.
+
+Build and link the [stage1](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
+compiler as rustup toolchain and then use that to build the standalone benchmark with a modified standard library.
+
+[Currently](https://github.com/rust-lang/rust/issues/101691) there is no convenient way to invoke a stage0 toolchain with
+a modified standard library. To avoid the compiler rebuild it can be useful to not only extract the benchmark but also
+the code under test into a separate crate.
+
+## Running under perf-record
+
+If extracting the code into a separate crate is impractical one can first build the benchmark and then run it again
+under `perf record` and then drill down to the benchmark kernel with `perf report`.
+
+```terminal,ignore
+# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
+$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"
+
+# build benchmark without running it
+$ ./x bench --stage 0 library/core/ --test-args skipallbenches
+
+# run the benchmark under perf
+$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args
+$ perf report
+```
+
+By renaming `perf.data` to keep it from getting overwritten by subsequent runs it can be later compared to runs with
+a modified library with `perf diff`.
+
+## Comparing assembly
+
+While `perf report` shows assembly of the benchmark code it can sometimes be difficult to get a good overview of what
+changed, especially when multiple benchmarks were affected. As an alternative one can extract and diff the assembly
+directly from the benchmark suite.
+
+```terminal,ignore
+# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
+$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"
+
+# build benchmark libs
+$ ./x bench --stage 0 library/core/ --test-args skipallbenches
+
+# this should print something like the following
+Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a)
+
+# get the assembly for all the benchmarks
+$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \
+  build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \
+  | rustfilt > baseline.asm
+
+# switch to the branch with the changes
+$ git switch feature-branch
+
+# repeat the procedure above
+$ ./x bench ...
+$ objdump ... > changes.asm
+
+# compare output
+$ kdiff3 baseline.asm changes.asm
+```
+
+This can also be applied to standalone benchmarks.
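
As a rough illustration of the Vectorization section and of the kind of change whose assembly is worth diffing, the sketch below sums an `f32` slice in two ways. It is not taken from the standard library; the function names `sum_serial` and `sum_lanes` and the lane count of 8 are assumptions made for this example. Without fast-math style flags the backend generally has to keep floating-point additions in source order, so the serial version usually stays scalar, while the version with independent accumulators gives the auto-vectorizer a shape it can typically map to SIMD adds. Whether that actually happens depends on the target and compiler version, which is what the `objdump` comparison is for.

```rust
/// Straightforward reduction: one serial dependency chain of `f32` additions.
/// The backend usually has to preserve this order, which tends to block SIMD.
pub fn sum_serial(data: &[f32]) -> f32 {
    data.iter().sum()
}

/// The same reduction shaped as several independent accumulators ("lanes").
/// The lane count of 8 is an arbitrary choice for this sketch.
pub fn sum_lanes(data: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];

    let chunks = data.chunks_exact(LANES);
    // Elements that do not fill a whole chunk are handled separately below.
    let tail = chunks.remainder();

    for chunk in chunks {
        // Each accumulator only depends on its own previous value.
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }

    let mut total: f32 = acc.iter().sum();
    for &x in tail {
        total += x;
    }
    total
}

fn main() {
    // Small integers are exactly representable in f32, so both orders of
    // summation produce the same result for this input.
    let data: Vec<f32> = (0..1003).map(|i| i as f32).collect();
    println!("serial = {}, lanes = {}", sum_serial(&data), sum_lanes(&data));
}
```

Note that reordering floating-point additions can change the result for general inputs, so a rewrite like this is a behavior change that needs to be justified separately from the speedup.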
From 722cb2fbc61f87182bd6a6d575390a164034130c Mon Sep 17 00:00:00 2001
From: The 8472
Date: Sat, 18 Feb 2023 15:38:31 +0100
Subject: [PATCH 2/2] - reword vectorization section - mention scaling governors - linking stage0 as rustup toolchain is now supported

---
 src/development/perf-benchmarking.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/src/development/perf-benchmarking.md b/src/development/perf-benchmarking.md
index c232019..dcbb611 100644
--- a/src/development/perf-benchmarking.md
+++ b/src/development/perf-benchmarking.md
@@ -13,10 +13,11 @@ and those about code size as [I-heavy T-libs](https://github.com/rust-lang/rust/

 ## Vectorization

-Currently explicit SIMD features can't be used in alloc or core because runtime feature-detection is only available in std
-and they are compiled with each target's baseline feature set.
-
-Vectorization can only be achieved by shaping code in a way that the compiler backend's auto-vectorization passes can understand.
+Currently only baseline target features (e.g. SSE2 on x86_64-unknown-linux-gnu) can be used in core and alloc because
+runtime feature-detection is only available in std.
+Where possible the preferred way to achieve vectorization is by shaping code in a way that the compiler
+backend's auto-vectorization passes can understand. This benefits user crates compiled with additional target features
+when they instantiate generic library functions, e.g. iterators.

 ## rustc-perf

@@ -47,6 +48,8 @@ reproducible:
 * ensure the system is as idle as possible
 * [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html)
 * [pinning](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core
+* change the CPU [scaling governor](https://wiki.archlinux.org/title/CPU_frequency_scaling#Scaling_governors)
+  to a fixed-frequency one (`performance` or `powersave`)
 * [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting),
   especially on thermal-limited systems such as laptops

@@ -55,11 +58,10 @@ reproducible:
 If `x` or the cargo benchmark harness get in the way it can be useful to extract the benchmark into a separate crate,
 e.g. to run it under `perf stat` or cachegrind.

-Build and link the [stage1](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
-compiler as rustup toolchain and then use that to build the standalone benchmark with a modified standard library.
+Build the standard library and link [stage0-sysroot](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
+as a rustup toolchain and then use that to build the standalone benchmark with a modified standard library.

-[Currently](https://github.com/rust-lang/rust/issues/101691) there is no convenient way to invoke a stage0 toolchain with
-a modified standard library. To avoid the compiler rebuild it can be useful to not only extract the benchmark but also
+If the std rebuild times are too long for fast iteration it can be useful to not only extract the benchmark but also
 the code under test into a separate crate.

 ## Running under perf-record
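
The standalone approach described in the "Standalone tests" section, where both the benchmark and the code under test are extracted into a separate crate, could look roughly like the sketch below. Everything in it is hypothetical: `count_ascii_whitespace` merely stands in for library code copied out for experimentation, and `time_iters` is a deliberately naive timing loop, not a replacement for `Bencher::iter`. It only relies on `std::hint::black_box` and `std::time::Instant` from stable std.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the code under test, copied into this crate so
/// that it can be changed and rebuilt without rebuilding std.
pub fn count_ascii_whitespace(input: &[u8]) -> usize {
    input.iter().filter(|b| b.is_ascii_whitespace()).count()
}

/// Calls `f` `iters` times and returns the total elapsed wall time.
/// `black_box` keeps the optimizer from discarding the unused results.
pub fn time_iters<R>(iters: u32, mut f: impl FnMut() -> R) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    // Arbitrary example input; real measurements should use realistic data.
    let input = b"the quick brown fox\njumps over the lazy dog ".repeat(1024);
    let iters = 10_000;

    let elapsed = time_iters(iters, || {
        count_ascii_whitespace(black_box(input.as_slice()))
    });
    println!("{:?} per iteration", elapsed / iters);
}
```

Because the resulting binary has no harness around it, it can be run directly under `perf stat` or cachegrind, and the noise-reduction measures from the list of reproducibility changes apply to it in the same way.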