From fbb1d076c739d8c7b43fe171259cc2bf3c33d4b6 Mon Sep 17 00:00:00 2001
From: The 8472 <git@infinite-source.de>
Date: Mon, 3 Oct 2022 13:56:33 +0200
Subject: [PATCH 1/2] add a page on optimizations and profiling

---
 src/SUMMARY.md                       |   1 +
 src/development/perf-benchmarking.md | 117 +++++++++++++++++++++++++++
 2 files changed, 118 insertions(+)
 create mode 100644 src/development/perf-benchmarking.md
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
index 5b0860d..955f7fa 100644
--- a/src/SUMMARY.md
+++ b/src/SUMMARY.md
@@ -12,6 +12,7 @@
 ---
 
 - [Building and debugging libraries](./development/building-and-debugging.md)
+- [Performance optimizations and benchmarking](./development/perf-benchmarking.md)
 
 
 ---
diff --git a/src/development/perf-benchmarking.md b/src/development/perf-benchmarking.md
new file mode 100644
index 0000000..c232019
--- /dev/null
+++ b/src/development/perf-benchmarking.md
@@ -0,0 +1,117 @@
+# Library optimizations and benchmarking
+
+Recommended reading: [The Rust performance book](https://nnethercote.github.io/perf-book/title-page.html)
+
+## What to optimize
+
+Prioritize optimizing code that accounts for a significant share of runtime in real-world programs.
+E.g. it's more beneficial to speed up `[T]::sort` than it is to shave off a small allocation in `Command::spawn`
+because the latter is dominated by its syscall cost.
+
+Issues about slow library code are labeled as [I-slow T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-slow+label%3AT-libs)
+and those about code size as [I-heavy T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-heavy+label%3AT-libs).
+
+## Vectorization
+
+Currently explicit SIMD features can't be used in alloc or core because runtime feature-detection is only available in std
+and they are compiled with each target's baseline feature set.
+
+Vectorization can only be achieved by shaping code in a way that the compiler backend's auto-vectorization passes can understand.
+
+## rustc-perf
+
+For parts of the standard library that are heavily used by rustc itself it can be convenient to use
+[the benchmark server](https://github.com/rust-lang/rustc-perf/tree/master/collector#benchmarking).
+
+Since it measures only compile-time and not the runtime performance of crates, it can't be used to benchmark features
+that aren't used by the compiler, e.g. floating point code, linked lists, mpsc channels, etc.
+For those explicit benchmarks must be written or extracted from real-world code.
+
+## Built-in Microbenchmarks
+
+The built-in benchmarks use [cargo bench](https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html)
+and can be found in the `benches` directory for `core` and `alloc` and in `test` modules in `std`.
+
+The benchmarks are automatically run in a loop by `Bencher::iter` to average the runtime over many loop-iterations.
+For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds.
+
+A specific benchmark can be invoked without recompiling rustc
+via `./x bench library/<lib> --stage 0 --test-args <benchmark name>`.
+
+`cargo bench` measures wall-time. This often is good enough, but small changes such as saving a few instructions
+in a bigger function can get drowned out by system noise. In such cases the following changes can make runs more
+reproducible:
+
+* disable incremental builds in `config.toml`
+* build std and the benchmarks with `RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"`
+* ensure the system is as idle as possible
+* [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html)
+* [pinning](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core
+* [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting),
+  especially on thermal-limited systems such as laptops
+
+## Standalone tests
+
+If `x` or the cargo benchmark harness get in the way it can be useful to extract the benchmark into a separate crate,
+e.g. to run it under `perf stat` or cachegrind.
+
+Build and link the [stage1](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
+compiler as rustup toolchain and then use that to build the standalone benchmark with a modified standard library.
+
+[Currently](https://github.com/rust-lang/rust/issues/101691) there is no convenient way to invoke a stage0 toolchain with
+a modified standard library. To avoid the compiler rebuild it can be useful to not only extract the benchmark but also
+the code under test into a separate crate.
+
+## Running under perf-record
+
+If extracting the code into a separate crate is impractical one can first build the benchmark and then run it again
+under `perf record` and then drill down to the benchmark kernel with `perf report`.
+
+```terminal,ignore
+# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
+$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"
+
+# build benchmark without running it
+$ ./x bench --stage 0 library/core/ --test-args skipallbenches
+
+# run the benchmark under perf
+$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name>
+$ perf report
+```
+
+By renaming `perf.data` to keep it from getting overwritten by subsequent runs it can be later compared to runs with
+a modified library with `perf diff`.
+
+## Comparing assembly
+
+While `perf report` shows assembly of the benchmark code it can sometimes be difficult to get a good overview of what
+changed, especially when multiple benchmarks were affected. As an alternative one can extract and diff the assembly
+directly from the benchmark suite.
+
+```terminal,ignore
+# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
+$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"
+
+# build benchmark libs
+$ ./x bench --stage 0 library/core/ --test-args skipallbenches
+
+# this should print something like the following
+Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a)
+
+# get the assembly for all the benchmarks
+$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \
+  build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \
+  | rustfilt > baseline.asm
+
+# switch to the branch with the changes
+$ git switch feature-branch
+
+# repeat the procedure above
+$ ./x bench ...
+$ objdump ... > changes.asm
+
+# compare output
+$ kdiff3 baseline.asm changes.asm
+```
+
+This can also be applied to standalone benchmarks.

From 722cb2fbc61f87182bd6a6d575390a164034130c Mon Sep 17 00:00:00 2001
From: The 8472 <git@infinite-source.de>
Date: Sat, 18 Feb 2023 15:38:31 +0100
Subject: [PATCH 2/2] - reword vectorization section - mention scaling
 governors - linking stage0 as rustup toolchain is now supported

---
 src/development/perf-benchmarking.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/src/development/perf-benchmarking.md b/src/development/perf-benchmarking.md
index c232019..dcbb611 100644
--- a/src/development/perf-benchmarking.md
+++ b/src/development/perf-benchmarking.md
@@ -13,10 +13,11 @@ and those about code size as [I-heavy T-libs](https://github.com/rust-lang/rust/
 
 ## Vectorization
 
-Currently explicit SIMD features can't be used in alloc or core because runtime feature-detection is only available in std
-and they are compiled with each target's baseline feature set.
-
-Vectorization can only be achieved by shaping code in a way that the compiler backend's auto-vectorization passes can understand.
+Currently only baseline target features (e.g. SSE2 on x86_64-unknown-linux-gnu) can be used in core and alloc because
+runtime feature-detection is only available in std.
+Where possible the preferred way to achieve vectorization is by shaping code in a way that the compiler
+backend's auto-vectorization passes can understand. This benefits user crates compiled with additional target features
+when they instantiate generic library functions, e.g. iterators.
 
 ## rustc-perf
 
@@ -47,6 +48,8 @@ reproducible:
 * ensure the system is as idle as possible
 * [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html)
 * [pinning](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core
+* change the CPU [scaling governor](https://wiki.archlinux.org/title/CPU_frequency_scaling#Scaling_governors)
+  to a fixed-frequency one (`performance` or `powersave`)
 * [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting),
   especially on thermal-limited systems such as laptops
 
@@ -55,11 +58,10 @@ reproducible:
 If `x` or the cargo benchmark harness get in the way it can be useful to extract the benchmark into a separate crate,
 e.g. to run it under `perf stat` or cachegrind.
 
-Build and link the [stage1](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
-compiler as rustup toolchain and then use that to build the standalone benchmark with a modified standard library.
+Build the standard library and link [stage0-sysroot](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
+as a rustup toolchain and then use that to build the standalone benchmark with a modified standard library.
 
-[Currently](https://github.com/rust-lang/rust/issues/101691) there is no convenient way to invoke a stage0 toolchain with
-a modified standard library. To avoid the compiler rebuild it can be useful to not only extract the benchmark but also
+If the std rebuild times are too long for fast iteration it can be useful to not only extract the benchmark but also
 the code under test into a separate crate.
 
 ## Running under perf-record