Commit a0adbc4

add a page on optimizations and profiling
1 parent 737cdb1 commit a0adbc4

2 files changed: +118 -0 lines changed

src/SUMMARY.md

+1
@@ -12,6 +12,7 @@
 ---

 - [Building and debugging libraries](./development/building-and-debugging.md)
+- [Performance optimizations and benchmarking](./development/perf-benchmarking.md)


 ---

src/development/perf-benchmarking.md

+117
@@ -0,0 +1,117 @@
# Library optimizations and benchmarking

Recommended reading: [The Rust performance book](https://nnethercote.github.io/perf-book/title-page.html)

## What to optimize

Prefer optimizing code that shows up as significant in real-world workloads.
E.g. it's more beneficial to speed up `[T]::sort` than it is to shave off a small allocation in `Command::spawn`
because the latter is dominated by its syscall cost.

Issues about slow library code are labeled as [I-slow T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-slow+label%3AT-libs)
and those about code size as [I-heavy T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-heavy+label%3AT-libs).

## Vectorization

Currently explicit SIMD features can't be used in `alloc` or `core` because runtime feature detection is only available in `std`
and both crates are compiled with each target's baseline feature set.

Vectorization can only be achieved by shaping code in a way that the compiler backend's auto-vectorization passes can understand.
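
As a rough illustration of what "shaping code" can mean, the sketch below shows two commonly used patterns: re-slicing to a common length so bounds checks can be hoisted, and accumulating into fixed-size lanes via `chunks_exact`. The function names and the lane count are hypothetical and not taken from the standard library. Whether the backend actually vectorizes either loop depends on the target, optimization level and compiler version, so the generated assembly should always be checked.

```rust
/// Adds `src` into `dst` element-wise, wrapping on overflow.
///
/// Re-slicing both inputs to a common length before the loop gives the
/// optimizer a chance to hoist the bounds checks out of the loop body.
pub fn add_assign_slices(dst: &mut [u32], src: &[u32]) {
    let len = dst.len().min(src.len());
    let (dst, src) = (&mut dst[..len], &src[..len]);
    for i in 0..len {
        dst[i] = dst[i].wrapping_add(src[i]);
    }
}

/// Sums a slice using several independent accumulators, wrapping on overflow.
///
/// `chunks_exact` yields fixed-size chunks, so the inner loop has a constant
/// trip count and there is no data dependency between the lanes.
pub fn sum_lanes(data: &[u32]) -> u32 {
    // Hypothetical lane count, not tuned for any particular target.
    const LANES: usize = 8;
    let mut acc = [0u32; LANES];
    let chunks = data.chunks_exact(LANES);
    let tail = chunks.remainder();
    for chunk in chunks {
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a = a.wrapping_add(x);
        }
    }
    // Combine the lanes and the leftover elements with scalar code.
    let mut total = 0u32;
    for &x in acc.iter().chain(tail) {
        total = total.wrapping_add(x);
    }
    total
}
```

Both functions use wrapping arithmetic so that the lane-wise summation gives the same result as a plain left-to-right sum regardless of the order of additions.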

## rustc-perf

For parts of the standard library that are heavily used by rustc itself it can be convenient to use
[the benchmark server](https://github.com/rust-lang/rustc-perf/tree/master/collector#benchmarking).

Since it only measures compile-time but not runtime performance of crates, it can't be used to benchmark features
that aren't used by the compiler, e.g. floating point code, linked lists, mpsc channels, etc.
For those, explicit benchmarks must be written or extracted from real-world code.

## Built-in microbenchmarks

The built-in benchmarks use [cargo bench](https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html)
and can be found in the `benches` directory for `core` and `alloc` and in `test` modules in `std`.

The benchmarks are automatically run in a loop by `Bencher::iter` to average the runtime over many loop iterations.
For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds.
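
To make the averaging idea concrete, here is a hand-rolled sketch that uses only stable APIs (`std::time::Instant` and `std::hint::black_box`). It is not how `Bencher::iter` is implemented, and the helper name `average_iteration_time` is made up for this example. It only shows why a very short kernel has to be timed over many iterations to get a usable number.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `iterations` times and returns the mean wall-time per call.
///
/// `black_box` keeps the optimizer from discarding the result of `f`,
/// which would otherwise let it delete the work being measured.
pub fn average_iteration_time<T>(iterations: u32, mut f: impl FnMut() -> T) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(f());
    }
    // `Duration` implements `Div<u32>`, giving the per-iteration average.
    start.elapsed() / iterations
}
```

A caller would pass the kernel as a closure, for example `average_iteration_time(10_000, || data.iter().sum::<u64>())`. The real harness additionally picks the iteration count itself and reports the variation between samples.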

A specific benchmark can be invoked without recompiling rustc
via `./x bench library/<lib> --stage 0 --test-args <benchmark name>`.

`cargo bench` measures wall-time. This is often good enough, but small changes such as saving a few instructions
in a bigger function can get drowned out by system noise. In such cases the following changes can make runs more
reproducible:

* disable incremental builds in `config.toml`
* build std and the benchmarks with `RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"`
* ensure the system is as idle as possible
* [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html)
* [pin](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core
* [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting),
  especially on thermal-limited systems such as laptops

## Standalone tests

If `x` or the cargo benchmark harness gets in the way it can be useful to extract the benchmark into a separate crate,
e.g. to run it under `perf stat` or cachegrind.

Build and link the [stage1](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
compiler as a rustup toolchain and then use that to build the standalone benchmark with a modified standard library.

[Currently](https://github.com/rust-lang/rust/issues/101691) there is no convenient way to invoke a stage0 toolchain with
a modified standard library. To avoid the compiler rebuild it can be useful to not only extract the benchmark but also
the code under test into a separate crate.
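
A standalone benchmark crate along these lines might look like the following sketch. Everything in it is hypothetical: `kernel_under_test` stands in for whatever code was copied out of the library (here it simply calls `sort_unstable`), and `make_input` uses a small linear congruential generator so that every run sees the same data without pulling in a dependency.

```rust
use std::hint::black_box;

/// Placeholder for the code extracted from the standard library.
pub fn kernel_under_test(data: &mut [u64]) {
    data.sort_unstable();
}

/// Builds a deterministic pseudo-random input so runs are comparable.
pub fn make_input(len: usize) -> Vec<u64> {
    let mut state = 0x2545F4914F6CDD1Du64;
    (0..len)
        .map(|_| {
            // Simple LCG step; statistical quality does not matter here.
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            state >> 33
        })
        .collect()
}

/// Runs the kernel `rounds` times on fresh copies of the same input and
/// returns a checksum so the work cannot be optimized away.
pub fn run(rounds: usize, len: usize) -> u64 {
    assert!(len > 0, "input must not be empty");
    let input = make_input(len);
    let mut checksum = 0u64;
    for _ in 0..rounds {
        let mut data = black_box(input.clone());
        kernel_under_test(&mut data);
        checksum = checksum.wrapping_add(black_box(data)[len / 2]);
    }
    checksum
}
```

A `main` function that calls `run` with a fixed round count and prints the checksum is enough to give `perf stat` or cachegrind a plain binary to work with. Note that the `clone` of the input is part of the measured work in this sketch.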

## Running under perf-record

If extracting the code into a separate crate is impractical, one can first build the benchmark and then run it again
under `perf record` and then drill down to the benchmark kernel with `perf report`.

```terminal,ignore
# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

# build benchmark without running it
$ ./x bench --stage 0 library/core/ --test-args skipallbenches

# run the benchmark under perf
$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name>
$ perf report
```

Renaming `perf.data` keeps it from getting overwritten by subsequent runs, so it can later be compared
against runs with a modified library using `perf diff`.

## Comparing assembly

While `perf report` shows assembly of the benchmark code it can sometimes be difficult to get a good overview of what
changed, especially when multiple benchmarks were affected. As an alternative one can extract and diff the assembly
directly from the benchmark suite.

```terminal,ignore
# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

# build benchmark libs
$ ./x bench --stage 0 library/core/ --test-args skipallbenches

# this should print something like the following
Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a)

# get the assembly for all the benchmarks
$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \
build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \
| rustfilt > baseline.asm

# switch to the branch with the changes
$ git switch feature-branch

# repeat the procedure above
$ ./x bench ...
$ objdump ... > changes.asm

# compare output
$ kdiff3 baseline.asm changes.asm
```

This can also be applied to standalone benchmarks.
