Commit b61d0a2

Merge pull request #45 from the8472/perf-docs
add a page on optimizations and profiling
2 parents 28e8f60 + 722cb2f commit b61d0a2

2 files changed

+120
-0
lines changed

src/SUMMARY.md

+1
@@ -12,6 +12,7 @@
1212
---
1313

1414
- [Building and debugging libraries](./development/building-and-debugging.md)
15+
- [Performance optimizations and benchmarking](./development/perf-benchmarking.md)
1516

1617

1718
---

src/development/perf-benchmarking.md

+119
@@ -0,0 +1,119 @@
1+
# Library optimizations and benchmarking
2+
3+
Recommended reading: [The Rust performance book](https://nnethercote.github.io/perf-book/title-page.html)
4+
5+
## What to optimize
6+
7+
Prefer optimizing library code that accounts for a significant share of the runtime of real-world programs.
8+
E.g. it's more beneficial to speed up `[T]::sort` than it is to shave off a small allocation in `Command::spawn`
9+
because the latter is dominated by its syscall cost.
10+
11+
Issues about slow library code are labeled as [I-slow T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-slow+label%3AT-libs)
12+
and those about code size as [I-heavy T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-heavy+label%3AT-libs).
13+
14+
## Vectorization
15+
16+
Currently only baseline target features (e.g. SSE2 on x86_64-unknown-linux-gnu) can be used in core and alloc because
17+
runtime feature-detection is only available in std.
18+
Where possible, the preferred way to achieve vectorization is to shape code in a way that the compiler
19+
backend's auto-vectorization passes can understand. This benefits user crates compiled with additional target features
20+
when they instantiate generic library functions, e.g. iterators.
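
As an illustration of what "shaping code" can mean, the following is a minimal sketch, not code taken from the standard library. The function name `sum_lanes` and the lane count of 8 are assumptions made for this example. It keeps several independent accumulators so that the floating-point additions in the inner loop do not depend on each other. Note that this changes the summation order compared to a plain sequential sum, and whether the backend actually emits SIMD instructions depends on the target and compiler version, so the generated assembly should be checked.

```rust
/// Sums `data` using `LANES` independent accumulators.
///
/// Hypothetical example: `LANES` is an arbitrary choice, not a value used
/// by the standard library.
pub fn sum_lanes(data: &[f32]) -> f32 {
    const LANES: usize = 8;
    let mut acc = [0.0f32; LANES];

    // Iterate over full chunks of exactly LANES elements.
    let mut chunks = data.chunks_exact(LANES);
    for chunk in &mut chunks {
        // Each accumulator only depends on its own previous value,
        // which gives the optimizer a straightforward lane-wise loop.
        for (a, &x) in acc.iter_mut().zip(chunk) {
            *a += x;
        }
    }

    // Elements that did not fill a whole chunk are summed sequentially.
    let tail: f32 = chunks.remainder().iter().sum();
    acc.iter().sum::<f32>() + tail
}
```

Because the function is compiled in the crate that uses it when it is generic or inlined, a user crate built with additional target features can get wider vectors from the same source.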
21+
22+
## rustc-perf
23+
24+
For parts of the standard library that are heavily used by rustc itself it can be convenient to use
25+
[the benchmark server](https://github.com/rust-lang/rustc-perf/tree/master/collector#benchmarking).
26+
27+
Since it only measures compile-time but not runtime performance of crates, it can't be used to benchmark features
28+
that aren't used by the compiler, e.g. floating point code, linked lists, mpsc channels, etc.
29+
For those, explicit benchmarks must be written or extracted from real-world code.
30+
31+
## Built-in Microbenchmarks
32+
33+
The built-in benchmarks use [cargo bench](https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html)
34+
and can be found in the `benches` directory for `core` and `alloc` and in `test` modules in `std`.
35+
36+
The benchmarks are automatically run in a loop by `Bencher::iter` to average the runtime over many loop iterations.
37+
For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds.
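
The averaging idea can be sketched on stable Rust as follows. This is a simplified stand-in and not the implementation of `Bencher::iter`: the helper name `mean_time_per_iter` is made up for this example, and the iteration count is passed in by the caller instead of being chosen by the harness.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `routine` `iterations` times and returns the mean wall-time
/// per iteration. Hypothetical helper for illustration only.
pub fn mean_time_per_iter<T>(iterations: u32, mut routine: impl FnMut() -> T) -> Duration {
    assert!(iterations > 0, "need at least one iteration");
    let start = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimizer from discarding the unused result.
        black_box(routine());
    }
    // Dividing the total by the iteration count averages out the cost of
    // reading the clock, which matters when one iteration takes nanoseconds.
    start.elapsed() / iterations
}
```

Timing many iterations together is what makes nanosecond-scale routines measurable at all, since a single clock read can cost as much as the routine itself.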
38+
39+
A specific benchmark can be invoked without recompiling rustc
40+
via `./x bench library/<lib> --stage 0 --test-args <benchmark name>`.
41+
42+
`cargo bench` measures wall-time. This often is good enough, but small changes such as saving a few instructions
43+
in a bigger function can get drowned out by system noise. In such cases the following changes can make runs more
44+
reproducible:
45+
46+
* disable incremental builds in `config.toml`
47+
* build std and the benchmarks with `RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"`
48+
* ensure the system is as idle as possible
49+
* [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html)
50+
* [pin](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core
51+
* change the CPU [scaling governor](https://wiki.archlinux.org/title/CPU_frequency_scaling#Scaling_governors)
52+
to a fixed-frequency one (`performance` or `powersave`)
53+
* [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting),
54+
especially on thermal-limited systems such as laptops
55+
56+
## Standalone tests
57+
58+
If `x` or the cargo benchmark harness gets in the way, it can be useful to extract the benchmark into a separate crate,
59+
e.g. to run it under `perf stat` or cachegrind.
60+
61+
Build the standard library and link [stage0-sysroot](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
62+
as a rustup toolchain, then use that toolchain to build the standalone benchmark against the modified standard library.
63+
64+
If the std rebuild times are too long for fast iteration it can be useful to not only extract the benchmark but also
65+
the code under test into a separate crate.
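
A standalone crate for this purpose can be very small. The sketch below assumes a hypothetical `kernel` function standing in for the code under test and a hypothetical driver `run_fixed_work`; neither name comes from the standard library. The driver does a fixed amount of work rather than timing itself, so that an external tool such as `perf stat` or cachegrind does the measuring.

```rust
use std::hint::black_box;

/// Hypothetical code under test, copied out of the library for fast iteration.
pub fn kernel(input: &[u64]) -> u64 {
    input
        .iter()
        .fold(0u64, |acc, &x| acc.wrapping_mul(31).wrapping_add(x))
}

/// Calls `kernel` a fixed number of times so that external profilers
/// see the same amount of work on every run.
pub fn run_fixed_work(input: &[u64], repetitions: usize) -> u64 {
    let mut checksum = 0u64;
    for _ in 0..repetitions {
        // black_box on the input discourages the compiler from hoisting
        // the call out of the loop or precomputing the result.
        checksum ^= kernel(black_box(input));
    }
    // Returning (and later printing) the checksum keeps the work observable.
    checksum
}
```

Calling `run_fixed_work` from the crate's `main` and printing the returned checksum gives a binary that can be run directly under the profiler of choice.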
66+
67+
## Running under perf-record
68+
69+
If extracting the code into a separate crate is impractical, one can first build the benchmark without running it, then run it
70+
under `perf record` and then drill down to the benchmark kernel with `perf report`.
71+
72+
```terminal,ignore
73+
# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
74+
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"
75+
76+
# build benchmark without running it
77+
$ ./x bench --stage 0 library/core/ --test-args skipallbenches
78+
79+
# run the benchmark under perf
80+
$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name>
81+
$ perf report
82+
```
83+
84+
If `perf.data` is renamed so that subsequent runs don't overwrite it, it can later be compared to runs with
85+
a modified library with `perf diff`.
86+
87+
## Comparing assembly
88+
89+
While `perf report` shows assembly of the benchmark code it can sometimes be difficult to get a good overview of what
90+
changed, especially when multiple benchmarks were affected. As an alternative one can extract and diff the assembly
91+
directly from the benchmark suite.
92+
93+
```terminal,ignore
94+
# 1CGU to reduce inlining changes and code reorderings, debuginfo for source annotations
95+
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"
96+
97+
# build benchmark libs
98+
$ ./x bench --stage 0 library/core/ --test-args skipallbenches
99+
100+
# this should print something like the following
101+
Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a)
102+
103+
# get the assembly for all the benchmarks
104+
$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \
105+
build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \
106+
| rustfilt > baseline.asm
107+
108+
# switch to the branch with the changes
109+
$ git switch feature-branch
110+
111+
# repeat the procedure above
112+
$ ./x bench ...
113+
$ objdump ... > changes.asm
114+
115+
# compare output
116+
$ kdiff3 baseline.asm changes.asm
117+
```
118+
119+
This can also be applied to standalone benchmarks.
