add a page on optimizations and profiling #45 (merged)

# Library optimizations and benchmarking

Recommended reading: [The Rust performance book](https://nnethercote.github.io/perf-book/title-page.html)

## What to optimize

Prefer optimizing code that shows up as significant in real-world workloads. For example, it is more beneficial
to speed up `[T]::sort` than to shave a small allocation off `Command::spawn`, because the latter is dominated
by its syscall cost.

Issues about slow library code are labeled [I-slow T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-slow+label%3AT-libs),
and those about code size are labeled [I-heavy T-libs](https://github.com/rust-lang/rust/issues?q=is%3Aissue+is%3Aopen+label%3AI-heavy+label%3AT-libs).

## Vectorization

Currently only baseline target features (e.g. SSE2 on x86_64-unknown-linux-gnu) can be used in `core` and `alloc`
because runtime feature detection is only available in `std`.
Where possible, the preferred way to achieve vectorization is to shape the code so that the compiler
backend's auto-vectorization passes can understand it. This also benefits user crates compiled with additional
target features when they instantiate generic library functions, e.g. iterators.

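As an illustration, here is a hypothetical helper (not code from the library) whose loop shape the auto-vectorizer can handle: eight independent accumulators avoid the sequential dependency chain that a single `f32` accumulator would create, since the backend may not reorder floating-point additions.

```rust
/// Hypothetical example: eight independent accumulators let the
/// auto-vectorizer map the inner loop onto SIMD lanes; summing into a
/// single `f32` would force a strict left-to-right addition order.
fn sum_f32(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; 8];
    let chunks = xs.chunks_exact(8);
    // `remainder()` returns the tail that didn't fill a whole chunk
    let remainder = chunks.remainder();
    for chunk in chunks {
        for i in 0..8 {
            acc[i] += chunk[i];
        }
    }
    acc.iter().sum::<f32>() + remainder.iter().sum::<f32>()
}
```

Whether a given loop actually vectorizes still depends on the backend; inspecting the generated assembly (see below) is the way to confirm.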
## rustc-perf

For parts of the standard library that are heavily used by rustc itself, it can be convenient to use
[the benchmark server](https://github.com/rust-lang/rustc-perf/tree/master/collector#benchmarking).

Since it only measures compile-time, not runtime performance of crates, it can't be used to benchmark features
that aren't exercised by the compiler, e.g. floating-point code, linked lists, mpsc channels, etc.
For those, explicit benchmarks must be written or extracted from real-world code.

## Built-in microbenchmarks

The built-in benchmarks use [cargo bench](https://doc.rust-lang.org/nightly/unstable-book/library-features/test.html)
and can be found in the `benches` directory for `core` and `alloc` and in `test` modules in `std`.

The benchmarks are executed in a loop by `Bencher::iter`, which averages the runtime over many iterations.
For CPU-bound microbenchmarks the runtime of a single iteration should be in the range of nano- to microseconds.

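Conceptually, `Bencher::iter` works like the following sketch (heavily simplified; the real harness also calibrates the iteration count and reports statistics):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Simplified sketch of what `Bencher::iter` does: run the closure many
// times and divide, so per-iteration costs in the nano- to microsecond
// range rise above timer granularity.
fn time_per_iter<F: FnMut()>(mut f: F, iters: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        f();
    }
    start.elapsed() / iters
}

fn main() {
    let v: Vec<u32> = (0..1000).rev().collect();
    let per_iter = time_per_iter(
        || {
            let mut w = v.clone();
            w.sort();
            // black_box keeps the optimizer from removing the work
            black_box(&w);
        },
        1_000,
    );
    println!("{per_iter:?} per iteration");
}
```

For real measurements, use the actual `#[bench]` harness; this sketch only illustrates why a single iteration is too noisy to time directly.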
A specific benchmark can be invoked without recompiling rustc
via `./x bench library/<lib> --stage 0 --test-args <benchmark name>`.

`cargo bench` measures wall-time. This is often good enough, but small changes, such as saving a few instructions
in a larger function, can be drowned out by system noise. In such cases the following steps can make runs more
reproducible:

* disable incremental builds in `config.toml`
* build std and the benchmarks with `RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1"`
* ensure the system is as idle as possible
* [disable ASLR](https://man7.org/linux/man-pages/man8/setarch.8.html)
* [pin](https://man7.org/linux/man-pages/man1/taskset.1.html) the benchmark process to a specific core
* change the CPU [scaling governor](https://wiki.archlinux.org/title/CPU_frequency_scaling#Scaling_governors)
  to a fixed-frequency one (`performance` or `powersave`)
* [disable clock boosts](https://wiki.archlinux.org/title/CPU_frequency_scaling#Configuring_frequency_boosting),
  especially on thermally limited systems such as laptops

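On Linux, the last four items can be applied with commands along these lines. This is a sketch: exact paths and tools vary by distribution and CPU driver (the `no_turbo` knob shown here is specific to `intel_pstate`, for example), and the governor and boost changes require root.

```terminal,ignore
# disable ASLR for a single run
$ setarch "$(uname -m)" -R ./my-benchmark

# pin the process to core 2
$ taskset -c 2 ./my-benchmark

# switch to the performance governor
$ sudo cpupower frequency-set -g performance

# disable frequency boosting (interface differs between CPU drivers)
$ echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
```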
## Standalone tests

If `x` or the cargo benchmark harness gets in the way, it can be useful to extract the benchmark into a separate crate,
e.g. to run it under `perf stat` or cachegrind.

Build the standard library, link the [stage0-sysroot](https://rustc-dev-guide.rust-lang.org/building/how-to-build-and-run.html#creating-a-rustup-toolchain)
as a rustup toolchain, and then use that toolchain to build the standalone benchmark against the modified standard library.

If the std rebuild times are too long for fast iteration, it can be useful to extract not only the benchmark but also
the code under test into a separate crate.

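Concretely, the linking step can look like the following. The sysroot path depends on the host triple, and `stage0` is an arbitrary toolchain name chosen here for illustration:

```terminal,ignore
# link the freshly built sysroot as a rustup toolchain
$ rustup toolchain link stage0 build/x86_64-unknown-linux-gnu/stage0-sysroot

# build and run the extracted benchmark with it
$ cargo +stage0 bench
```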
## Running under perf-record

If extracting the code into a separate crate is impractical, one can build the benchmark once, run it again
under `perf record`, and then drill down to the benchmark kernel with `perf report`.

```terminal,ignore
# 1 CGU to reduce inlining changes and code reordering, debuginfo for source annotations
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

# build the benchmarks without running them
$ ./x bench --stage 0 library/core/ --test-args skipallbenches

# run the benchmark under perf
$ perf record --call-graph dwarf -e instructions ./x bench --stage 0 library/core/ --test-args <benchmark name>
$ perf report
```

Rename `perf.data` to keep it from being overwritten by subsequent runs; it can then be compared against runs
with a modified library using `perf diff`.

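For example (the renamed filename is arbitrary):

```terminal,ignore
$ mv perf.data perf.data.baseline

# ...apply the library change, rebuild, rerun perf record...

# compare the two profiles
$ perf diff perf.data.baseline perf.data
```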
## Comparing assembly

While `perf report` shows the assembly of the benchmark code, it can sometimes be difficult to get a good overview of what
changed, especially when multiple benchmarks were affected. As an alternative, one can extract and diff the assembly
directly from the benchmark suite.

```terminal,ignore
# 1 CGU to reduce inlining changes and code reordering, debuginfo for source annotations
$ export RUSTFLAGS_BOOTSTRAP="-Ccodegen-units=1 -Cdebuginfo=2"

# build the benchmark libs
$ ./x bench --stage 0 library/core/ --test-args skipallbenches

# this should print something like the following
Running benches/lib.rs (build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a)

# get the assembly for all the benchmarks
$ objdump --source --disassemble --wide --no-show-raw-insn --no-addresses \
    build/x86_64-unknown-linux-gnu/stage0-std/x86_64-unknown-linux-gnu/release/deps/corebenches-2199e9a22e7b1f4a \
    | rustfilt > baseline.asm

# switch to the branch with the changes
$ git switch feature-branch

# repeat the procedure above
$ ./x bench ...
$ objdump ... > changes.asm

# compare output
$ kdiff3 baseline.asm changes.asm
```

This can also be applied to standalone benchmarks.
We can also add a link to something like https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/docs/src/linuxtips.md.
Some of those things may not be relevant to std benchmarks, which are mostly CPU- or memory-bandwidth-bound and single-threaded. They shouldn't suffer much from swap, IRQs, or SMT siblings if the system is kept mostly idle, since those depend on system activity (well, it depends on how many cores one has... maybe core isolation is still worth it).
Scheduling and throttling have the biggest impact in my experience. If we had a benchmark that tried to do a parallel sort on a huge dataset, that would be a different story.
Adjusting the scaling governor is a good point.