Addresses two classes of icache thrash present in the interrupt service
path, e.g.:
```asm
let mut prios = [0u128; 16];
40380d44: ec840513 addi a0,s0,-312
40380d48: 10000613 li a2,256
40380d4c: ec840b93 addi s7,s0,-312
40380d50: 4581 li a1,0
40380d52: 01c85097 auipc ra,0x1c85
40380d56: 11e080e7 jalr 286(ra) # 42005e70 <memset>
```
and
```asm
prios
40380f9c: dc840513 addi a0,s0,-568
40380fa0: ec840593 addi a1,s0,-312
40380fa4: 10000613 li a2,256
40380fa8: dc840493 addi s1,s0,-568
40380fac: 01c85097 auipc ra,0x1c85
40380fb0: eae080e7 jalr -338(ra) # 42005e5a <memcpy>
```
As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) placed in SRAM: `rustc` is quite happy to emit lots of
implicit calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. The
expectation seems to be that the compiler-builtins versions are
"reasonable defaults," and they are [weakly linked] specifically to
allow the kind of domain-specific override done here.
In the context of the 'c3, this ends up producing a fairly large
implementation that adds frequent cache pressure for minimal wins:
```readelf
Num: Value Size Type Bind Vis Ndx Name
27071: 42005f72 22 FUNC LOCAL HIDDEN 3 memcpy
27072: 42005f88 22 FUNC LOCAL HIDDEN 3 memset
28853: 42005f9e 186 FUNC LOCAL HIDDEN 3 compiler_builtins::mem::memcpy
28854: 42006058 110 FUNC LOCAL HIDDEN 3 compiler_builtins::mem::memset
```
NB: these implementations are broken when performing unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.
Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline,
which seems to predict all branches as "never taken."
However, there is a real danger in overtraining on the microbenchmarks
here, too, as I would expect almost no one has a program whose runtime
is dominated by these functions. Making them larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing
optimizations like dead-code elimination for always-aligned addresses
and automatic loop unrolling.
[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)