Addresses two classes of icache thrash present in the interrupt service
path, e.g.:
```asm
let mut prios = [0u128; 16];
40380d44: ec840513 addi a0,s0,-312
40380d48: 10000613 li a2,256
40380d4c: ec840b93 addi s7,s0,-312
40380d50: 4581 li a1,0
40380d52: 01c85097 auipc ra,0x1c85
40380d56: 11e080e7 jalr 286(ra) # 42005e70 <memset>
```
and
```asm
prios
40380f9c: dc840513 addi a0,s0,-568
40380fa0: ec840593 addi a1,s0,-312
40380fa4: 10000613 li a2,256
40380fa8: dc840493 addi s1,s0,-568
40380fac: 01c85097 auipc ra,0x1c85
40380fb0: eae080e7 jalr -338(ra) # 42005e5a <memcpy>
```
As an added bonus, performance of the whole program improves
dramatically with these routines 1) reimplemented for the esp32 RISC-V
µarch and 2) placed in SRAM: `rustc` is quite happy to emit lots of
implicit calls to these functions, and the versions that ship with
compiler-builtins are [highly tuned] for other platforms. The
expectation seems to be that the compiler-builtins versions are
"reasonable defaults," and they are [weakly linked] specifically to
allow the kind of domain-specific override done here.
In the context of the 'c3, this ends up producing a fairly large
implementation that adds frequent cache pressure for minimal wins:
```readelf
Num: Value Size Type Bind Vis Ndx Name
27071: 42005f72 22 FUNC LOCAL HIDDEN 3 memcpy
27072: 42005f88 22 FUNC LOCAL HIDDEN 3 memset
28853: 42005f9e 186 FUNC LOCAL HIDDEN 3 compiler_builtins::mem::memcpy
28854: 42006058 110 FUNC LOCAL HIDDEN 3 compiler_builtins::mem::memset
```
NB: these implementations are broken when performing unaligned
loads/stores across the instruction bus; at least in my testing this
hasn't been a problem, because they are simply never invoked in that
context.
Additionally, these are just about the simplest possible
implementations, with word-sized copies being the only concession made
to runtime performance. Even a small amount of additional effort would
probably yield fairly massive wins, as three- or four-instruction hot
loops like these are basically pathological for the 'c3's pipeline,
which seems to predict all branches as "never taken."
However, there is a real danger in overtraining on the microbenchmarks
here, too, as I would expect almost no one has a program whose runtime
is dominated by these functions. Making them larger and more complex to
eke out wins from architectural niches makes LLVM much less willing to
inline them, costing additional function calls and preventing
optimizations like dead-code elimination for always-aligned addresses
and automatic loop unrolling.
[highly tuned]: rust-lang/compiler-builtins#405
[weakly linked]: rust-lang/compiler-builtins#339 (comment)