x86_64 memcmp may recurse infinitely without SSE #470


Closed
Demindiro opened this issue Jun 10, 2022 · 3 comments · Fixed by #471

Comments

@Demindiro
Contributor

While looking at the assembly of my kernel (which is built with SSE disabled), I noticed that memcmp recurses infinitely when a string of 32 bytes or longer is passed:

...
ffff800000029d84:       48 83 fa 20             cmp    rdx,0x20
ffff800000029d88:       0f 82 ae 00 00 00       jb     ffff800000029e3c <compiler_builtins::mem::memcmp+0xcc>
ffff800000029d8e:       48 89 54 24 10          mov    QWORD PTR [rsp+0x10],rdx
ffff800000029d93:       49 89 d6                mov    r14,rdx
ffff800000029d96:       49 c1 ee 05             shr    r14,0x5
ffff800000029d9a:       49 c1 e6 05             shl    r14,0x5
ffff800000029d9e:       4a 8d 04 33             lea    rax,[rbx+r14*1]
ffff800000029da2:       48 89 44 24 08          mov    QWORD PTR [rsp+0x8],rax
ffff800000029da7:       31 ed                   xor    ebp,ebp
ffff800000029da9:       4c 8d 64 24 18          lea    r12,[rsp+0x18]
ffff800000029dae:       4c 8d 6c 24 38          lea    r13,[rsp+0x38]
ffff800000029db3:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
ffff800000029dba:       00 00 00 
ffff800000029dbd:       0f 1f 00                nop    DWORD PTR [rax]
        tmp.assume_init()
ffff800000029dc0:       48 8b 44 2b 18          mov    rax,QWORD PTR [rbx+rbp*1+0x18]
ffff800000029dc5:       48 89 44 24 30          mov    QWORD PTR [rsp+0x30],rax
ffff800000029dca:       48 8b 44 2b 10          mov    rax,QWORD PTR [rbx+rbp*1+0x10]
ffff800000029dcf:       48 89 44 24 28          mov    QWORD PTR [rsp+0x28],rax
ffff800000029dd4:       48 8b 04 2b             mov    rax,QWORD PTR [rbx+rbp*1]
ffff800000029dd8:       48 8b 4c 2b 08          mov    rcx,QWORD PTR [rbx+rbp*1+0x8]
ffff800000029ddd:       48 89 4c 24 20          mov    QWORD PTR [rsp+0x20],rcx
ffff800000029de2:       48 89 44 24 18          mov    QWORD PTR [rsp+0x18],rax
ffff800000029de7:       49 8b 44 2f 18          mov    rax,QWORD PTR [r15+rbp*1+0x18]
ffff800000029dec:       48 89 44 24 50          mov    QWORD PTR [rsp+0x50],rax
ffff800000029df1:       49 8b 44 2f 10          mov    rax,QWORD PTR [r15+rbp*1+0x10]
ffff800000029df6:       48 89 44 24 48          mov    QWORD PTR [rsp+0x48],rax
ffff800000029dfb:       49 8b 04 2f             mov    rax,QWORD PTR [r15+rbp*1]
ffff800000029dff:       49 8b 4c 2f 08          mov    rcx,QWORD PTR [r15+rbp*1+0x8]
ffff800000029e04:       48 89 4c 24 40          mov    QWORD PTR [rsp+0x40],rcx
ffff800000029e09:       48 89 44 24 38          mov    QWORD PTR [rsp+0x38],rax
ffff800000029e0e:       ba 20 00 00 00          mov    edx,0x20
ffff800000029e13:       4c 89 e7                mov    rdi,r12
ffff800000029e16:       4c 89 ee                mov    rsi,r13
ffff800000029e19:       e8 72 fe ff ff          call   ffff800000029c90 <memcmp>
...
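For context, here is a sketch of the chunked-compare cascade this assembly comes from. The helper names cmp/c16/c32 match src/mem/x86_64.rs (see the diff below), but the body here is an illustrative reconstruction, not the crate's exact code:

use core::mem::size_of;

// Each level of the cascade compares chunks of size_of::<T>() bytes and
// defers the first mismatching chunk and the tail to the next-smaller
// level `f`.
unsafe fn cmp<T, U, F>(mut a: *const T, mut b: *const T, n: usize, f: F) -> i32
where
    T: Copy + Eq,
    F: FnOnce(*const U, *const U, usize) -> i32,
{
    let end = a.add(n / size_of::<T>());
    while a != end {
        // For T = [u128; 2] this `!=` is the comparison that, without SSE,
        // LLVM lowers back into a call to memcmp, hence the recursion.
        if a.read_unaligned() != b.read_unaligned() {
            return f(a.cast(), b.cast(), size_of::<T>());
        }
        a = a.add(1);
        b = b.add(1);
    }
    f(a.cast(), b.cast(), n % size_of::<T>())
}

compare_bytes enters the cascade at c32, so the [u128; 2] mismatch check sits on the hot path for every n >= 32.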

This seems to be caused by the comparison of [u128; 2] values, which internally calls core::intrinsics::raw_eq. The documentation states:

> Above some backend-decided threshold this will emit calls to memcmp, like slice equality does, instead of causing massive code size.
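As a minimal illustration (a sketch, not taken from this issue; the name eq32 is made up), a plain == on [u128; 2] is enough to trigger this lowering when building for x86_64 with SSE disabled:

#[no_mangle]
pub fn eq32(a: &[u128; 2], b: &[u128; 2]) -> bool {
    // Array equality over u128 lowers to core::intrinsics::raw_eq, which
    // LLVM may expand into a call to memcmp above its size threshold.
    a == b
}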

Replacing it with (u128, u128) does not cause a call to memcmp to be emitted, though it severely impacts performance (~2x slower):

[u128; 2]
test memcmp_rust_1048576              ... bench:      24,756 ns/iter (+/- 238) = 42356 MB/s
test memcmp_rust_16                   ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_32                   ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_4096                 ... bench:         111 ns/iter (+/- 0) = 36900 MB/s
test memcmp_rust_64                   ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_8                    ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      27,997 ns/iter (+/- 11,037) = 37453 MB/s
test memcmp_rust_unaligned_15         ... bench:           3 ns/iter (+/- 1) = 5333 MB/s
test memcmp_rust_unaligned_31         ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_unaligned_4095       ... bench:         101 ns/iter (+/- 2) = 40554 MB/s
test memcmp_rust_unaligned_63         ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_unaligned_7          ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
(u128, u128)
test memcmp_rust_1048576              ... bench:      51,088 ns/iter (+/- 782) = 20524 MB/s
test memcmp_rust_16                   ... bench:           4 ns/iter (+/- 0) = 4000 MB/s
test memcmp_rust_32                   ... bench:           5 ns/iter (+/- 0) = 6400 MB/s
test memcmp_rust_4096                 ... bench:         170 ns/iter (+/- 3) = 24094 MB/s
test memcmp_rust_64                   ... bench:           6 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_8                    ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      56,451 ns/iter (+/- 832) = 18574 MB/s
test memcmp_rust_unaligned_15         ... bench:           4 ns/iter (+/- 0) = 4000 MB/s
test memcmp_rust_unaligned_31         ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_unaligned_4095       ... bench:         198 ns/iter (+/- 7) = 20686 MB/s
test memcmp_rust_unaligned_63         ... bench:           6 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_unaligned_7          ... bench:           4 ns/iter (+/- 0) = 2000 MB/s
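For reference, a sketch of why the tuple variant dodges the intrinsic (the function name is illustrative): the derived PartialEq for (u128, u128) compares field by field, so rustc emits two 16-byte compares instead of a single raw_eq over 32 bytes and no memcmp call is generated, at the cost of the ~2x slowdown measured above.

unsafe fn tuple_chunk_eq(a: *const (u128, u128), b: *const (u128, u128)) -> bool {
    let x = a.read_unaligned();
    let y = b.read_unaligned();
    // Equivalent to `x == y`: tuple equality is field-by-field, so no
    // single 32-byte raw_eq comparison is ever produced.
    x.0 == y.0 && x.1 == y.1
}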

Leaving out c32() entirely has a lesser but still significant performance impact (~1.3x slower), though comparisons of smaller strings seem to benefit:

No c32()
test memcmp_rust_1048576              ... bench:      31,405 ns/iter (+/- 426) = 33388 MB/s
test memcmp_rust_16                   ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_32                   ... bench:           3 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_4096                 ... bench:         147 ns/iter (+/- 49) = 27863 MB/s
test memcmp_rust_64                   ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_8                    ... bench:           2 ns/iter (+/- 0) = 4000 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      31,616 ns/iter (+/- 2,652) = 33165 MB/s
test memcmp_rust_unaligned_15         ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_unaligned_31         ... bench:           3 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_unaligned_4095       ... bench:         139 ns/iter (+/- 35) = 29467 MB/s
test memcmp_rust_unaligned_63         ... bench:           4 ns/iter (+/- 0) = 16000 MB/s
test memcmp_rust_unaligned_7          ... bench:           2 ns/iter (+/- 0) = 4000 MB/s

Given that targets with SSE(2) do not seem to generate a call to memcmp even in debug mode, it may be worth keeping c32() when that feature is present. I don't know whether it is worth the gamble, though.

diff --git a/src/mem/x86_64.rs b/src/mem/x86_64.rs
index 4d2f6e5..e295e89 100644
--- a/src/mem/x86_64.rs
+++ b/src/mem/x86_64.rs
@@ -143,5 +143,9 @@ pub unsafe fn compare_bytes(a: *const u8, b: *const u8, n: usize) -> i32 {
     let c8 = |a: *const u64, b, n| cmp(a, b, n, c4);
     let c16 = |a: *const u128, b, n| cmp(a, b, n, c8);
     let c32 = |a: *const [u128; 2], b, n| cmp(a, b, n, c16);
-    c32(a.cast(), b.cast(), n)
+    if cfg!(target_feature = "sse2") {
+        c32(a.cast(), b.cast(), n)
+    } else {
+        c16(a.cast(), b.cast(), n)
+    }
 }
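Note that cfg!(target_feature = "sse2") is resolved at compile time for the target compiler_builtins is being built for, so the branch is decided statically and the untaken path is optimized out entirely; no runtime feature detection is involved.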
Demindiro changed the title from "x86_64 memcmp may recurse infinitely" to "x86_64 memcmp may recurse infinitely without SSE" on Jun 10, 2022
@Amanieu
Member

Amanieu commented Jun 10, 2022

I think it's fine to add the code that you suggest, with a clear comment that explains why it was done and that it depends on an internal implementation detail of LLVM. Fundamentally I don't think we have a way to guarantee that LLVM isn't calling the same builtin function recursively.

@bjorn3
Member

bjorn3 commented Jun 10, 2022

This crate is marked with #![no_builtins] which afaik should prevent LLVM from emitting any intrinsic calls, including memcmp calls. I think this recursion should be considered a bug in LLVM.

@Demindiro
Contributor Author

> This crate is marked with #![no_builtins] which afaik should prevent LLVM from emitting any intrinsic calls, including memcmp calls. I think this recursion should be considered a bug in LLVM.

I think the culprit is rustc itself, since it explicitly emits a call to memcmp when lowering raw_eq: https://github.com/rust-lang/rust/blob/c84594661c1b51feb539b479b58bb551fcf8e19a/compiler/rustc_codegen_llvm/src/intrinsic.rs#L331
