x86_64 memcmp may recurse infinitely without SSE #470


Closed
Demindiro opened this issue Jun 10, 2022 · 3 comments · Fixed by #471

Comments

@Demindiro
Contributor

While looking at the assembly of my kernel (which is built with SSE disabled), I noticed that memcmp recurses infinitely when a string of 32 bytes or longer is passed:

...
ffff800000029d84:       48 83 fa 20             cmp    rdx,0x20
ffff800000029d88:       0f 82 ae 00 00 00       jb     ffff800000029e3c <compiler_builtins::mem::memcmp+0xcc>
ffff800000029d8e:       48 89 54 24 10          mov    QWORD PTR [rsp+0x10],rdx
ffff800000029d93:       49 89 d6                mov    r14,rdx
ffff800000029d96:       49 c1 ee 05             shr    r14,0x5
ffff800000029d9a:       49 c1 e6 05             shl    r14,0x5
ffff800000029d9e:       4a 8d 04 33             lea    rax,[rbx+r14*1]
ffff800000029da2:       48 89 44 24 08          mov    QWORD PTR [rsp+0x8],rax
ffff800000029da7:       31 ed                   xor    ebp,ebp
ffff800000029da9:       4c 8d 64 24 18          lea    r12,[rsp+0x18]
ffff800000029dae:       4c 8d 6c 24 38          lea    r13,[rsp+0x38]
ffff800000029db3:       66 2e 0f 1f 84 00 00    nop    WORD PTR cs:[rax+rax*1+0x0]
ffff800000029dba:       00 00 00 
ffff800000029dbd:       0f 1f 00                nop    DWORD PTR [rax]
        tmp.assume_init()
ffff800000029dc0:       48 8b 44 2b 18          mov    rax,QWORD PTR [rbx+rbp*1+0x18]
ffff800000029dc5:       48 89 44 24 30          mov    QWORD PTR [rsp+0x30],rax
ffff800000029dca:       48 8b 44 2b 10          mov    rax,QWORD PTR [rbx+rbp*1+0x10]
ffff800000029dcf:       48 89 44 24 28          mov    QWORD PTR [rsp+0x28],rax
ffff800000029dd4:       48 8b 04 2b             mov    rax,QWORD PTR [rbx+rbp*1]
ffff800000029dd8:       48 8b 4c 2b 08          mov    rcx,QWORD PTR [rbx+rbp*1+0x8]
ffff800000029ddd:       48 89 4c 24 20          mov    QWORD PTR [rsp+0x20],rcx
ffff800000029de2:       48 89 44 24 18          mov    QWORD PTR [rsp+0x18],rax
ffff800000029de7:       49 8b 44 2f 18          mov    rax,QWORD PTR [r15+rbp*1+0x18]
ffff800000029dec:       48 89 44 24 50          mov    QWORD PTR [rsp+0x50],rax
ffff800000029df1:       49 8b 44 2f 10          mov    rax,QWORD PTR [r15+rbp*1+0x10]
ffff800000029df6:       48 89 44 24 48          mov    QWORD PTR [rsp+0x48],rax
ffff800000029dfb:       49 8b 04 2f             mov    rax,QWORD PTR [r15+rbp*1]
ffff800000029dff:       49 8b 4c 2f 08          mov    rcx,QWORD PTR [r15+rbp*1+0x8]
ffff800000029e04:       48 89 4c 24 40          mov    QWORD PTR [rsp+0x40],rcx
ffff800000029e09:       48 89 44 24 38          mov    QWORD PTR [rsp+0x38],rax
ffff800000029e0e:       ba 20 00 00 00          mov    edx,0x20
ffff800000029e13:       4c 89 e7                mov    rdi,r12
ffff800000029e16:       4c 89 ee                mov    rsi,r13
ffff800000029e19:       e8 72 fe ff ff          call   ffff800000029c90 <memcmp>
...
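For context, here is a sketch of the chunked-compare cascade this assembly comes from. The helper names cmp/c16/c32 match src/mem/x86_64.rs (see the diff below), but the body here is an illustrative reconstruction, not the crate's exact code:

use core::mem::size_of;

// Each level of the cascade compares chunks of size_of::<T>() bytes and
// defers the first mismatching chunk and the tail to the next-smaller
// level `f`.
unsafe fn cmp<T, U, F>(mut a: *const T, mut b: *const T, n: usize, f: F) -> i32
where
    T: Copy + Eq,
    F: FnOnce(*const U, *const U, usize) -> i32,
{
    let end = a.add(n / size_of::<T>());
    while a != end {
        // For T = [u128; 2] this `!=` is the comparison that, without SSE,
        // LLVM lowers back into a call to memcmp, hence the recursion.
        if a.read_unaligned() != b.read_unaligned() {
            return f(a.cast(), b.cast(), size_of::<T>());
        }
        a = a.add(1);
        b = b.add(1);
    }
    f(a.cast(), b.cast(), n % size_of::<T>())
}

compare_bytes enters the cascade at c32, so the [u128; 2] mismatch check sits on the hot path for every n >= 32.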

This seems to be caused by the comparison of [u128; 2] values, which internally calls core::intrinsics::raw_eq. The documentation states:

> Above some backend-decided threshold this will emit calls to memcmp, like slice equality does, instead of causing massive code size.
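As a minimal illustration (a sketch, not taken from this issue; the name eq32 is made up), a plain == on [u128; 2] is enough to trigger this lowering when building for x86_64 with SSE disabled:

#[no_mangle]
pub fn eq32(a: &[u128; 2], b: &[u128; 2]) -> bool {
    // Array equality over u128 lowers to core::intrinsics::raw_eq, which
    // LLVM may expand into a call to memcmp above its size threshold.
    a == b
}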

Replacing it with (u128, u128) does not cause a call to memcmp to be emitted, though it severely impacts performance (~2x slower):

[u128; 2]
test memcmp_rust_1048576              ... bench:      24,756 ns/iter (+/- 238) = 42356 MB/s
test memcmp_rust_16                   ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_32                   ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_4096                 ... bench:         111 ns/iter (+/- 0) = 36900 MB/s
test memcmp_rust_64                   ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_8                    ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      27,997 ns/iter (+/- 11,037) = 37453 MB/s
test memcmp_rust_unaligned_15         ... bench:           3 ns/iter (+/- 1) = 5333 MB/s
test memcmp_rust_unaligned_31         ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_unaligned_4095       ... bench:         101 ns/iter (+/- 2) = 40554 MB/s
test memcmp_rust_unaligned_63         ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_unaligned_7          ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
(u128, u128)
test memcmp_rust_1048576              ... bench:      51,088 ns/iter (+/- 782) = 20524 MB/s
test memcmp_rust_16                   ... bench:           4 ns/iter (+/- 0) = 4000 MB/s
test memcmp_rust_32                   ... bench:           5 ns/iter (+/- 0) = 6400 MB/s
test memcmp_rust_4096                 ... bench:         170 ns/iter (+/- 3) = 24094 MB/s
test memcmp_rust_64                   ... bench:           6 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_8                    ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      56,451 ns/iter (+/- 832) = 18574 MB/s
test memcmp_rust_unaligned_15         ... bench:           4 ns/iter (+/- 0) = 4000 MB/s
test memcmp_rust_unaligned_31         ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_unaligned_4095       ... bench:         198 ns/iter (+/- 7) = 20686 MB/s
test memcmp_rust_unaligned_63         ... bench:           6 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_unaligned_7          ... bench:           4 ns/iter (+/- 0) = 2000 MB/s
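For reference, a sketch of why the tuple variant dodges the intrinsic (the function name is illustrative): the derived PartialEq for (u128, u128) compares field by field, so rustc emits two 16-byte compares instead of a single raw_eq over 32 bytes and no memcmp call is generated, at the cost of the ~2x slowdown measured above.

unsafe fn tuple_chunk_eq(a: *const (u128, u128), b: *const (u128, u128)) -> bool {
    let x = a.read_unaligned();
    let y = b.read_unaligned();
    // Equivalent to `x == y`: tuple equality is field-by-field, so no
    // single 32-byte raw_eq comparison is ever produced.
    x.0 == y.0 && x.1 == y.1
}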

Leaving out c32() entirely has a lesser but still significant performance impact (~1.3x slower), though comparisons of smaller strings seem to benefit:

No c32()
test memcmp_rust_1048576              ... bench:      31,405 ns/iter (+/- 426) = 33388 MB/s
test memcmp_rust_16                   ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_32                   ... bench:           3 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_4096                 ... bench:         147 ns/iter (+/- 49) = 27863 MB/s
test memcmp_rust_64                   ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_8                    ... bench:           2 ns/iter (+/- 0) = 4000 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      31,616 ns/iter (+/- 2,652) = 33165 MB/s
test memcmp_rust_unaligned_15         ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_unaligned_31         ... bench:           3 ns/iter (+/- 0) = 10666 MB/s
test memcmp_rust_unaligned_4095       ... bench:         139 ns/iter (+/- 35) = 29467 MB/s
test memcmp_rust_unaligned_63         ... bench:           4 ns/iter (+/- 0) = 16000 MB/s
test memcmp_rust_unaligned_7          ... bench:           2 ns/iter (+/- 0) = 4000 MB/s

Given that targets with SSE(2) do not seem to generate a call to memcmp even in debug mode, it may be worth keeping c32() when that feature is present. I don't know whether it is worth the gamble, though.

diff --git a/src/mem/x86_64.rs b/src/mem/x86_64.rs
index 4d2f6e5..e295e89 100644
--- a/src/mem/x86_64.rs
+++ b/src/mem/x86_64.rs
@@ -143,5 +143,9 @@ pub unsafe fn compare_bytes(a: *const u8, b: *const u8, n: usize) -> i32 {
     let c8 = |a: *const u64, b, n| cmp(a, b, n, c4);
     let c16 = |a: *const u128, b, n| cmp(a, b, n, c8);
     let c32 = |a: *const [u128; 2], b, n| cmp(a, b, n, c16);
-    c32(a.cast(), b.cast(), n)
+    if cfg!(target_feature = "sse2") {
+        c32(a.cast(), b.cast(), n)
+    } else {
+        c16(a.cast(), b.cast(), n)
+    }
 }
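Note that cfg!(target_feature = "sse2") is resolved at compile time for the target compiler_builtins is being built for, so the branch is decided statically and the untaken path is optimized out entirely; no runtime feature detection is involved.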
Demindiro changed the title from "x86_64 memcmp may recurse infinitely" to "x86_64 memcmp may recurse infinitely without SSE" on Jun 10, 2022
@Amanieu
Member

Amanieu commented Jun 10, 2022

I think it's fine to add the code that you suggest, with a clear comment that explains why it was done and that it depends on an internal implementation detail of LLVM. Fundamentally I don't think we have a way to guarantee that LLVM isn't calling the same builtin function recursively.

@bjorn3
Member

bjorn3 commented Jun 10, 2022

This crate is marked with #![no_builtins] which afaik should prevent LLVM from emitting any intrinsic calls, including memcmp calls. I think this recursion should be considered a bug in LLVM.

@Demindiro
Contributor Author

> This crate is marked with #![no_builtins] which afaik should prevent LLVM from emitting any intrinsic calls, including memcmp calls. I think this recursion should be considered a bug in LLVM.

I think the culprit is rustc itself, since it explicitly emits a call to memcmp when lowering raw_eq: https://github.com/rust-lang/rust/blob/c84594661c1b51feb539b479b58bb551fcf8e19a/compiler/rustc_codegen_llvm/src/intrinsic.rs#L331
