Specialize `strlen` for `x86_64`. #516

TDecking · 2023-02-21T22:27:37Z

No description provided.

bjorn3 · 2023-02-21T22:44:05Z

src/mem/x86_64.rs

+
+    asm!(
+        // search for a zero byte
+        "xor al, al",


`xor eax, eax` avoids a potential partial register stall and is 1 byte shorter, I believe.

bjorn3 · 2023-02-21T22:45:34Z

src/mem/x86_64.rs

+        "xor al, al",
+
+        // unbounded memory region
+        "xor rcx, rcx",


`xor ecx, ecx` has the same effect and saves a REX prefix.
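
The suggestion relies on the x86-64 rule that a write to a 32-bit register zero-extends into the full 64-bit register. A minimal sketch of that behaviour follows. The function name `zero_via_32bit_xor` is made up for illustration, and the non-x86_64 fallback exists only so the snippet compiles on other targets.

```rust
#[cfg(target_arch = "x86_64")]
pub fn zero_via_32bit_xor() -> u64 {
    use std::arch::asm;

    // Start with every bit set so that a partial clear would be visible.
    let mut r: u64 = u64::MAX;
    unsafe {
        // `{r:e}` prints the 32-bit name of the allocated register (e.g. `ecx`).
        asm!("xor {r:e}, {r:e}", r = inout(reg) r, options(nomem, nostack));
    }
    // The upper 32 bits were zeroed by the 32-bit write as well.
    r
}

// Hypothetical fallback for other targets; it demonstrates nothing by itself.
#[cfg(not(target_arch = "x86_64"))]
pub fn zero_via_32bit_xor() -> u64 {
    0
}
```

Calling `zero_via_32bit_xor()` on x86_64 returns 0, even though the register held `u64::MAX` before the 32-bit `xor`.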

bjorn3 · 2023-02-21T22:46:39Z

src/mem/x86_64.rs

+        "xor rcx, rcx",
+        "not rcx",
+
+        // forward direction


I believe this is already guaranteed to be set to the forward direction due to ABI requirements.
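
As a rough way to check this, the direction flag (DF, bit 10 of RFLAGS) can be read from inside an `asm!` block. The sketch below assumes an x86_64 target; `direction_flag_is_clear` is a hypothetical helper name, and the fallback for other targets is only there so the snippet compiles.

```rust
#[cfg(target_arch = "x86_64")]
pub fn direction_flag_is_clear() -> bool {
    use std::arch::asm;

    let flags: u64;
    unsafe {
        // Push RFLAGS and pop it into a general-purpose register.
        // This touches the stack, so `nostack` must not be specified.
        asm!("pushfq", "pop {flags}", flags = out(reg) flags, options(preserves_flags));
    }
    // DF is bit 10; a clear bit means string instructions move forward.
    flags & (1 << 10) == 0
}

// Hypothetical fallback: other targets have no x86 direction flag to inspect.
#[cfg(not(target_arch = "x86_64"))]
pub fn direction_flag_is_clear() -> bool {
    true
}
```

If the guarantee holds, an explicit `cld` before a string instruction is redundant.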

Amanieu · 2023-02-22T14:19:24Z

Have you profiled this to confirm that it is indeed faster than the generic version? I was under the impression that the x86 string instructions tend to have relatively poor performance due to being microcoded.

TDecking · 2023-02-22T15:26:56Z

Well...

I thought I benched it when I submitted this PR. Turns out my benchmarking routine had a bug.
bench.zip

I reran the benchmark and uploaded the results along with the program used.
It showed that `repne scasb` is indeed bad. That said, every other version (other than the naive implementation)
was faster.

I'll rewrite this PR. Please be patient.
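
For context, a `repne scasb` based `strlen` can be sketched as below. This is an illustration of the general technique, not the PR's original code, which is only partly visible in the review snippets above. The name `strlen_scasb` and the fallback for other targets are assumptions.

```rust
#[cfg(target_arch = "x86_64")]
pub unsafe fn strlen_scasb(s: *const u8) -> usize {
    use std::arch::asm;

    let end: *const u8;
    unsafe {
        asm!(
            // Compare AL with [RDI], advance RDI, and repeat until a match.
            "repne scasb",
            inout("rdi") s => end,
            // Effectively unbounded count; the search stops at the zero byte.
            inout("rcx") usize::MAX => _,
            // AL = 0 is the byte being searched for.
            in("eax") 0u32,
            options(readonly, nostack),
        );
    }
    // RDI ends up one past the terminating zero.
    (end as usize) - (s as usize) - 1
}

// Hypothetical fallback so the snippet compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub unsafe fn strlen_scasb(s: *const u8) -> usize {
    let mut n = 0;
    while unsafe { *s.add(n) } != 0 {
        n += 1;
    }
    n
}
```

The instruction examines one byte per iteration, which fits the observation that it loses to the word-at-a-time and SSE versions.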

src/mem/x86_64.rs

Amanieu · 2023-02-23T18:19:01Z

Is it really necessary to implement this in assembly? I feel that an implementation that used the SSE intrinsics would be much more readable and easier to maintain.

TDecking · 2023-02-27T17:39:49Z

The issue lies in the memory accesses.
The code relies on the accesses being properly aligned
so that they never cross a page boundary, which ensures the load succeeds
even if it reads beyond the terminating zero.
By lowering to assembly I can ensure correct behaviour of the operation in every
circumstance, but if the accesses are written in Rust, there is a problem:
Any read beyond the terminating zero is considered undefined behaviour.
But in order to get a speedup, these kinds of reads are necessary.
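
The alignment argument can be checked with plain address arithmetic. The sketch below assumes 4 KiB pages (the smallest x86_64 page size); `PAGE_SIZE`, `crosses_page`, and `aligned_16_byte_loads_never_cross` are illustrative names, not part of the PR.

```rust
/// Page size assumed for illustration.
const PAGE_SIZE: usize = 4096;

/// Returns true if the byte range `[addr, addr + len)` spans two pages.
pub fn crosses_page(addr: usize, len: usize) -> bool {
    addr / PAGE_SIZE != (addr + len - 1) / PAGE_SIZE
}

/// Checks every 16-byte-aligned address across two pages.
pub fn aligned_16_byte_loads_never_cross() -> bool {
    (0..2 * PAGE_SIZE)
        .step_by(16)
        .all(|addr| !crosses_page(addr, 16))
}
```

Since 16 divides the page size, an aligned 16-byte block always lies within a single page, so a load that contains at least one valid byte cannot fault. An unaligned load such as `crosses_page(4090, 16)` does span two pages.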

I did find a possible compromise in the snippet below, which uses assembly
only for the memory accesses. That said, it looks like any implementation will either

- use assembly, or
- invoke UB

pub unsafe extern "C" fn strlen(mut s: *const std::ffi::c_char) -> usize {
    use std::arch::x86_64::*;
    use std::arch::asm;

    let mut n = 0;

    for _ in 0..4 {
        if *s == 0 {
            return n;
        }

        n += 1;
        s = s.add(1);
    }

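    // Round the pointer down to a 16-byte boundary; `align` is the number of
    // bytes in that block that precede the current position.
    let align = s as usize & 15;
    let mut s = ((s as usize) - align) as *const __m128i;
    let zero = _mm_set1_epi8(0);

    // Load the first block through assembly, so the bytes outside the string
    // are never read by Rust code.
    let x = {
        let r;
        asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
        r
    };
    // One mask bit per byte equal to zero; shifting by `align` discards the
    // bytes before the current position.
    let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) >> align;

    if cmp != 0 {
        return n + cmp.trailing_zeros() as usize;
    }

    n += 16 - align;
    s = s.add(1);

    // Scan the remaining aligned blocks, 16 bytes at a time.
    loop {
        let x = {
            let r;
            asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
            r
        };
        let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) as u32;
        if cmp == 0 {
            n += 16;
            s = s.add(1);
        } else {
            return n + cmp.trailing_zeros() as usize;
        }
    }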
}
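
The compare, movemask, and shift step in the snippet above can be tried in isolation on a buffer that is fully in bounds. This is a sketch only: it uses an ordinary intrinsic load instead of the assembly load, and `first_zero_from` and its fallback are hypothetical names for illustration.

```rust
/// Offset (relative to `start`) of the first zero byte at or after `start`.
#[cfg(target_arch = "x86_64")]
pub fn first_zero_from(block: &[u8; 16], start: usize) -> Option<usize> {
    use std::arch::x86_64::*;

    assert!(start < 16);
    unsafe {
        // In-bounds unaligned load; SSE2 is part of the x86_64 baseline.
        let x = _mm_loadu_si128(block.as_ptr() as *const __m128i);
        // Bit i of the mask is set when byte i equals zero.
        let mask = _mm_movemask_epi8(_mm_cmpeq_epi8(x, _mm_setzero_si128())) as u32;
        // Drop the bits that belong to bytes before `start`.
        let mask = mask >> start;
        if mask == 0 {
            None
        } else {
            Some(mask.trailing_zeros() as usize)
        }
    }
}

// Hypothetical scalar fallback so the snippet compiles on other targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn first_zero_from(block: &[u8; 16], start: usize) -> Option<usize> {
    assert!(start < 16);
    block[start..].iter().position(|&b| b == 0)
}
```

With zero bytes at indices 0 and 3, starting at index 1 skips the first zero and reports an offset of 2, which mirrors how the shift by `align` ignores bytes that precede the string.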

Amanieu · 2023-03-05T16:04:13Z

I think this version using assembly just for the accesses is fine and definitely more readable. Have you benchmarked it?

TDecking · 2023-03-05T17:58:26Z

Yes.

Benchmarking was done using this:

use criterion::*;

#[inline(never)]
pub unsafe extern "C" fn strlen_naive(mut s: *const std::ffi::c_char) -> usize {
    let mut n = 0;

    while *s != 0 {
        n += 1;
        s = s.add(1);
    }

    n
}

#[inline(never)]
pub unsafe extern "C" fn strlen_kernel(mut s: *const std::ffi::c_char) -> usize {
    use std::arch::asm;

    let mut n = 0;

    while s as usize & 7 != 0 {
        if *s == 0 {
            return n;
        }

        n += 1;
        s = s.add(1);
    }

    let mut s = s as *const u64;

    loop {
        let mut cs = {
            let r: u64;
            asm!("mov {dest}, [{addr}]", addr = in(reg) s, dest = out(reg) r);
            r
        };
        // Detect if a word has a zero byte, taken from
        // https://graphics.stanford.edu/~seander/bithacks.html
        if (cs.wrapping_sub(0x0101010101010101) & !cs & 0x8080808080808080) != 0 {
            loop {
                if cs & 255 == 0 {
                    return n;
                } else {
                    cs >>= 8;
                    n += 1;
                }
            }
        } else {
            n += 8;
            s = s.add(1);
        }
    }
}

pub unsafe extern "C" fn strlen_sse(mut s: *const std::ffi::c_char) -> usize {
    use std::arch::x86_64::*;
    use std::arch::asm;

    let mut n = 0;

    for _ in 0..4 {
        if *s == 0 {
            return n;
        }

        n += 1;
        s = s.add(1);
    }

    let align = s as usize & 15;
    let mut s = ((s as usize) - align) as *const __m128i;
    let zero = _mm_set1_epi8(0);

    let x = {
        let r;
        asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
        r
    };
    let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) >> align;

    if cmp != 0 {
        return n + cmp.trailing_zeros() as usize;
    }

    n += 16 - align;
    s = s.add(1);

    loop {
        let x = {
            let r;
            asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
            r
        };
        let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) as u32;
        if cmp == 0 {
            n += 16;
            s = s.add(1);
        } else {
            return n + cmp.trailing_zeros() as usize;
        }
    }
}

fn bench_strlen(c: &mut Criterion, len: usize) {
    let mut v = vec![1i8; len];
    v[len - 1] = 0;

    let mut group = c.benchmark_group(format!("strlen, length {}", len));

    group.bench_function("strlen_naive", |b| b.iter(|| {
        black_box(&mut v);
        let r = unsafe {
            strlen_naive(v.as_ptr())
        };
        assert_eq!(r, len - 1);
    }));

    group.bench_function("strlen_kernel", |b| b.iter(|| {
        black_box(&mut v);
        let r = unsafe {
            strlen_kernel(v.as_ptr())
        };
        assert_eq!(r, len - 1);
    }));

    group.bench_function("strlen_sse", |b| b.iter(|| {
        black_box(&mut v);
        let r = unsafe {
            strlen_sse(v.as_ptr())
        };
        assert_eq!(r, len - 1);
    }));
}

fn bench_strlen_1(c: &mut Criterion) {
    bench_strlen(c, 1)
}
fn bench_strlen_7(c: &mut Criterion) {
    bench_strlen(c, 7)
}
fn bench_strlen_15(c: &mut Criterion) {
    bench_strlen(c, 15)
}
fn bench_strlen_300(c: &mut Criterion) {
    bench_strlen(c, 300)
}
fn bench_strlen_2048(c: &mut Criterion) {
    bench_strlen(c, 2048)
}
fn bench_strlen_10_000(c: &mut Criterion) {
    bench_strlen(c, 10_000)
}
fn bench_strlen_50_000(c: &mut Criterion) {
    bench_strlen(c, 50_000)
}
fn bench_strlen_100_000(c: &mut Criterion) {
    bench_strlen(c, 100_000)
}
fn bench_strlen_1_000_000(c: &mut Criterion) {
    bench_strlen(c, 1_000_000)
}

criterion_group! { bench_strlen_group,
    bench_strlen_1,
    bench_strlen_7,
    bench_strlen_15,
    bench_strlen_300,
    bench_strlen_2048,
    bench_strlen_10_000,
    bench_strlen_50_000,
    bench_strlen_100_000,
    bench_strlen_1_000_000,
}
criterion_main!(bench_strlen_group);
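
The zero-byte test used in `strlen_kernel` can be looked at on its own. The sketch below restates the bit trick from the linked bithacks page; the helper names `has_zero_byte` and `first_zero_byte` are made up for illustration, and the byte index assumes the little-endian order that the inner loop in `strlen_kernel` walks.

```rust
const LO: u64 = 0x0101_0101_0101_0101;
const HI: u64 = 0x8080_8080_8080_8080;

/// True if any of the eight bytes in `word` is zero.
pub fn has_zero_byte(word: u64) -> bool {
    // Subtracting 1 from each byte borrows into its high bit only when the
    // byte was zero (or a borrow arrived from a lower zero byte); `!word`
    // filters out bytes whose high bit was already set.
    (word.wrapping_sub(LO) & !word & HI) != 0
}

/// Index of the lowest-addressed zero byte, counting from the low end.
pub fn first_zero_byte(word: u64) -> Option<usize> {
    if !has_zero_byte(word) {
        return None;
    }
    word.to_le_bytes().iter().position(|&b| b == 0)
}
```

The test is exact for the question "is there a zero byte at all", which is why the benchmark code only falls back to a byte-by-byte scan after it fires.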

The results are as follows:

strlen, length 1/strlen_naive
                        time:   [2.5358 ns 2.5532 ns 2.5741 ns]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  8 (8.00%) high mild
strlen, length 1/strlen_kernel
                        time:   [3.3007 ns 3.3748 ns 3.4687 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
strlen, length 1/strlen_sse
                        time:   [1.4226 ns 1.4333 ns 1.4471 ns]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

strlen, length 7/strlen_naive
                        time:   [5.1725 ns 5.2056 ns 5.2473 ns]
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe
strlen, length 7/strlen_kernel
                        time:   [5.3166 ns 5.3562 ns 5.4044 ns]
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
strlen, length 7/strlen_sse
                        time:   [3.1659 ns 3.1847 ns 3.2078 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

strlen, length 15/strlen_naive
                        time:   [9.3916 ns 9.4831 ns 9.5879 ns]
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe
strlen, length 15/strlen_kernel
                        time:   [6.1240 ns 6.1765 ns 6.2424 ns]
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
strlen, length 15/strlen_sse
                        time:   [3.1807 ns 3.2066 ns 3.2354 ns]
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

strlen, length 300/strlen_naive
                        time:   [151.06 ns 151.76 ns 152.57 ns]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
strlen, length 300/strlen_kernel
                        time:   [41.921 ns 42.132 ns 42.433 ns]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
strlen, length 300/strlen_sse
                        time:   [13.309 ns 13.488 ns 13.691 ns]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

strlen, length 2048/strlen_naive
                        time:   [963.69 ns 968.02 ns 972.93 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
strlen, length 2048/strlen_kernel
                        time:   [616.19 ns 658.43 ns 704.35 ns]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
strlen, length 2048/strlen_sse
                        time:   [81.997 ns 82.374 ns 82.831 ns]
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

strlen, length 10000/strlen_naive
                        time:   [4.6937 µs 4.7299 µs 4.7764 µs]
Found 20 outliers among 100 measurements (20.00%)
  7 (7.00%) low mild
  4 (4.00%) high mild
  9 (9.00%) high severe
strlen, length 10000/strlen_kernel
                        time:   [982.52 ns 986.92 ns 991.75 ns]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe
strlen, length 10000/strlen_sse
                        time:   [347.20 ns 351.65 ns 358.36 ns]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

strlen, length 50000/strlen_naive
                        time:   [23.575 µs 23.913 µs 24.339 µs]
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe
strlen, length 50000/strlen_kernel
                        time:   [4.9225 µs 4.9473 µs 4.9736 µs]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
strlen, length 50000/strlen_sse
                        time:   [2.2877 µs 2.3340 µs 2.3877 µs]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

strlen, length 100000/strlen_naive
                        time:   [46.665 µs 46.890 µs 47.178 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  6 (6.00%) high severe
strlen, length 100000/strlen_kernel
                        time:   [9.8460 µs 9.9094 µs 9.9844 µs]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe
strlen, length 100000/strlen_sse
                        time:   [4.5188 µs 4.5412 µs 4.5654 µs]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

strlen, length 1000000/strlen_naive
                        time:   [481.55 µs 494.21 µs 510.94 µs]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  9 (9.00%) high severe
strlen, length 1000000/strlen_kernel
                        time:   [99.799 µs 100.53 µs 101.40 µs]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
strlen, length 1000000/strlen_sse
                        time:   [48.657 µs 48.952 µs 49.280 µs]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

Amanieu · 2023-03-05T23:17:31Z

Perfect! I'll merge this once you update the PR to only have the memory access as assembly.

TDecking · 2023-03-07T12:42:44Z

@Amanieu The PR is ready.

TDecking added 3 commits February 21, 2023 23:13

Specialize strlen for x86_64.

2a67ad7

Correct path.

7711331

Update path for argument.

1fdf932

bjorn3 reviewed Feb 21, 2023

TDecking added 2 commits February 22, 2023 00:07

Improve assembly quality + AT&T syntax.

0a0fa0b

Remove superfluous comment.

1a2f3b2

Change implementation to SSE

7e4742d

bjorn3 reviewed Feb 22, 2023

TDecking added 2 commits February 22, 2023 22:16

Provide a non-sse version for x86_64.

9c0a19c

Formatting

afa3d3e

TDecking added 3 commits March 6, 2023 19:20

Final version.

1df0d1c

formatting

4f77170

more fixing

6488b26

Amanieu merged commit b788cf3 into rust-lang:master Mar 12, 2023
