Specialize strlen for x86_64. #516

Merged
merged 11 commits into from
Mar 12, 2023

TDecking
Contributor

No description provided.


asm!(
// search for a zero byte
"xor al, al",
Member

xor eax, eax avoids a potential partial register stall. I believe it encodes in the same 2 bytes as xor al, al, so it costs nothing in code size.

"xor al, al",

// unbounded memory region
"xor rcx, rcx",
Member

xor ecx, ecx has the same effect and saves a rex prefix.
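
For reference, the one-byte saving can be read off the standard machine-code encodings of the two forms. The sketch below simply hard-codes those bytes; the constant and function names are illustrative and do not come from the PR. It relies on the x86_64 rule that a write to a 32-bit register zero-extends into the full 64-bit register, which is why both forms leave rcx equal to zero.

```rust
/// Encoding of `xor rcx, rcx`: REX.W prefix (0x48), opcode 0x31, ModRM 0xC9.
const XOR_RCX_RCX: [u8; 3] = [0x48, 0x31, 0xC9];

/// Encoding of `xor ecx, ecx`: same opcode and ModRM byte, no REX prefix.
const XOR_ECX_ECX: [u8; 2] = [0x31, 0xC9];

/// Number of bytes saved by using the 32-bit form.
fn rex_bytes_saved() -> usize {
    XOR_RCX_RCX.len() - XOR_ECX_ECX.len()
}

/// True if the two encodings differ only by the leading REX prefix.
fn differs_only_by_rex() -> bool {
    XOR_RCX_RCX[0] == 0x48 && XOR_RCX_RCX[1..] == XOR_ECX_ECX[..]
}
```

Here `rex_bytes_saved()` evaluates to 1 and `differs_only_by_rex()` to `true`.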

"xor rcx, rcx",
"not rcx",

// forward direction
Member

I believe the direction flag is already guaranteed to be clear (forward direction) on function entry due to ABI requirements.
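
To make the instruction sequence under review concrete, here is a plain-Rust model of what a forward `repne scasb` does when AL is zero and RCX starts at all ones. This is only a sketch: the function name is hypothetical, and the final length recovery (`!rcx - 1`) is an assumption about how such a routine would finish, since the PR's epilogue is not shown in this conversation.

```rust
/// Models `xor rcx, rcx` / `not rcx` / `repne scasb` with AL = 0.
/// `bytes` must contain a zero byte, just as a C string must.
fn strlen_scasb_model(bytes: &[u8]) -> usize {
    let mut rcx: u64 = 0; // xor rcx, rcx
    rcx = !rcx; // not rcx: the count is now u64::MAX, i.e. "unbounded"
    let mut rdi = 0usize; // index standing in for the RDI pointer

    // repne scasb: while RCX != 0, decrement RCX, compare AL with [RDI],
    // advance RDI, and stop once the comparison finds a match.
    while rcx != 0 {
        rcx -= 1;
        let found_zero = bytes[rdi] == 0;
        rdi += 1;
        if found_zero {
            break;
        }
    }

    // RCX was decremented once per scanned byte, terminator included,
    // so !rcx equals len + 1.
    (!rcx - 1) as usize
}
```

For example, `strlen_scasb_model(b"abc\0")` returns 3. The model scans one byte per step, which matches the byte-at-a-time nature of the instruction.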

@Amanieu
Member

Amanieu commented Feb 22, 2023

Have you profiled this to confirm that it is indeed faster than the generic version? I was under the impression that the x86 string instructions tend to have relatively poor performance due to being microcoded.

@TDecking
Contributor Author

Well...

I thought I had benchmarked it when I submitted this PR. It turns out my benchmarking routine had a bug.
bench.zip

I reran the benchmark and uploaded the results along with the program used.
It did show that repne scasb is indeed bad. That said, every other version (apart from the naive implementation) was faster.

I'll rewrite this PR. Please be patient.

@Amanieu
Member

Amanieu commented Feb 23, 2023

Is it really necessary to implement this in assembly? I feel that an implementation that used the SSE intrinsics would be much more readable and easier to maintain.

@TDecking
Contributor Author

The issue lies in the memory accesses themselves.
The code relies on every wide load being properly aligned, so that it can never cross a page boundary. That guarantees the load succeeds even when it reads beyond the terminating zero.
By lowering to assembly I can ensure correct behaviour in every circumstance, but if the loads are written in Rust, there is a problem: any read beyond the terminating zero is considered undefined behaviour.
Yet these kinds of reads are necessary in order to get a speedup.
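
The alignment argument can be checked with a little address arithmetic. The sketch below assumes 4 KiB pages (the usual x86_64 base page size; larger pages only make the property easier to satisfy) and uses hypothetical helper names that do not appear in the PR.

```rust
/// Assumed page size in bytes.
const PAGE_SIZE: usize = 4096;
/// Width of one SSE load in bytes.
const BLOCK: usize = 16;

/// True if a `BLOCK`-byte load starting at `addr` touches two pages.
fn load_crosses_page(addr: usize) -> bool {
    addr / PAGE_SIZE != (addr + BLOCK - 1) / PAGE_SIZE
}

/// Rounds `addr` down to the start of the 16-byte block containing it.
fn align_down(addr: usize) -> usize {
    addr & !(BLOCK - 1)
}
```

An unaligned load at address 4090 would straddle the boundary at 4096, while `align_down(4090)` gives 4080, whose block ends at 4095. Because the aligned block shares a page with at least one valid byte of the string, the load cannot fault, even though part of it may lie outside the string.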

I did find a possible compromise in the snippet below, which uses assembly
only for the memory accesses. That said, it looks like any implementation will either

  • use assembly, or
  • invoke UB.

pub unsafe extern "C" fn strlen(mut s: *const std::ffi::c_char) -> usize {
    use std::arch::x86_64::*;
    use std::arch::asm;

    let mut n = 0;

    // Check the first four bytes one at a time.
    for _ in 0..4 {
        if *s == 0 {
            return n;
        }

        n += 1;
        s = s.add(1);
    }

    // Round `s` down to a 16-byte boundary; `align` is the number of bytes
    // in that block that lie before the current position.
    let align = s as usize & 15;
    let mut s = ((s as usize) - align) as *const __m128i;
    let zero = _mm_set1_epi8(0);

    // The aligned 16-byte load is done in assembly because it may read
    // bytes outside the string.
    let x = {
        let r;
        asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
        r
    };
    // One mask bit per byte equal to zero; shift out the bytes before `s`.
    let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) >> align;

    if cmp != 0 {
        return n + cmp.trailing_zeros() as usize;
    }

    n += 16 - align;
    s = s.add(1);

    // Main loop: scan one aligned 16-byte block per iteration.
    loop {
        let x = {
            let r;
            asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
            r
        };
        let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) as u32;
        if cmp == 0 {
            n += 16;
            s = s.add(1);
        } else {
            return n + cmp.trailing_zeros() as usize;
        }
    }
}
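
The least obvious step in this snippet is the first block, where the aligned load begins before the current string position. The following portable model shows how the compare mask and the `>> align` shift discard those leading bytes. It is a sketch with hypothetical helper names, written without intrinsics so that it runs on any target.

```rust
/// Bit `i` of the result is set when `block[i] == 0`, mirroring
/// `_mm_movemask_epi8(_mm_cmpeq_epi8(x, zero))`.
fn zero_mask(block: &[u8; 16]) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in block.iter().enumerate() {
        if b == 0 {
            mask |= 1 << i;
        }
    }
    mask
}

/// Distance from position `align` to the first zero byte at or after it,
/// or `None` if the rest of the block holds no zero byte.
fn first_zero_from(block: &[u8; 16], align: usize) -> Option<usize> {
    // Shifting right drops the mask bits of bytes that precede the string.
    let cmp = zero_mask(block) >> align;
    if cmp != 0 {
        Some(cmp.trailing_zeros() as usize)
    } else {
        None
    }
}
```

With zero bytes at indices 2 and 9 and `align = 5`, the zero at index 2 is shifted out and the result is `Some(4)`, the distance from index 5 to index 9.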

@Amanieu
Member

Amanieu commented Mar 5, 2023

I think this version using assembly just for the accesses is fine and definitely more readable. Have you benchmarked it?

@TDecking
Contributor Author

TDecking commented Mar 5, 2023

Yes.

Benchmarking was done using this:

use criterion::*;

#[inline(never)]
pub unsafe extern "C" fn strlen_naive(mut s: *const std::ffi::c_char) -> usize {
    let mut n = 0;

    while *s != 0 {
        n += 1;
        s = s.add(1);
    }

    n
}

#[inline(never)]
pub unsafe extern "C" fn strlen_kernel(mut s: *const std::ffi::c_char) -> usize {
    use std::arch::asm;

    let mut n = 0;

    while s as usize & 7 != 0 {
        if *s == 0 {
            return n;
        }

        n += 1;
        s = s.add(1);
    }

    let mut s = s as *const u64;

    loop {
        let mut cs = {
            let r: u64;
            asm!("mov {dest}, [{addr}]", addr = in(reg) s, dest = out(reg) r);
            r
        };
        // Detect if a word has a zero byte, taken from
        // https://graphics.stanford.edu/~seander/bithacks.html
        if (cs.wrapping_sub(0x0101010101010101) & !cs & 0x8080808080808080) != 0 {
            loop {
                if cs & 255 == 0 {
                    return n;
                } else {
                    cs >>= 8;
                    n += 1;
                }
            }
        } else {
            n += 8;
            s = s.add(1);
        }
    }
}

pub unsafe extern "C" fn strlen_sse(mut s: *const std::ffi::c_char) -> usize {
    use std::arch::x86_64::*;
    use std::arch::asm;

    let mut n = 0;

    for _ in 0..4 {
        if *s == 0 {
            return n;
        }

        n += 1;
        s = s.add(1);
    }

    let align = s as usize & 15;
    let mut s = ((s as usize) - align) as *const __m128i;
    let zero = _mm_set1_epi8(0);

    let x = {
        let r;
        asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
        r
    };
    let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) >> align;

    if cmp != 0 {
        return n + cmp.trailing_zeros() as usize;
    }

    n += 16 - align;
    s = s.add(1);

    loop {
        let x = {
            let r;
            asm!("movdqa {dest}, [{addr}]", addr = in(reg) s, dest = out(xmm_reg) r);
            r
        };
        let cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(x, zero)) as u32;
        if cmp == 0 {
            n += 16;
            s = s.add(1);
        } else {
            return n + cmp.trailing_zeros() as usize;
        }
    }
}

fn bench_strlen(c: &mut Criterion, len: usize) {
    let mut v = vec![1i8; len];
    v[len - 1] = 0;

    let mut group = c.benchmark_group(format!("strlen, length {}", len));

    group.bench_function("strlen_naive", |b| b.iter(|| {
        black_box(&mut v);
        let r = unsafe {
            strlen_naive(v.as_ptr())
        };
        assert_eq!(r, len - 1);
    }));

    group.bench_function("strlen_kernel", |b| b.iter(|| {
        black_box(&mut v);
        let r = unsafe {
            strlen_kernel(v.as_ptr())
        };
        assert_eq!(r, len - 1);
    }));

    group.bench_function("strlen_sse", |b| b.iter(|| {
        black_box(&mut v);
        let r = unsafe {
            strlen_sse(v.as_ptr())
        };
        assert_eq!(r, len - 1);
    }));
}

fn bench_strlen_1(c: &mut Criterion) {
    bench_strlen(c, 1)
}
fn bench_strlen_7(c: &mut Criterion) {
    bench_strlen(c, 7)
}
fn bench_strlen_15(c: &mut Criterion) {
    bench_strlen(c, 15)
}
fn bench_strlen_300(c: &mut Criterion) {
    bench_strlen(c, 300)
}
fn bench_strlen_2048(c: &mut Criterion) {
    bench_strlen(c, 2048)
}
fn bench_strlen_10_000(c: &mut Criterion) {
    bench_strlen(c, 10_000)
}
fn bench_strlen_50_000(c: &mut Criterion) {
    bench_strlen(c, 50_000)
}
fn bench_strlen_100_000(c: &mut Criterion) {
    bench_strlen(c, 100_000)
}
fn bench_strlen_1_000_000(c: &mut Criterion) {
    bench_strlen(c, 1_000_000)
}

criterion_group! { bench_strlen_group,
    bench_strlen_1,
    bench_strlen_7,
    bench_strlen_15,
    bench_strlen_300,
    bench_strlen_2048,
    bench_strlen_10_000,
    bench_strlen_50_000,
    bench_strlen_100_000,
    bench_strlen_1_000_000,
}
criterion_main!(bench_strlen_group);
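
The zero-byte test used in `strlen_kernel` is worth isolating. The sketch below wraps the same expression in a helper and reproduces the byte-by-byte fallback that locates the zero. The names `has_zero_byte` and `first_zero_byte` are illustrative and not part of the benchmark program.

```rust
const LO: u64 = 0x0101010101010101;
const HI: u64 = 0x8080808080808080;

/// True if at least one of the eight bytes of `w` is zero.
/// Subtracting 1 from each byte borrows into the high bit only for bytes
/// that were zero (or above 0x80, which `& !w` filters out).
fn has_zero_byte(w: u64) -> bool {
    (w.wrapping_sub(LO) & !w & HI) != 0
}

/// Index of the first zero byte, counting from the least significant byte
/// (the lowest address on little-endian x86_64), or `None` if there is none.
fn first_zero_byte(mut w: u64) -> Option<usize> {
    if !has_zero_byte(w) {
        return None;
    }
    let mut n = 0;
    while w & 255 != 0 {
        w >>= 8;
        n += 1;
    }
    Some(n)
}
```

For instance, `first_zero_byte(u64::from_le_bytes(*b"abc\0defg"))` yields `Some(3)`. The word-level test lets the loop skip eight bytes at a time and fall back to per-byte work only in the final word.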

The results are as follows:

strlen, length 1/strlen_naive
                        time:   [2.5358 ns 2.5532 ns 2.5741 ns]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  8 (8.00%) high mild
strlen, length 1/strlen_kernel
                        time:   [3.3007 ns 3.3748 ns 3.4687 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
strlen, length 1/strlen_sse
                        time:   [1.4226 ns 1.4333 ns 1.4471 ns]
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe

strlen, length 7/strlen_naive
                        time:   [5.1725 ns 5.2056 ns 5.2473 ns]
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe
strlen, length 7/strlen_kernel
                        time:   [5.3166 ns 5.3562 ns 5.4044 ns]
Found 15 outliers among 100 measurements (15.00%)
  5 (5.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
strlen, length 7/strlen_sse
                        time:   [3.1659 ns 3.1847 ns 3.2078 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

strlen, length 15/strlen_naive
                        time:   [9.3916 ns 9.4831 ns 9.5879 ns]
Found 16 outliers among 100 measurements (16.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  7 (7.00%) high severe
strlen, length 15/strlen_kernel
                        time:   [6.1240 ns 6.1765 ns 6.2424 ns]
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
strlen, length 15/strlen_sse
                        time:   [3.1807 ns 3.2066 ns 3.2354 ns]
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

strlen, length 300/strlen_naive
                        time:   [151.06 ns 151.76 ns 152.57 ns]
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
strlen, length 300/strlen_kernel
                        time:   [41.921 ns 42.132 ns 42.433 ns]
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
strlen, length 300/strlen_sse
                        time:   [13.309 ns 13.488 ns 13.691 ns]
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

strlen, length 2048/strlen_naive
                        time:   [963.69 ns 968.02 ns 972.93 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  4 (4.00%) high mild
  2 (2.00%) high severe
strlen, length 2048/strlen_kernel
                        time:   [616.19 ns 658.43 ns 704.35 ns]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
strlen, length 2048/strlen_sse
                        time:   [81.997 ns 82.374 ns 82.831 ns]
Found 18 outliers among 100 measurements (18.00%)
  7 (7.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

strlen, length 10000/strlen_naive
                        time:   [4.6937 µs 4.7299 µs 4.7764 µs]
Found 20 outliers among 100 measurements (20.00%)
  7 (7.00%) low mild
  4 (4.00%) high mild
  9 (9.00%) high severe
strlen, length 10000/strlen_kernel
                        time:   [982.52 ns 986.92 ns 991.75 ns]
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) low mild
  3 (3.00%) high mild
  8 (8.00%) high severe
strlen, length 10000/strlen_sse
                        time:   [347.20 ns 351.65 ns 358.36 ns]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

strlen, length 50000/strlen_naive
                        time:   [23.575 µs 23.913 µs 24.339 µs]
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) low mild
  4 (4.00%) high mild
  6 (6.00%) high severe
strlen, length 50000/strlen_kernel
                        time:   [4.9225 µs 4.9473 µs 4.9736 µs]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe
strlen, length 50000/strlen_sse
                        time:   [2.2877 µs 2.3340 µs 2.3877 µs]
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe

strlen, length 100000/strlen_naive
                        time:   [46.665 µs 46.890 µs 47.178 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low mild
  6 (6.00%) high severe
strlen, length 100000/strlen_kernel
                        time:   [9.8460 µs 9.9094 µs 9.9844 µs]
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe
strlen, length 100000/strlen_sse
                        time:   [4.5188 µs 4.5412 µs 4.5654 µs]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe

strlen, length 1000000/strlen_naive
                        time:   [481.55 µs 494.21 µs 510.94 µs]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  9 (9.00%) high severe
strlen, length 1000000/strlen_kernel
                        time:   [99.799 µs 100.53 µs 101.40 µs]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe
strlen, length 1000000/strlen_sse
                        time:   [48.657 µs 48.952 µs 49.280 µs]
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
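
To put the largest case in perspective, the ratios between the reported times can be computed directly. The constants below are the middle values of the three `length 1000000` intervals above, in microseconds; the constant and function names are illustrative.

```rust
/// Middle time estimates for length 1000000, in microseconds.
const NAIVE_US: f64 = 494.21;
const KERNEL_US: f64 = 100.53;
const SSE_US: f64 = 48.952;

/// How many times faster `candidate_us` is than `baseline_us`.
fn speedup(baseline_us: f64, candidate_us: f64) -> f64 {
    baseline_us / candidate_us
}
```

`speedup(NAIVE_US, SSE_US)` comes to roughly 10.1 and `speedup(KERNEL_US, SSE_US)` to roughly 2.05, so on this machine the SSE version is about ten times faster than the naive loop and about twice as fast as the 8-byte word loop for long strings.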

@Amanieu
Member

Amanieu commented Mar 5, 2023

Perfect! I'll merge this once you update the PR to only have the memory access as assembly.

@TDecking
Contributor Author

TDecking commented Mar 7, 2023

@Amanieu The PR is ready.

@Amanieu Amanieu merged commit b788cf3 into rust-lang:master Mar 12, 2023