There exists significantly faster division algorithms for certain CPUs #265

AaronKutch · 2018-11-29T01:26:30Z

Over the past few months I have been looking at different division algorithms, and I believe that this one is the fastest for 128-bit division on CPUs with a fast 64-bit hardware divider. I made a crate for it, specialized-div-rem. I initially thought that the crate would have more than one algorithm in it, but it turns out that the compiler tends to inline the appropriate branches in the cases where it's important.
The only thing I might add to it is a simple binary division algorithm for when the most significant bits of the dividend and divisor are fewer than 6 bits apart and both located in the upper 64 bits. I don't know if that use case is worth the extra conditional branches; I think this algorithm already has the fewest branches encountered during program flow of any algorithm out there.
I also expect the 64-bit division algorithm in that crate to be useful on 32-bit computers, but I don't have one to test it on.
As a side note, I also made a more general version for the apint crate, but I need to put much more recursive work into that one before it becomes competitive.
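To make the idea of leaning on a fast 64-bit hardware divider concrete, here is a minimal sketch. It is not the algorithm from specialized-div-rem; the function name `div_rem_sketch` is hypothetical, and the slow path is a plain shift-and-subtract binary long division chosen only because it is short and easy to verify.

```rust
/// Hypothetical helper (not the specialized-div-rem API).
/// Returns `(quotient, remainder)` and panics if `div` is zero.
pub fn div_rem_sketch(duo: u128, div: u128) -> (u128, u128) {
    assert!(div != 0, "division by zero");

    // Fast path: both operands fit in 64 bits, so a single 64-bit
    // hardware division (plus a remainder) is enough.
    if (duo >> 64) == 0 && (div >> 64) == 0 {
        let (a, b) = (duo as u64, div as u64);
        return ((a / b) as u128, (a % b) as u128);
    }

    // Trivial case: the dividend is smaller than the divisor.
    if duo < div {
        return (0, duo);
    }

    // Slow path: binary long division. Align the divisor's most
    // significant bit with the dividend's, then shift and subtract.
    let shift = div.leading_zeros() - duo.leading_zeros();
    let mut shifted_div = div << shift;
    let mut quo = 0u128;
    let mut rem = duo;
    for _ in 0..=shift {
        quo <<= 1;
        if rem >= shifted_div {
            rem -= shifted_div;
            quo |= 1;
        }
        shifted_div >>= 1;
    }
    (quo, rem)
}
```

The slow path runs one iteration per bit of distance between the two most significant bits, which is why binary division is only attractive when the operands are close in magnitude, as in the case described above.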


alexcrichton · 2018-11-29T15:38:45Z

This sounds pretty slick! We're always up for taking better algorithms here :)

AaronKutch · 2018-12-27T19:50:34Z

Is there a quick way to measure the performance difference between having and not having index bounds checks (so that I do not have to replace every vector[index] with unsafe { vector.get_unchecked(index) })? I would not do this to the algorithm if it made its way into Rust unless it was formally proven, but I am interested in finding out the real perf differences in this and other projects. I tried setting the panic hook to std::hint::unreachable_unchecked(), but even with LTO and -O3 enabled, the unreachable does not seem to get inlined and propagated. It seems crazy, but should there be another possible value for the Cargo.toml panic key, say panic = undefined?
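One way to get a rough answer without touching the real code is to time a small stand-in kernel written both ways. The sketch below is only an illustration: `sum_checked`, `sum_unchecked`, and `time_it` are hypothetical names, and a micro-benchmark like this will not necessarily predict the difference in a real division routine.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Gathers elements with ordinary indexing (bounds checked).
pub fn sum_checked(v: &[u64], idx: &[usize]) -> u64 {
    let mut s = 0u64;
    for &i in idx {
        s = s.wrapping_add(v[i]);
    }
    s
}

/// Same loop using `get_unchecked`.
///
/// # Safety
/// Every element of `idx` must be less than `v.len()`.
pub unsafe fn sum_unchecked(v: &[u64], idx: &[usize]) -> u64 {
    let mut s = 0u64;
    for &i in idx {
        // SAFETY: the caller guarantees `i < v.len()`.
        s = s.wrapping_add(unsafe { *v.get_unchecked(i) });
    }
    s
}

/// Runs `f` `reps` times and returns the elapsed wall-clock time.
/// `black_box` keeps the optimizer from discarding the result.
pub fn time_it<F: FnMut() -> u64>(mut f: F, reps: u32) -> Duration {
    let start = Instant::now();
    for _ in 0..reps {
        black_box(f());
    }
    start.elapsed()
}
```

Calling `time_it(|| sum_checked(&v, &idx), 1_000)` and `time_it(|| unsafe { sum_unchecked(&v, &idx) }, 1_000)` in a release build gives two durations to compare; the result should be treated as a rough indication only.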

alexcrichton · 2019-01-02T15:45:40Z

Currently there's no build-time configuration for that; it needs to be changed in the code itself.

AaronKutch · 2019-01-19T07:14:59Z

I just published version 0.0.4 of my crate, which I think will be the last one I publish unless someone else finds an issue with it. I inspected the assembly output and perf with and without unchecked division, and the performance difference was at most a few percent. I do not think I can improve it any further except with SIMD, but even then the perf difference would be at most a few percent for certain blocks. Additionally, in practice LLVM appears to inline my function properly whenever only the quotient or only the remainder is used, so I do not plan on adding any extra functions.
Edit: to clarify, I temporarily tested the crate with unsafe unchecked divisions, but the final version I published is completely safe.
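The point about not needing separate quotient-only and remainder-only functions can be illustrated with a small sketch. The names below are hypothetical, and the body of `div_rem` uses the native operators as a stand-in for the crate's algorithm; whether the unused half is actually eliminated depends on the optimizer and should be confirmed by inspecting the generated assembly (for example with `cargo rustc --release -- --emit asm`).

```rust
/// Stand-in for a combined division routine (hypothetical name).
/// Returns `(quotient, remainder)`.
#[inline]
pub fn div_rem(duo: u128, div: u128) -> (u128, u128) {
    (duo / div, duo % div)
}

/// Uses only the quotient. Once `div_rem` is inlined, the optimizer
/// is free to drop any work that only feeds the remainder.
pub fn quotient_only(duo: u128, div: u128) -> u128 {
    div_rem(duo, div).0
}

/// Uses only the remainder.
pub fn remainder_only(duo: u128, div: u128) -> u128 {
    div_rem(duo, div).1
}
```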

AaronKutch · 2019-07-07T01:07:33Z

I discovered that on x86_64 there is a divq instruction which performs 128-by-64-bit divisions, but the compiler is unlikely to use it because it raises a divide error (reported on Linux as a floating point exception) if the quotient does not fit in 64 bits. I added an asm flag to my crate specialized-div-rem and got a decent speedup for some cases.
I am seeing a speed improvement such that what Rust currently uses takes 2 to 8 times longer than my algorithm!
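As a rough illustration of the 128-by-64-bit division instruction, the sketch below wraps the x86_64 `div` instruction with `core::arch::asm!` and guards it with a check that the quotient fits in 64 bits. The names `div_128_by_64` and `raw_div` are hypothetical and this is not the code from specialized-div-rem; on other architectures the sketch falls back to ordinary `u128` arithmetic so that it still compiles.

```rust
/// Divides the 128-bit value `hi:lo` by `div` with the x86_64 `div`
/// instruction, which reads rdx:rax and writes the quotient to rax
/// and the remainder to rdx.
///
/// # Safety
/// Requires `hi < div`; otherwise the CPU raises a divide error.
#[cfg(target_arch = "x86_64")]
unsafe fn raw_div(lo: u64, hi: u64, div: u64) -> (u64, u64) {
    let quo: u64;
    let rem: u64;
    // SAFETY: the caller guarantees `hi < div`, so the quotient fits.
    unsafe {
        core::arch::asm!(
            "div {0}",
            in(reg) div,
            inlateout("rax") lo => quo,
            inlateout("rdx") hi => rem,
            options(pure, nomem, nostack)
        );
    }
    (quo, rem)
}

/// Portable fallback with the same contract, for non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
unsafe fn raw_div(lo: u64, hi: u64, div: u64) -> (u64, u64) {
    let duo = ((hi as u128) << 64) | lo as u128;
    let d = div as u128;
    ((duo / d) as u64, (duo % d) as u64)
}

/// Returns `Some((quotient, remainder))` when the quotient fits in
/// 64 bits, and `None` otherwise (including when `div` is zero).
pub fn div_128_by_64(duo: u128, div: u64) -> Option<(u64, u64)> {
    let hi = (duo >> 64) as u64;
    let lo = duo as u64;
    // The quotient fits in 64 bits exactly when `hi < div`.
    if hi >= div {
        return None;
    }
    // SAFETY: `hi < div` was checked above.
    Some(unsafe { raw_div(lo, hi, div) })
}
```

The `hi < div` guard is what a full 128-bit algorithm has to establish before it can use this instruction, which is one reason a compiler will not emit it for a general 128-bit division on its own.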

AaronKutch · 2019-12-08T22:56:07Z

What steps do I need to take if I want to see the Rust u/i128 divisions on x86_64 use my algorithm?

alexcrichton · 2019-12-09T17:58:01Z

The intrinsics called for u128 divisions (such as __udivti3) will need to be updated in this crate.

AaronKutch · 2020-09-12T02:23:25Z

fixed by #332

AaronKutch closed this as completed Sep 12, 2020

tgross35 pushed a commit to tgross35/compiler-builtins that referenced this issue Feb 23, 2025

Merge pull request rust-lang#265 from ankane/no_panic

c108db9
