Skip to content

There exists significantly faster division algorithms for certain CPUs #265

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
AaronKutch opened this issue Nov 29, 2018 · 8 comments
Closed

Comments

@AaronKutch
Copy link
Contributor

Over the past few months I have been looking at different division algorithms, and I believe that this one is the fastest for 128 bit division on CPUs with a fast 64 bit hardware divider. I made a crate for it, specialized-div-rem. I initially thought that the crate would have more than one algorithm in it, but it turns out that the compiler likes inlining the appropriate branches in cases where its important.
The only thing I might add to it is a simple binary division algorithm for when the most significant bits of the dividend and divisor are <6 bits from each other and located in the higher 64 bits. I don't know if it is worth it to have more conditional branches for that use case, I think this algorithm has the fewest branches encountered by program flow of any algorithm out there.
I also expect that the 64 bit division algorithm in that crate is useful for 32 bit computers but I don't have one to test it out.
Sidenote that I also made a more general version for the apint crate but I need to put much more recursive work into that one before it becomes competitive.

@alexcrichton
Copy link
Member

This sounds pretty slick! We're always up for taking better algorithms here :)

@AaronKutch
Copy link
Contributor Author

AaronKutch commented Dec 27, 2018

Is there a quick way to determine the performance difference between having or not having indexing checks (so that I do not have to replace all the vector[index] with unsafe{vector.get_unchecked(index)})? I would not do this to the algorithm if it made its way into Rust unless it was formally proven, but I am interested in finding out the real perf differences in this and other projects. I tried setting the panic hook to std::hint::unreachable_unchecked(), but enabling LTO and -O3 doesn't seem to inline propagate the unreachable. It seems crazy, but should there be another possible Cargo.toml panic key, say panic = undefined?

@alexcrichton
Copy link
Member

Currently there's no build-time configuration for that, it needs to be changed in the code itself

@AaronKutch
Copy link
Contributor Author

AaronKutch commented Jan 19, 2019

I just published version 0.0.4 of my crate, which I think will be the last one I will ever publish until someone else finds an issue with it. I inspected the assembly output and perf with and without unchecked division, and the performance difference was at most a few percent. I do not think I can improve it anymore except with SIMD, but even then the perf difference would only be at most a few percent for certain blocks. Additionally, in practice LLVM appears to inline my function properly whenever only the quotient or only the remainder is used, so I do not plan on adding any extra functions.
Edit: to clarify, I temporarily tested the crate with unsafe unchecked divisions, but the final version I published is completely safe

@AaronKutch
Copy link
Contributor Author

I discovered that on x86_64 there is a divq assembly function which allows for 128 by 64 bit divisions, but it is unlikely the compiler would use it due to the fact that it will throw a floating point error if the quotient does not fit in 64 bits. I added an asm flag to my crate specialized-div-rem and got a decent speed up for some cases.
I am seeing a speed improvement such that what Rust is currently using takes 2 to 8 times longer than my algorithm!

@AaronKutch
Copy link
Contributor Author

What steps do I need to take if I want to see the Rust u/i128 divisions on x86_64 use my algorithm?

@alexcrichton
Copy link
Member

The intrinsics called for u128 divisions (such as __udivti3) will need to be updated in this crate.

@AaronKutch
Copy link
Contributor Author

fixed by #332

tgross35 pushed a commit to tgross35/compiler-builtins that referenced this issue Feb 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants