rustc_session: be more precise about -Z plt=yes on x86-64? #141720

Open
iximeow opened this issue May 29, 2025 · 6 comments
Labels
A-codegen Area: Code generation A-linkage Area: linking into static, shared libraries and binaries C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_64 Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@iximeow
Contributor

iximeow commented May 29, 2025

in #109982 rustc switched to -Z plt=yes on non-x86-64 platforms for a bunch of good reasons. and stuck with -Z plt=no by default on x86-64 for also good reasons! unfortunately, defaulting to -Z plt=no is a slight pessimization in programs heavily dependent on calls into statically linked libraries.

PLT calls on x86 end up compiled to e8 <addr> calls, which at link time can be rewritten to direct calls to the callee, and presumably deletion of the GOT entry. when we skip the PLT on x86-64, it seems that linkers are unwilling to do a link-time optimization of ff 15 <GOT addr> into 90 e8 <fn addr> when the callee is local to the object, so an indirect call to the object-local persists*.
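to make the byte-level rewrite concrete, here's a minimal sketch of what that relaxation would do to the six bytes at a call site. `relax_got_call` and the addresses are made up for illustration; this models the `90 e8` form described above and is not taken from any linker's source (real linkers may pad the spare byte differently). the rel32 of an `e8` call is a signed 32-bit offset from the end of the instruction, so the sketch gives up when the callee is out of that range.

```rust
/// Rewrite `ff 15 <disp32>` (call through a GOT slot) into `90 e8 <rel32>`
/// (nop + direct call). `site` is the (hypothetical) address of the first
/// byte of the original instruction, `target` the address of the callee.
/// Returns None if the bytes are not an `ff 15` call or if the callee is
/// too far away for a signed 32-bit displacement.
fn relax_got_call(insn: [u8; 6], site: u64, target: u64) -> Option<[u8; 6]> {
    if insn[0] != 0xff || insn[1] != 0x15 {
        return None;
    }
    // Both encodings are 6 bytes long, so the next instruction is at site + 6.
    let next = site.wrapping_add(6);
    let delta = (target as i64).wrapping_sub(next as i64);
    let rel32 = i32::try_from(delta).ok()?;

    let mut out: [u8; 6] = [0x90, 0xe8, 0, 0, 0, 0];
    out[2..].copy_from_slice(&rel32.to_le_bytes());
    Some(out)
}

fn main() {
    // Illustrative addresses only: call site at 0x401000, callee at 0x402000.
    let indirect = [0xff, 0x15, 0x10, 0x20, 0x00, 0x00];
    let direct = relax_got_call(indirect, 0x40_1000, 0x40_2000);
    // rel32 = 0x402000 - 0x401006 = 0xffa
    assert_eq!(direct, Some([0x90, 0xe8, 0xfa, 0x0f, 0x00, 0x00]));
    println!("{:02x?}", direct.unwrap());
}
```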

i expect -Z plt=no to be better than -Z plt=yes on x86-64 for all cases where the called functions are dynamically linked. i also expect -Z plt=no to be worse than -Z plt=yes on x86-64 for all cases where the called functions are statically linked and <4 GiB from their call sites. it'd be nice if we could skip non_lazy_bind if we know the called function is to be statically linked. if your compiled artifact is >4 GiB .. i've heard of such things but have no idea what's best :)

"if we know the called function is to be statically linked" is the more annoying problem, though, because rustc-link-lib tells rustc only what libraries get what kind of linkage. especially on Unix-y platforms we don't know which of those libraries will provide a given symbol. the extern block can have a #[link(kind="static")] attribute, which i've used in this minimized example of the problem i'm talking about, and which almost seems like enough information to choose when to do this optimization at codegen-time. unfortunately, if the source file says #[link(name="util", kind="static")] extern "C" { pub fn foo(); }, and then you compile that source like rustc -l dylib=util ..., the command-line parameter simply overrides the link attribute and you end up with a dynamic link to foo with the (in context) reasonable ff 15 [GOT_entry] call.

because of the #[link]/-l KIND=NAME interaction i'm really not sure what to do here. i was going to initially suggest plumbing #[link(kind="static")] through to inform if nonlazybind is appropriate, but i had expected that conflicting link directives would at least produce an error. silently ending up with the command line argument is pretty unfortunate. does it seem reasonable to plumb the #[link] attribute as a hint, advise #[link(kind="static")] for statically linked functions, and make conflicting #[link] and -l arguments produce an error?
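as a rough model of the question, here's a sketch of the override behavior described above next to the proposed "conflicts are an error" behavior. everything in it (`LinkKind`, `resolve_today`, `resolve_proposed`, `may_skip_nonlazybind`) is a hypothetical name for illustration, not rustc's internals, and the fallback to `dylib` when neither side names a kind is an assumption mirroring the documented default for `#[link]`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LinkKind {
    Static,
    Dylib,
}

#[derive(Debug, PartialEq, Eq)]
struct Conflict {
    attr: LinkKind,
    cli: LinkKind,
}

/// Behavior as described in this issue: a kind given via `-l KIND=NAME`
/// silently wins over the kind in `#[link(kind = "...")]`.
fn resolve_today(attr: Option<LinkKind>, cli: Option<LinkKind>) -> LinkKind {
    cli.or(attr).unwrap_or(LinkKind::Dylib)
}

/// Proposed behavior (hypothetical): disagreeing kinds are an error.
fn resolve_proposed(
    attr: Option<LinkKind>,
    cli: Option<LinkKind>,
) -> Result<LinkKind, Conflict> {
    match (attr, cli) {
        (Some(a), Some(c)) if a != c => Err(Conflict { attr: a, cli: c }),
        _ => Ok(resolve_today(attr, cli)),
    }
}

/// Hypothetical codegen hint: only treat the callee as a candidate for a
/// direct call when it is known to be statically linked.
fn may_skip_nonlazybind(kind: LinkKind) -> bool {
    kind == LinkKind::Static
}

fn main() {
    let attr = Some(LinkKind::Static); // #[link(name = "util", kind = "static")]
    let cli = Some(LinkKind::Dylib); // rustc -l dylib=util

    // Today the attribute is overridden without any diagnostic.
    assert_eq!(resolve_today(attr, cli), LinkKind::Dylib);
    // Under the proposal the same inputs would be rejected.
    assert!(resolve_proposed(attr, cli).is_err());
    // With no conflicting flag, the attribute could drive the hint.
    let kind = resolve_proposed(attr, None).unwrap();
    assert!(may_skip_nonlazybind(kind));
}
```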


memorysafety/rav1d#1417 is a more substantive case which motivates this issue, where hot code is a collection of assembly routines that are statically linked. i've written a longer analysis about the case in that issue, but it's just more supporting information around the observation above.

worse, for code that is hot around an indirect call to a constant target, branch prediction quite effectively hides the cost of this indirect call. if the hot code is more like a large region of warm code, the branch prediction can end up evicted and these indirect calls to a constant local function become quite costly.

worse (pt2), LLVM reasonably tries to improve the indirect call situation by hoisting the GOT loads for repeated calls to the same target, which can cause register pressure and additional spills, and generally make this kind of unfortunate situation even worse.

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label May 29, 2025
@Noratrieb Noratrieb added A-linkage Area: linking into static, shared libraries and binaries I-slow Issue: Problems and improvements with respect to performance of generated code. A-codegen Area: Code generation O-x86_64 Target: x86-64 processors (like x86_64-*) (also known as amd64 and x64) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels May 29, 2025
@jieyouxu jieyouxu added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label May 29, 2025
@Noratrieb
Member

@nikic @bjorn3 @durin42

@bjorn3
Member

bjorn3 commented May 29, 2025

i expect -Z plt=yes to be better than -Z plt=no on x86-64 for all cases where the called functions are dynamically linked.

It shouldn't be. We enable RELRO, so we always do eager binding of all symbols, so using the PLT adds overhead for dynamically linked functions.

i also expect -Z plt=no to be worse than -Z plt=yes on x86-64 for all cases where the called functions are statically linked and <4 GiB from their call sites.

Yeah, -Zplt=no prevents relaxing calls to possibly imported functions to direct pcrel calls. Note that until -Zdefault-visibility=hidden becomes the default, all calls between object files need to be resolved by the dynamic linker as default visibility allows a dylib to override the symbol even for local calls. We can't make it the default until a fixed ld.bfd is old enough though.

@iximeow
Contributor Author

iximeow commented May 29, 2025

i expect -Z plt=yes to be ...

It shouldn't be.

i'd rephrased that along the way and effectively swapped yes/no so it was just backwards. sorry! what i meant to say is that on x86 i really cannot imagine a way in which the status quo is worse for dynamically linked functions, but it is always worse for statically linked functions.

i've swapped the yes and no to make this read correctly.

Note that until -Zdefault-visibility=hidden becomes the default ...

i don't follow, wouldn't this only apply if the statically linked symbols were produced via rustc? here, the statically linked code is from other .asm files. i would expect that in most cases of Rust code statically linking other libraries, the other libraries are probably from C with appropriate visibility modifiers.

also, #105518 looks like the default would be protected, not hidden?

@bjorn3
Member

bjorn3 commented May 29, 2025

i don't follow, wouldn't this only apply if the statically linked symbols were produced via rustc? here, the statically linked code is from other .asm files. i would expect that in most cases of Rust code statically linking other libraries, the other libraries are probably from C with appropriate visibility modifiers.

True, you are free to use protected visibility in the asm code (please don't use hidden visibility; that breaks with dylibs, as there is no guarantee that the Rust caller ends up in the same dylib), but -Zplt=yes would regress performance for Rust code a bit until the symbol visibility default changes.

also, #105518 looks like the default would be protected, not hidden?

Yes, though the same logic applies. Either hidden or protected visibility is enough for PLT relaxation to work.

@dramforever

Prompted by this being mentioned elsewhere I did some of my own investigation. The visibility stuff seems reasonable, but why is relaxation not kicking in?

It seems that rustc is not properly emitting relaxable (X at the end) R_X86_64{_REX,}_GOTPCRELX calls? This seems fixable but I don't know where one would start looking.
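For reference, a small sketch of the distinction I mean, using the relocation type numbers from the x86-64 psABI (the same values as in `elf.h`). The helper name `is_relaxable_got_reloc` is made up for illustration and is not an API of any linker or of rustc:

```rust
// Relocation type numbers as defined by the x86-64 psABI.
const R_X86_64_GOTPCREL: u32 = 9;
const R_X86_64_GOTPCRELX: u32 = 41;
const R_X86_64_REX_GOTPCRELX: u32 = 42;

/// True for the GOT-relative relocations that tell the linker it may rewrite
/// the instruction (the ones with the trailing X), false otherwise.
fn is_relaxable_got_reloc(r_type: u32) -> bool {
    matches!(r_type, R_X86_64_GOTPCRELX | R_X86_64_REX_GOTPCRELX)
}

fn main() {
    // A plain GOTPCREL relocation gives the linker no permission to relax.
    assert!(!is_relaxable_got_reloc(R_X86_64_GOTPCREL));
    assert!(is_relaxable_got_reloc(R_X86_64_GOTPCRELX));
    assert!(is_relaxable_got_reloc(R_X86_64_REX_GOTPCRELX));
}
```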

Another thing already mentioned is that LLVM trying to help by caching the address is unhelpful in the relaxable case. This also affects Clang. I don't know what we can do.

@nikic
Contributor

nikic commented Jun 1, 2025

@dramforever Because of broken linkers, see #115267. Possibly enough time has passed that enabling ELF relaxations would have less fallout now.

Edit: Nope, two years later there is still no new cross release, so we're going to see exactly the same issues.
