Performance regression in 0.44.0 for headers that (transitively) include AVX512F intrinsics #1465
I thought of a different way to go about creducing, and came up with a much smaller test case that still exhibits a substantial performance difference in the parsing step between the two variations:

#define __MMINTRIN_H
typedef long long __m64 __attribute__((__vector_size__(8)));
typedef long long __v1di __attribute__((__vector_size__(8)));
typedef int __v2si __attribute__((__vector_size__(8)));
typedef short __v4hi __attribute__((__vector_size__(8)));
typedef char __v8qi __attribute__((__vector_size__(8)));
#define __DEFAULT_FN_ATTRS __attribute__((__always_inline__, __nodebug__, __target__("mmx")))
static __inline__ void __DEFAULT_FN_ATTRS _mm_empty(void) { __builtin_ia32_emms(); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_cvtsi32_si64(int __i) { return (__m64)__builtin_ia32_vec_init_v2si(__i, 0); }
static __inline__ int __DEFAULT_FN_ATTRS _mm_cvtsi64_si32(__m64 __m) { return __builtin_ia32_vec_ext_v2si((__v2si)__m, 0); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_cvtsi64_m64(long long __i) { return (__m64)__i; }
long long __DEFAULT_FN_ATTRS _mm_cvtm64_si64(__m64 __m) { return (long long)__m; }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_packs_pi16(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_packsswb((__v4hi)__m1, (__v4hi)__m2); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_packs_pi32(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_packssdw((__v2si)__m1, (__v2si)__m2); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_packs_pu16(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_packuswb((__v4hi)__m1, (__v4hi)__m2); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_add_pi8(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_paddb((__v8qi)__m1, (__v8qi)__m2); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_add_pi16(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_paddw((__v4hi)__m1, (__v4hi)__m2); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_add_pi32(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_paddd((__v2si)__m1, (__v2si)__m2); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_adds_pi8(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_paddsb((__v8qi)__m1, (__v8qi)__m2); }
static __inline__ __m64 __DEFAULT_FN_ATTRS _mm_sub_pi8(__m64 __m1, __m64 __m2) { return (__m64)__builtin_ia32_psubb((__v8qi)__m1, (__v8qi)__m2); }
I'll note that it could probably get smaller and still exhibit a measurable performance difference; I just set a runtime cutoff for the "good" version to keep things statistically relevant.
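In case it helps others reproduce the measurement, here is a minimal timing harness, a sketch rather than the reporter's actual setup: it assumes the reduced header above is saved as `reduced.h` and that the `bindgen` crate is available as a dependency.

```rust
use std::time::Instant;

fn main() {
    // Hypothetical path to the reduced test case above; adjust as needed.
    let header = "reduced.h";

    let start = Instant::now();
    let bindings = bindgen::Builder::default()
        .header(header)
        .generate()
        .expect("bindgen failed to generate bindings");
    let elapsed = start.elapsed();

    bindings
        .write_to_file("bindings.rs")
        .expect("failed to write bindings");

    // Wall-clock time for parsing + generation; compare across bindgen versions.
    println!("generated bindings in {:?}", elapsed);
}
```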
Hi, thanks for the report, it's great! So this means this is a regression from d34e10f. I'm not really sure we can do much better to find unexposed attributes here... Also, we could do some profiling to see if we're doing something dumb; I'd be surprised if we weren't...
FWIW, just out of curiosity I took a quick look; here's what perf has to say about our parsing times for this test on my machine:
Instead of converting all the tokens to utf-8 beforehand, which is costly, and allocating a new vector unconditionally (on top of the one clang already allocates), just do the tokenization more lazily. There's actually only one place in the codebase which needs the utf-8 string; all the others can just work with the byte slice from clang. This should have no behavior change, other than being faster. In particular, this halves the time spent on my machine on the test case from rust-lang#1465. I'm not completely sure that this is going to be enough to make it acceptable, but we should probably do it regardless.
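For readers unfamiliar with the change, here is a standalone sketch of the lazy-tokenization idea described in that commit message. The types are hypothetical and do not reflect bindgen's or clang-sys's real internals; the point is simply keeping clang's raw bytes around and paying for UTF-8 validation only at the single call site that needs a `&str`.

```rust
/// Hypothetical stand-in for a token spelling as handed back by libclang.
struct RawToken<'a> {
    bytes: &'a [u8],
}

impl<'a> RawToken<'a> {
    /// Most consumers can compare against the raw bytes directly,
    /// without any allocation or validation.
    fn is(&self, keyword: &str) -> bool {
        self.bytes == keyword.as_bytes()
    }

    /// Only the one place that really needs a `&str` pays for
    /// UTF-8 validation, and only when it asks for it.
    fn spelling_utf8(&self) -> Option<&'a str> {
        std::str::from_utf8(self.bytes).ok()
    }
}

fn main() {
    let tok = RawToken { bytes: b"__attribute__" };
    assert!(tok.is("__attribute__"));
    assert_eq!(tok.spelling_utf8(), Some("__attribute__"));
}
```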
I opened #1466 with a partial fix; can you confirm it helps on your machine? I still think it may not be good enough and we may need to put the attribute detection behind a flag, but...
@emilio, thank you for your prompt attention! I really appreciate it. While I can confirm that it does help, I can also confirm your suspicion that it's not enough. I'm not too familiar with the clang API, so I'm afraid I can't help diagnose much past this, but it's clear there are some pretty nasty asymptotics with this addition. If we could use a flag to disable it, or if it were opt-in, that would be great.
Ok, I'll merge that since it should also help for cases where we're already tokenizing, like macros and such, and put the function attribute stuff behind a flag for now.
Given it was a considerable performance hit under some workloads. Closes rust-lang#1465.
#1467 should resolve this; just waiting for CI.
Thanks, this works great!
Note that bindgen needs `.enable_function_attribute_detection()` to process `__attribute__((__warn_unused_result__))`; it is off by default because parsing attributes can be really slow in certain cases. Benchmarks were run to confirm our case doesn't hit that issue. References: rust-lang/rust-bindgen#2149, rust-lang/rust-bindgen#1465, rust-lang/rust-bindgen#1466, rust-lang/rust-bindgen#1467
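For downstream users, opting in looks roughly like the sketch below with the builder API; the header name is a placeholder, and the equivalent CLI switch is `--enable-function-attribute-detection`.

```rust
fn main() {
    let bindings = bindgen::Builder::default()
        .header("wrapper.h") // placeholder header name
        // Opt into the (potentially slow) attribute scan so that e.g.
        // __attribute__((__warn_unused_result__)) is surfaced as #[must_use].
        .enable_function_attribute_detection()
        .generate()
        .expect("bindgen failed to generate bindings");

    bindings
        .write_to_file("bindings.rs")
        .expect("failed to write bindings");
}
```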
`bindgen` is able to detect certain function attributes and annotate functions correspondingly in its output for the Rust side when the `--enable-function-attribute-detection` flag is passed. In particular, it is currently able to use `__must_check` in C (`#[must_use]` in Rust), which gives us a bunch of annotations that are nice to have to prevent possible issues in Rust abstractions, e.g.:

 extern "C" {
+    #[must_use]
     pub fn kobject_add(
         kobj: *mut kobject,
         parent: *mut kobject,
         fmt: *const core::ffi::c_char,
         ...
     ) -> core::ffi::c_int;
 }

Apparently, there are edge cases where this can make generation very slow, which is why it is behind a flag [1], but it does not seem to affect us in any major way at the moment. Thus enable it.

Link: rust-lang/rust-bindgen#1465 [1]
Link: https://lore.kernel.org/rust-for-linux/CANiq72=u5Nrz_NW3U3_VqywJkD8pECA07q2pFDd1wjtXOWdkAQ@mail.gmail.com/
Reviewed-by: Alice Ryhl <[email protected]>
Tested-by: Alice Ryhl <[email protected]>
Reviewed-by: Gary Guo <[email protected]>
Acked-by: Danilo Krummrich <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Miguel Ojeda <[email protected]>
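As a quick, self-contained illustration of what the generated `#[must_use]` annotation buys on the caller side (this is plain example code, not bindgen output, and the function name is made up):

```rust
#[must_use]
fn add_object() -> i32 {
    // Stand-in for an FFI call that returns an error code.
    0
}

fn main() {
    // This bare call triggers the unused_must_use lint:
    // "unused return value of `add_object` that must be used".
    add_object();

    // Explicitly handling the result silences the lint.
    let ret = add_object();
    assert_eq!(ret, 0);
}
```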
Input C/C++ Header
Unfortunately, this report has `#include`s, because the issue disappears when run on the preprocessed input. To hopefully counteract that a bit, I'm happy to provide more system information, but this has been observed on three different machines of different Intel generations and underlying kernels. For my machine:

Bindgen Invocation
Using either the 0.44.0 release or the current version from `master`:

Actual Results
The bindings are successfully output, with timing output:
Expected Results
If I change the definition of `clang::Cursor::has_simple_attr()` to the following:

And run the same command:
Then the bindings are successfully output, but with this very different timing output: