std::tie is slower than expected for symbiflow benchmark tests #1225
Comments
Adding @kmurray to comment. Finally, the code is a bit nasty looking and difficult to comprehend. No action until you get the QoR data, but if it shows good QoR you'd need to comment it and make it as readable as you can. Could it be wrapped in a function and still maintain speed? |
I have compared the vtr_stdout.log before and after the change, and I did not see any QoR change there. I will run the QoR and Titan benchmark tests and see what the QoR results look like. I will create some inline functions and wrap the change into them. |
@hzeller - Could you help @xuqinziyue find a solution which doesn't require the nasty code? FYI - @litghost |
How does a primitive, but slightly more 'boring to read' solution like this measure, performance-wise:

    // needs <algorithm> and <initializer_list>
    bool operator<(const t_override& lhs, const t_override& rhs) {
        std::initializer_list<short> left = {lhs.from_type, lhs.to_type, lhs.from_class,
                                             lhs.to_class, lhs.delta_x, lhs.delta_y};
        std::initializer_list<short> right = {rhs.from_type, rhs.to_type, rhs.from_class,
                                              rhs.to_class, rhs.delta_x, rhs.delta_y};
        return std::lexicographical_compare(left.begin(), left.end(),
                                            right.begin(), right.end());
    }

It would also make it easier to extend later for anyone adding a field without having to bit-fiddle. |
@hzeller Placement time of the original version: 37.86s. So it seems changing std::tie to std::lexicographical_compare with an initializer_list does not provide much performance benefit, based on my measurement. |
Mmh, some thoughts to consider exploring: if the struct only contains these values and has no holes (which it does):

    struct t_override {
        short from_type;
        short to_type;
        short from_class;
        short to_class;
        short delta_x;
        short delta_y;
    };

Maybe we can treat it as an array of 6 int16_t and cast it to some … (Maybe CPUs/CPU vendors might have int16 vector comparison intrinsics — there I only see int32, but maybe there is int16 somewhere?) And if intrinsics are only available on some platforms, we can fall back to some 'slow' version with … Does signedness matter? (In your code you essentially treat them as … |
I would suggest you get out https://godbolt.org/ and see why the |
Oh yes, godbolt is fun. |
@hzeller |
@hzeller
With an implementation like this, the place time of picosoc is reduced from around 38s to around 36s, while gprof also reports around a 5% reduction in placement time. Do you think a solution like this is acceptable? I can also try more complicated solutions if necessary. |
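The implementation referred to above did not survive extraction. Judging from the follow-up below, which mentions "your manual loop (while leaving the array assignment the same)", it was presumably something along these lines (a sketch under that assumption, not the author's verbatim code):

```cpp
// t_override as quoted earlier in the thread.
struct t_override {
    short from_type;
    short to_type;
    short from_class;
    short to_class;
    short delta_x;
    short delta_y;
};

// Sketch of a manual comparison: copy the six fields into plain short arrays,
// then compare element by element, stopping at the first difference.
bool operator<(const t_override& lhs, const t_override& rhs) {
    const short left[6] = {lhs.from_type, lhs.to_type, lhs.from_class,
                           lhs.to_class, lhs.delta_x, lhs.delta_y};
    const short right[6] = {rhs.from_type, rhs.to_type, rhs.from_class,
                            rhs.to_class, rhs.delta_x, rhs.delta_y};
    for (int i = 0; i < 6; ++i) {
        if (left[i] != right[i])
            return left[i] < right[i];
    }
    return false; // all fields equal, so lhs is not less than rhs
}
```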
Let's try an even simpler solution first (the simpler, the easier to maintain in the future). Can you performance-compare your manual loop (while leaving the array assignment the same) with a call to std::lexicographical_compare? It should do the same thing, but that way we give the std library an opportunity to employ optimized versions that might be available on that system. So I suspect it will not be slower than the hand-written compare (but possibly faster), and it is at least shorter to write (and read). Another thing I'd try: we don't even have to copy the elements out of the struct, we could just treat it as the array itself (if the struct is packed, so maybe compile-assert the expected sizeof). |
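The exact call and compile-time assert were cut off above; a plausible sketch of that idea, assuming the t_override layout quoted earlier and no padding, might look like this:

```cpp
#include <algorithm>

// Guard the "treat the struct as an array" trick: if padding ever appears,
// this fails at compile time instead of silently comparing garbage.
static_assert(sizeof(t_override) == 6 * sizeof(short),
              "t_override must be densely packed for the cast below");

bool operator<(const t_override& lhs, const t_override& rhs) {
    // View each struct as six consecutive shorts and compare lexicographically.
    const short* left = reinterpret_cast<const short*>(&lhs);
    const short* right = reinterpret_cast<const short*>(&rhs);
    return std::lexicographical_compare(left, left + 6, right, right + 6);
}
```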
Looks like using … And casting the struct to pretend it is an array of short is better still. So I'd go with … |
It seems, based on https://godbolt.org/z/xVc53j, that the results should be very similar, but when I actually run the test, the manually implemented code is ~2s faster than lexicographical_compare. I have checked the assembly of vpr for both the manual implementation and lexicographical_compare, and it seems lexicographical_compare is not inlined properly even with the -O3 optimization flag. I found a function call like this within operator<: … and within lexicographical_compare_impl the loop is not unrolled. Using lexicographical_compare still reduces the place time for picosoc by around 2s though, so I am not sure which will be the preference. |
Did you also try casting the address of the struct to … |
The code I dumped the assembly for is:
and the place time for this version is around 36s (compared with 38s) |
So I'd say choose the fastest version you came across: if the manually written version is faster, then use that (and write a comment that … |
Based on my experiment, using … Thank you so much for your help! I will create a pull request and run some QoR benchmark tests later today! |
I think we should figure out why the operator is not being inlined and unrolled correctly. If it is not happening here, then there are probably other places where it should be happening that are getting missed. My naive guess, with no evidence, is that it is likely something to do with compile units? BTW, are we using LTO? |
I think LTO is currently used, since -flto is there, but I am not familiar with LTO, so the sample command for linking vpr is: … |
@xuqinziyue If the comparison method ends up in a separate compile unit from where it is used, then the compiler will be unable to inline things. |
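A generic illustration of that point (hypothetical file layout, not necessarily how VTR is actually organized):

```cpp
// place_delay_model.h (hypothetical excerpt)
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// Defined in the header, so every translation unit that includes it can see
// the body and inline the call. If only a declaration lived here and the body
// sat in a separate .cpp, callers in other compile units would have to go
// through a real function call unless LTO re-optimizes at link time.
inline bool operator<(const t_override& lhs, const t_override& rhs) {
    const short* left = reinterpret_cast<const short*>(&lhs);
    const short* right = reinterpret_cast<const short*>(&rhs);
    for (int i = 0; i < 6; ++i) {
        if (left[i] != right[i])
            return left[i] < right[i];
    }
    return false;
}
```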
@mithro |
Just FYI - that function call is the following according to |
Just for more clarification: I don't think there is a way to force the compiler to always inline a function in libraries. I haven't found anything that suggests std::lexicographical_compare is defined to be an inline function, and even if a function is defined as inline within the translation unit, the compiler can still decide not to inline it. |
Easton: just discussing this in a meeting. … |
I have done some experiments with always_inline. When operator< is implemented as follows: …
the operator< and the lexicographical_compare seem to be inlined properly by the compiler, and the performance is similar to the manual implementation. However, when I change the implementation to …
the operator< is not inlined properly again, and there is no reduction in execution time. I have collected QoR for the manual implementation and it looks like this: | original_results.txt | modified_results.txt. I can collect QoR and Titan results for the inline change and try the build types to see how it goes. |
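The two implementations compared in the comment above were lost in extraction. A plausible shape of the variant that did get inlined, assumed from the surrounding discussion rather than copied from the actual patch, is a header-defined friend comparison marked with GCC's always_inline attribute:

```cpp
#include <algorithm>

struct t_override {
    short from_type;
    short to_type;
    short from_class;
    short to_class;
    short delta_x;
    short delta_y;

    // always_inline asks GCC/Clang to inline this call even when the inliner's
    // heuristics would otherwise skip it. Relies on the struct being exactly
    // six shorts with no padding.
    friend __attribute__((always_inline)) inline bool operator<(const t_override& lhs,
                                                                const t_override& rhs) {
        const short* left = reinterpret_cast<const short*>(&lhs);
        const short* right = reinterpret_cast<const short*>(&rhs);
        return std::lexicographical_compare(left, left + 6, right, right + 6);
    }
};
```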
Thanks Easton. That code is significantly easier to read (would still need a clear comment explaining what it does and why it is coded that way). No idea why it doesn't inline as a friend function but as long as it inlines with one option we hopefully get the cpu improvement. Getting full QoR data makes sense. |
Updated QoR results for the inline change with the vtr master branch as of March 18. It seems the modified code generally runs faster this time, maybe due to the machine (there should be no change in pack time and route time), but the change in place time is larger, which indicates the effect of function inlining: | original_results.txt | modified_results.txt |
An update on the Titan results for the inline change: the following 3 tests did not run due to running out of disk space, so the data collected is based on the other tests: | titan_original_results.txt | titan_modified_results.txt The Titan results also suggest around a 5% reduction in place time. |
Looks good to me. I think you just have to write some nice comments and commit. |
Let's land your improvement. I do think we should investigate further what is going on with … BTW, to create a way to use … it looks like it should also be supported by a modern clang too -> https://clang.llvm.org/docs/LanguageExtensions.html#has-attribute |
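A sketch of the kind of feature-test wrapper the __has_attribute link points at (the macro name VTR_ALWAYS_INLINE is made up here purely for illustration):

```cpp
// Compilers without __has_attribute get a stub that always answers "no".
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif

// Expand to a forced-inline specifier where the attribute exists, and to a
// plain inline otherwise.
#if __has_attribute(always_inline)
#define VTR_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define VTR_ALWAYS_INLINE inline
#endif
```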
This issue has been inactive for a year and has been marked as stale. It will be closed in 15 days if it continues to be stale. If you believe this is still an issue, please add a comment. |
The always_inline solution was implemented and merged, so this issue is closed. |
comparison using std::tie in operator<(const t_override& lhs, const t_override& rhs) for struct t_override in place_delay_model.h seems to be slower than expected
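For reference, a sketch of the std::tie-based comparison being described, reconstructed from the operator signature above and the struct definition quoted in the comments (the exact code in place_delay_model.h may differ slightly):

```cpp
#include <tuple>

bool operator<(const t_override& lhs, const t_override& rhs) {
    // std::tie builds tuples of references to the fields and compares them
    // lexicographically, field by field.
    return std::tie(lhs.from_type, lhs.to_type, lhs.from_class,
                    lhs.to_class, lhs.delta_x, lhs.delta_y)
         < std::tie(rhs.from_type, rhs.to_type, rhs.from_class,
                    rhs.to_class, rhs.delta_x, rhs.delta_y);
}
```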
Expected Behaviour
The function is supposed to only do several comparisons between short numbers, so it is not expected to consume much execution time.
Current Behaviour
gprof shows the function consuming more than 8% of place time on the picosoc benchmark test for symbiflow.
Possible Solution
I replaced the std::tie with the following code … and there seems to be around a 9% reduction in place time for the xc7 tests I measured:
| Benchmark | Original place time (s) | New place time (s) |
| --- | --- | --- |
| top_bram_n1 | 2.39 | 2.15 |
| top_bram_n2 | 3.97 | 3.68 |
| top_bram_n3 | 6.09 | 5.51 |
| murax_basys | 6.3 | 5.69 |
| picosoc_basys3 | 38.7 | 35.11 |
Your Environment