std::tie is slower than expected for symbiflow benchmark tests #1225


Closed
xuqinziyue opened this issue Mar 20, 2020 · 34 comments

@xuqinziyue
Contributor

The comparison using std::tie in operator<(const t_override& lhs, const t_override& rhs) for struct t_override in place_delay_model.h seems to be slower than expected.

Expected Behaviour

The function is only supposed to perform a few comparisons between short integers, so it is not expected to consume much execution time.

Current Behaviour

gprof shows the function consuming more than 8% of place time on the picosoc benchmark test for symbiflow.

Possible Solution

I replaced the std::tie with the following code, and there seems to be around a 9% reduction in place time for the xc7 tests I measured:

        unsigned long long temp_left = (unsigned long long)lhs.from_type << 48
                                     | (unsigned long long)lhs.to_type << 32
                                     | (unsigned long long)lhs.from_class << 16
                                     | (unsigned long long)lhs.to_class;
        unsigned long long temp_right = (unsigned long long)rhs.from_type << 48
                                      | (unsigned long long)rhs.to_type << 32
                                      | (unsigned long long)rhs.from_class << 16
                                      | (unsigned long long)rhs.to_class;
        if (temp_left < temp_right) {
            return true;
        } else if (temp_left == temp_right) {
            temp_left = (unsigned)lhs.delta_x << 16 | (unsigned)lhs.delta_y;
            temp_right = (unsigned)rhs.delta_x << 16 | (unsigned)rhs.delta_y;
            return temp_left < temp_right;
        } else {
            return false;
        }

| Benchmark | Original place time (s) | New place time (s) |
|---|---|---|
| top_bram_n1 | 2.39 | 2.15 |
| top_bram_n2 | 3.97 | 3.68 |
| top_bram_n3 | 6.09 | 5.51 |
| murax_basys | 6.3 | 5.69 |
| picosoc_basys3 | 38.7 | 35.11 |
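For reference, the replacement ordering can be checked against the original std::tie ordering. Below is a self-contained sketch (the t_override definition is a stand-in mirroring the six short fields named in the issue, for illustration only); the two agree for non-negative field values, and the signedness caveat comes up later in the thread:

```cpp
#include <cassert>
#include <tuple>

// Stand-in for t_override from place_delay_model.h (assumed field layout).
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// The bit-packing comparison from the snippet above, wrapped in a function.
bool packed_less(const t_override& lhs, const t_override& rhs) {
    unsigned long long temp_left = (unsigned long long)lhs.from_type << 48
                                 | (unsigned long long)lhs.to_type << 32
                                 | (unsigned long long)lhs.from_class << 16
                                 | (unsigned long long)lhs.to_class;
    unsigned long long temp_right = (unsigned long long)rhs.from_type << 48
                                  | (unsigned long long)rhs.to_type << 32
                                  | (unsigned long long)rhs.from_class << 16
                                  | (unsigned long long)rhs.to_class;
    if (temp_left != temp_right) return temp_left < temp_right;
    unsigned l = (unsigned)lhs.delta_x << 16 | (unsigned)lhs.delta_y;
    unsigned r = (unsigned)rhs.delta_x << 16 | (unsigned)rhs.delta_y;
    return l < r;
}

// The original std::tie comparison, used to check that the two orderings agree.
bool tie_less(const t_override& lhs, const t_override& rhs) {
    return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
         < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
}
```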

Your Environment

  • VTR revision used: symbiflow/vtr with commit number 7351b7c
  • Operating System and version: linux
@vaughnbetz
Contributor

Adding @kmurray to comment.
Easton, are other QoR metrics for Symbiflow unaffected? (They should be, but best to check.)
The next step is to turn on the placement_delay override for the VTR and Titan benchmarks (see https://docs.verilogtorouting.org/en/latest/vpr/command_line_usage/) as it is a non-default option, and see what impact there is on place time (reduction hopefully) and other metrics (should be none).

Finally, the code is a bit nasty looking and difficult to comprehend. No action until you get the QoR data, but if it shows good QoR you'd need to comment it and make it as readable as you can. Could it be wrapped in a function and still maintain speed?

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 20, 2020

I have compared the vtr_stdout.log before and after the change, and I did not see any QoR change there. I will run the QoR and Titan benchmark tests to see what the QoR results are, and I will create some inline functions and wrap the change into them.

@mithro
Contributor

mithro commented Mar 20, 2020

@hzeller - Could you help @xuqinziyue find a solution which doesn't require the nasty code?

FYI - @litghost

@hzeller
Contributor

hzeller commented Mar 20, 2020

How does a primitive, but slightly more 'boring to read', solution like this measure, performance-wise:

bool operator<(const t_override& lhs, const t_override& rhs) {
  std::initializer_list<short> left  = { lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y };
  std::initializer_list<short> right = { rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y };
  return std::lexicographical_compare(left.begin(), left.end(),
                                      right.begin(), right.end());
}

It would also make it easier to extend later for anyone adding a field without having to bitfiddle.
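For completeness, here is a runnable check (with a stand-in t_override whose field names are assumed from the issue) that this lexicographical version orders values the same way as std::tie, including for negative shorts:

```cpp
#include <algorithm>
#include <initializer_list>
#include <tuple>

// Stand-in struct for illustration; the real one lives in place_delay_model.h.
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// The initializer_list-based comparison proposed above.
bool list_less(const t_override& lhs, const t_override& rhs) {
    std::initializer_list<short> left  = {lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y};
    std::initializer_list<short> right = {rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y};
    return std::lexicographical_compare(left.begin(), left.end(), right.begin(), right.end());
}

// std::tie ordering, for comparison.
bool tie_less(const t_override& lhs, const t_override& rhs) {
    return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
         < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
}
```

Both compare the short values with their signed ordering, so they stay interchangeable even when deltas go negative.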

@xuqinziyue
Contributor Author

@hzeller
I have just done a quick measurement with picosoc and here's the result:

placement time of original result: 37.86
placement time of using initializer_list: 37.03
placement time with the code I posted above: 34.95

so it seems changing std::tie to std::lexicographical_compare & initializer_list does not provide much performance benefit based on my measurement.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Mmh, some thoughts to consider exploring: if the struct contains only these values and no holes (which it does):

struct t_override {
        short from_type;
        short to_type;
        short from_class;
        short to_class;
        short delta_x;
        short delta_y;
};

Maybe we can treat it as an array of 6 int16_t, cast it to an int16_t*, and do some trickery with some relative of memcmp() that can deal with short?
(I'd make the struct packed then, to make sure the layout is as we expect.)

(Maybe wmemcmp()? Though it might be worthwhile to closely test the assumptions w.r.t. signedness and sizeof(wchar_t) == sizeof(short), which might not hold on typical platforms these days.)

CPUs/CPU vendors might have int16 vector comparison intrinsics (there I only see int32, but maybe there is int16 somewhere?).

And if there are only intrinsics available on some platforms, we can fall back to some 'slow' version with std::tie.

Does signedness matter? (In your code you essentially treat them as unsigned short.)
If not, then pretending this is an array of three uint32_t might work?

@mithro
Contributor

mithro commented Mar 21, 2020

I would suggest you get out https://godbolt.org/ and see why the std::tie version and the faster proposed method don't end up with a similar set of assembly instructions.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Oh yes, godbolt is fun.
I suspect the std::tie and initializer-list versions create all kinds of branches, while the data-assembly version does better in this regard, even though it has to slurp in the whole data vs. only up to the first mismatch in the other case.

@xuqinziyue
Contributor Author

@hzeller
Thanks for pointing out the signedness issue. My code above only works if the two numbers have the same sign (both positive or both negative), which means in some cases the implementation won't work. Also thanks for your other suggestions; I will do some measurements to see which option works better.
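The pitfall can be demonstrated in a few lines: reinterpreting a negative short as an unsigned value, as the packing code does, flips its ordering relative to a positive one:

```cpp
#include <cassert>

// Signed vs. unsigned ordering of the same bit patterns.
bool signed_less(short a, short b) {
    return a < b;  // ordinary signed comparison
}

bool unsigned_less(short a, short b) {
    // The cast used when packing fields into an unsigned key:
    // -1 becomes 65535, so negative values sort above positive ones.
    return (unsigned short)a < (unsigned short)b;
}
```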

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 21, 2020

@hzeller
I tried some simple solutions like this:

      short left[6]  = {lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y};
      short right[6] = {rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y};
      unsigned int i = 0;
      while (i < 6) {
          if (left[i] == right[i]) {
              i++;
          } else if (left[i] < right[i]) {
              return true;
          } else {
              return false;
          }
      }
      return false;

With an implementation like this, the place time of picosoc is reduced from around 38s to around 36s, and gprof also reports around a 5% reduction in placement time. Do you think a solution like this is acceptable? I can also try more complicated solutions if necessary.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Let's try an even simpler solution first (the simpler, the easier to maintain in the future).

Can you performance compare your manual loop (while leaving the array assignment the same) with a call to

std::lexicographical_compare(left, left+6, right, right+6)

? It should do the same thing, but that way we give the standard library an opportunity to employ optimized versions that might be available on that system. So I suspect it will not be slower than the hand-written compare (but possibly faster), and it is at least shorter to write (and read).

Another thing I'd try: we don't even have to copy the elements out of the struct; we could just treat the struct as the array itself (if the struct is packed, so maybe compile-time assert that sizeof(t_override) == 6 * sizeof(short); if not, we actually want to make it a packed struct). We also assume that the compiler lays the fields out in exactly the sequence we give them; it is not strictly required to do so, but given that they are all the same type, it probably will.

   const short *left = reinterpret_cast<const short*>(&lhs);
   const short *right = reinterpret_cast<const short*>(&rhs);
   return std::lexicographical_compare(left, left+6, right, right+6);
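A sketch of the suggestion above with the compile-time layout guard in place (the t_override definition is a stand-in; the strict-aliasing implications of the cast are exactly the "trickery" under discussion, so this is illustrative rather than an endorsement):

```cpp
#include <algorithm>

// Stand-in struct; the real definition is in place_delay_model.h.
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// Guard the layout assumption: no padding, so the short* view covers
// exactly the six fields in declaration order.
static_assert(sizeof(t_override) == 6 * sizeof(short),
              "t_override must be exactly six shorts for the cast below");

bool operator<(const t_override& lhs, const t_override& rhs) {
    const short* left = reinterpret_cast<const short*>(&lhs);
    const short* right = reinterpret_cast<const short*>(&rhs);
    return std::lexicographical_compare(left, left + 6, right, right + 6);
}
```

The static_assert turns a silent layout change (e.g. a new field of a different width) into a compile error instead of a subtly wrong comparison.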

@hzeller
Contributor

hzeller commented Mar 21, 2020

Looks like using std::lexicographical_compare vs. manual compare is very similar, in both cases the loop is unrolled with -O3.

And casting the struct to pretend it is an array of short is better still. So I'd go with reinterpret_cast<const short*>(&...) and std::lexicographical_compare()

https://godbolt.org/z/xVc53j

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 21, 2020

Based on https://godbolt.org/z/xVc53j it seems the results should be very similar, but when I actually run the test, the manually implemented code is ~2s faster than lexicographical_compare.

I have checked the assembly of vpr for the manual implementation and for lexicographical_compare, and it seems lexicographical_compare is not inlined properly even with the -O3 optimization flag.

I found a function call like this within the operator<:
callq 5455c0 <ZSt30__lexicographical_compare_implIPKsS1_N9__gnu_cxx5__ops15_Iter_less_iterEEbT_S5_T0_S6_T1.isra.96.lto_priv.4077>

and within lexicographical_compare_impl, the loop is not unrolled.

Using lexicographical_compare still reduces the place time for picosoc by around 2s, though, so I am not sure which will be preferred.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Did you also try casting the address of the struct to const short*?

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 21, 2020

The code I dumped the assembly for is:

const short *left = reinterpret_cast<const short*>(&lhs);
const short *right = reinterpret_cast<const short*>(&rhs);
return std::lexicographical_compare(left, left+6, right, right+6);

and the place time for this version is around 36s (compared with 38s)

@hzeller
Contributor

hzeller commented Mar 21, 2020

So I'd say choose the fastest version you came across: if the manually written version is faster, use that (and add a comment that std::lexicographical_compare was considered but slightly slower). If const short* casting is better than copy-to-array, use that.

@xuqinziyue
Contributor Author

Based on my experiments, const short* casting is faster than copy-to-array, so the fastest version so far is const short* casting + manual comparison.

Thank you so much for your help! I will create a pull request and run some qor benchmark tests later today!

@mithro
Contributor

mithro commented Mar 21, 2020

@xuqinziyue

> I have checked the assembly file of vpr for the manually implementation and lexicographical_compare and it seems lexicographical_compare is not inlined properly even with -O3 optimization flag.

I think we should figure out why the operator is not being inlined and unrolled correctly. If it is not happening here, then there are probably other places where it should be happening that are getting missed.

My naive guess, with no evidence, is that it is likely something to do with compile units. BTW, are we using LTO?

@xuqinziyue
Contributor Author

> I think we should figure out why using the operator is not inlining and unrolling correctly. If it is not happening here then there are probably other places that it should be happening that are getting missed.
>
> My naive guess with no evidence is that it is likely to be something to do with compile units? BTW Are we using LTO?

I think LTO is currently used, since -flto is there, but I am not familiar with LTO, so here is the sample command for linking vpr:

cd /scratch/local/xuqinziy/vtr-verilog-to-routing/build/vpr && /tools/cmake/install/3.9.0/bin/cmake -E cmake_link_script CMakeFiles/vpr.dir/link.txt --verbose=1
/usr/bin/c++ -O3 -DNDEBUG -flto -fno-fat-lto-objects CMakeFiles/vpr.dir/src/main.cpp.o CMakeFiles/vpr.dir/resources.C.o -o vpr libvpr.a ../libs/libarchfpga/libarchfpga.a ../libs/libpugiutil/libpugiutil.a ../libs/EXTERNAL/libsdcparse/libsdcparse.a ../libs/EXTERNAL/libblifparse/libblifparse.a ../libs/EXTERNAL/libtatum/libtatum/libtatum.a ../libs/EXTERNAL/libargparse/libargparse.a ../libs/EXTERNAL/libpugixml/libpugixml.a ../libs/EXTERNAL/libezgl/libezgl.a -lgtk-3 -lgdk-3 -latk-1.0 -lgio-2.0 -lpangocairo-1.0 -lgdk_pixbuf-2.0 -lcairo-gobject -lpango-1.0 -lcairo -lgobject-2.0 -lglib-2.0 -lX11 ../libs/libvtrcapnproto/liblibvtrcapnproto.a ../libs/libvtrutil/libvtrutil.a ../libs/liblog/liblog.a ../libs/EXTERNAL/capnproto/c++/src/capnp/libcapnp.a ../libs/EXTERNAL/capnproto/c++/src/kj/libkj.a -lpthread

@mithro
Contributor

mithro commented Mar 22, 2020

@xuqinziyue If the comparison method ends up in a separate compile unit from where it is used, then the compiler will be unable to inline things.

@xuqinziyue
Contributor Author

@mithro
The function that is not inlined is std::lexicographical_compare, and I am not sure what rules apply to a standard library function like this.

@mithro
Contributor

mithro commented Mar 22, 2020

Just FYI - that function call is the following according to c++filt
<bool std::__lexicographical_compare_impl<short const*, short const*, __gnu_cxx::__ops::_Iter_less_iter>(short const*, short const*, short const*, short const*, __gnu_cxx::__ops::_Iter_less_iter) [clone .isra.96] [clone .lto_priv.4077]>

@xuqinziyue
Contributor Author

Just for more clarification: I don't think there is a way to force the compiler to always inline a function from a library. I haven't found anything that suggests std::lexicographical_compare is defined to be an inline function, and even if a function is declared inline within the translation unit, the compiler can still choose not to inline it.
So isn't it the compiler's choice whether to inline the function or not?

@mithro
Contributor

mithro commented Mar 22, 2020

/* Prototype.  */
inline void foo (const char) __attribute__((always_inline));

@vaughnbetz
Contributor

Easton: we were just discussing this in a meeting.
There is still confusion about why this isn't inlined properly by the compiler; ideally the fix would be to get it inlined. Maybe there is a bigger gain to be had if we figure out why inlining isn't happening (if it is a general issue)?

  1. Does Tim's always_inline work?
  2. Try make build_type = release_pgo to see if that also captures the cpu gains.
  3. Is this also an issue on bigger designs (VTR, Titan)? We need QoR data for them.

@xuqinziyue
Contributor Author

I have done some experiment with always_inline:

When operator< is implemented as follows:

 friend inline bool operator<(const t_override& lhs, const t_override& rhs) __attribute__((always_inline)) {
     const short* left = reinterpret_cast<const short*>(&lhs);
     const short* right = reinterpret_cast<const short*>(&rhs);
     return std::lexicographical_compare(left, left + 6, right, right + 6);
 }

The operator< and the lexicographical_compare seem to be inlined properly by the compiler, and the performance is similar to the manual implementation.

However when I change the implementation to

 friend inline bool operator<(const t_override& lhs, const t_override& rhs) __attribute__((always_inline)) {
     return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
          < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
 }

The operator< is not inlined properly again, and there is no reduction in execution time.

I have collected QoR for the manual implementation and it looks like this:

| Metric | original_results.txt | modified_results.txt |
|---|---|---|
| vtr_flow_elapsed_time | 1 | 0.997897361 |
| odin_synth_time | 1 | 0.939841239 |
| abc_depth | 1 | 1 |
| abc_synth_time | 1 | 1.058311689 |
| num_clb | 1 | 1 |
| num_memories | 1 | 1 |
| num_mult | 1 | 1 |
| max_vpr_mem | 1 | 1.001470174 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 0.981313116 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.950503869 |
| placed_CPD_est | 1 | 1 |
| min_chan_width | 1 | 1 |
| routed_wirelength | 1 | 1 |
| min_chan_width_route_time | 1 | 1.01539243 |
| crit_path_routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 0.991334014 |

I can collect QoR and Titan results for the inline change and try the build types to see how it goes.

@vaughnbetz
Contributor

Thanks Easton. That code is significantly easier to read (it would still need a clear comment explaining what it does and why it is coded that way). No idea why it doesn't inline as a friend function, but as long as it inlines with one option we hopefully get the CPU improvement. Getting full QoR data makes sense.

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 31, 2020

Updated QoR results for the inline change with the vtr master branch as of March 18. It seems the modified code generally runs faster this time, maybe due to the machine (there should be no change in pack time and route time), but the change in place time is larger, which indicates the effect of function inlining.

| Metric | original_results.txt | modified_results.txt |
|---|---|---|
| vtr_flow_elapsed_time | 1 | 0.98708214 |
| odin_synth_time | 1 | 1.078524821 |
| abc_depth | 1 | 1 |
| abc_synth_time | 1 | 0.986797326 |
| num_clb | 1 | 1 |
| num_memories | 1 | 1 |
| num_mult | 1 | 1 |
| max_vpr_mem | 1 | 1.000129628 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 0.974220108 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.94762915 |
| placed_CPD_est | 1 | 1 |
| min_chan_width | 1 | 1 |
| routed_wirelength | 1 | 1 |
| min_chan_width_route_time | 1 | 1.01647566 |
| crit_path_routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 0.969925067 |

@xuqinziyue
Contributor Author

An update for Titan results for the inline change:

The following 3 tests did not run due to running out of disk space; the data collected is based on the other tests:
segmentation_stratixiv_arch_timing.blif
SLAM_spheric_stratixiv_arch_timing.blif
des90_stratixiv_arch_timing.blif

| Metric | titan_original_results.txt | titan_modified_results.txt |
|---|---|---|
| vtr_flow_elapsed_time | 1 | 0.975597952 |
| num_LAB | 1 | 1 |
| num_DSP | 1 | 1 |
| num_M9K | 1 | 1 |
| num_M144K | 1 | 1 |
| max_vpr_mem | 1 | 1.000010433 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 1.01225225 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.954590719 |
| placed_CPD_est | 1 | 1 |
| routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 1.027508016 |

The Titan results also suggest an approximately 5% reduction in place time.

@vaughnbetz
Contributor

Looks good to me. I think you just have to write some nice comments and commit.

@mithro
Contributor

mithro commented Apr 4, 2020

Let's land your improvement.

I do think we should investigate further what is going on with the std::tie behavior you mentioned here -> #1225 (comment), but that is a task for someone in the future.

BTW, to use the always_inline attribute in a portable fashion, see the example in the Chromium source code at https://chromium.googlesource.com/chromium/src/base/+/refs/heads/master/compiler_specific.h#42

It looks like it should also be supported by modern clang -> https://clang.llvm.org/docs/LanguageExtensions.html#has-attribute
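A minimal sketch of such a portability shim, in the spirit of Chromium's compiler_specific.h (the macro name VTR_ALWAYS_INLINE is hypothetical, not Chromium's or VTR's actual name):

```cpp
// Portable always-inline macro: GCC/Clang use the attribute, MSVC uses
// __forceinline, and anything else falls back to plain inline.
// VTR_ALWAYS_INLINE is a made-up name for illustration.
#if defined(__clang__) || defined(__GNUC__)
#define VTR_ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define VTR_ALWAYS_INLINE __forceinline
#else
#define VTR_ALWAYS_INLINE inline
#endif

// Example use: a small comparison helper the compiler must inline.
VTR_ALWAYS_INLINE bool less_short(short a, short b) {
    return a < b;
}
```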


github-actions bot commented May 2, 2025

This issue has been inactive for a year and has been marked as stale. It will be closed in 15 days if it continues to be stale. If you believe this is still an issue, please add a comment.

@vaughnbetz
Contributor

The always_inline solution was implemented and merged, so this issue is closed.
