std::tie is slower than expected for symbiflow benchmark tests #1225


Closed
xuqinziyue opened this issue Mar 20, 2020 · 34 comments

@xuqinziyue
Contributor

The comparison using std::tie in operator<(const t_override& lhs, const t_override& rhs) for struct t_override in place_delay_model.h seems to be slower than expected.

Expected Behaviour

The function is only supposed to perform a few comparisons between short integers, so it is not expected to consume much execution time.

Current Behaviour

gprof shows the function consuming more than 8% of place time on the picosoc benchmark test for symbiflow.

Possible Solution

I replaced the std::tie with the following code, and there seems to be around a 9% reduction in place time for the xc7 tests I measured:

        unsigned long long temp_left = (unsigned long long)lhs.from_type << 48
                                     | (unsigned long long)lhs.to_type << 32
                                     | (unsigned long long)lhs.from_class << 16
                                     | (unsigned long long)lhs.to_class;
        unsigned long long temp_right = (unsigned long long)rhs.from_type << 48
                                      | (unsigned long long)rhs.to_type << 32
                                      | (unsigned long long)rhs.from_class << 16
                                      | (unsigned long long)rhs.to_class;
        if (temp_left < temp_right) {
            return true;
        } else if (temp_left == temp_right) {
            temp_left = (unsigned)lhs.delta_x << 16 | (unsigned)lhs.delta_y;
            temp_right = (unsigned)rhs.delta_x << 16 | (unsigned)rhs.delta_y;
            return temp_left < temp_right;
        } else {
            return false;
        }

| Benchmark | Original place time (s) | New place time (s) |
|---|---|---|
| top_bram_n1 | 2.39 | 2.15 |
| top_bram_n2 | 3.97 | 3.68 |
| top_bram_n3 | 6.09 | 5.51 |
| murax_basys | 6.3 | 5.69 |
| picosoc_basys3 | 38.7 | 35.11 |
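For reference, the replacement ordering can be checked against the original std::tie ordering. Below is a self-contained sketch (the t_override definition is a stand-in mirroring the six short fields named in the issue, for illustration only); the two agree for non-negative field values, and the signedness caveat comes up later in the thread:

```cpp
#include <cassert>
#include <tuple>

// Stand-in for t_override from place_delay_model.h (assumed field layout).
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// The bit-packing comparison from the snippet above, wrapped in a function.
bool packed_less(const t_override& lhs, const t_override& rhs) {
    unsigned long long temp_left = (unsigned long long)lhs.from_type << 48
                                 | (unsigned long long)lhs.to_type << 32
                                 | (unsigned long long)lhs.from_class << 16
                                 | (unsigned long long)lhs.to_class;
    unsigned long long temp_right = (unsigned long long)rhs.from_type << 48
                                  | (unsigned long long)rhs.to_type << 32
                                  | (unsigned long long)rhs.from_class << 16
                                  | (unsigned long long)rhs.to_class;
    if (temp_left != temp_right) return temp_left < temp_right;
    unsigned l = (unsigned)lhs.delta_x << 16 | (unsigned)lhs.delta_y;
    unsigned r = (unsigned)rhs.delta_x << 16 | (unsigned)rhs.delta_y;
    return l < r;
}

// The original std::tie comparison, used to check that the two orderings agree.
bool tie_less(const t_override& lhs, const t_override& rhs) {
    return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
         < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
}
```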

Your Environment

  • VTR revision used: symbiflow/vtr with commit number 7351b7c
  • Operating System and version: linux
@vaughnbetz
Contributor

Adding @kmurray to comment.
Easton, are other QoR metrics for Symbiflow unaffected? (They should be, but best to check.)
The next step is to turn on the placement_delay override for the VTR and Titan benchmarks (see https://docs.verilogtorouting.org/en/latest/vpr/command_line_usage/) as it is a non-default option, and see what impact there is on place time (reduction hopefully) and other metrics (should be none).

Finally, the code is a bit nasty looking and difficult to comprehend. No action until you get the QoR data, but if it shows good QoR you'd need to comment it and make it as readable as you can. Could it be wrapped in a function and still maintain speed?

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 20, 2020

I have compared the vtr_stdout.log before and after the change, and I did not see any QoR change there. I will run the QoR and Titan benchmark tests to see what the QoR results are, and I will create some inline functions and wrap the change into them.

@mithro
Contributor

mithro commented Mar 20, 2020

@hzeller - Could you help @xuqinziyue find a solution which doesn't require the nasty code?

FYI - @litghost

@hzeller
Contributor

hzeller commented Mar 20, 2020

How does a primitive, but slightly more 'boring to read', solution like this measure, performance-wise:

bool operator<(const t_override& lhs, const t_override& rhs) {
  std::initializer_list<short> left  = { lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y };
  std::initializer_list<short> right = { rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y };
  return std::lexicographical_compare(left.begin(), left.end(),
                                      right.begin(), right.end());
}

It would also make it easier to extend later for anyone adding a field without having to bitfiddle.
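For completeness, here is a runnable check (with a stand-in t_override whose field names are assumed from the issue) that this lexicographical version orders values the same way as std::tie, including for negative shorts:

```cpp
#include <algorithm>
#include <initializer_list>
#include <tuple>

// Stand-in struct for illustration; the real one lives in place_delay_model.h.
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// The initializer_list-based comparison proposed above.
bool list_less(const t_override& lhs, const t_override& rhs) {
    std::initializer_list<short> left  = {lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y};
    std::initializer_list<short> right = {rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y};
    return std::lexicographical_compare(left.begin(), left.end(), right.begin(), right.end());
}

// std::tie ordering, for comparison.
bool tie_less(const t_override& lhs, const t_override& rhs) {
    return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
         < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
}
```

Both compare the short values with their signed ordering, so they stay interchangeable even when deltas go negative.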

@xuqinziyue
Contributor Author

@hzeller
I have just done a quick measurement with picosoc and here's the result:

placement time of original result: 37.86
placement time of using initializer_list: 37.03
placement time with the code I posted above: 34.95

so it seems changing std::tie to std::lexicographical_compare & initializer_list does not provide much performance benefit based on my measurement.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Mmh, some thoughts to consider exploring: if the struct contains only these values and no holes (which it does):

struct t_override {
        short from_type;
        short to_type;
        short from_class;
        short to_class;
        short delta_x;
        short delta_y;
};

Maybe we can treat it as an array of 6 int16_t, cast it to an int16_t*, and do some trickery with some relative of memcmp() that can deal with short?
(I'd make the struct packed then, to make sure the layout is as we expect.)

(Maybe wmemcmp()? Though it might be worthwhile to closely test the assumptions w.r.t. signedness and sizeof(wchar_t) == sizeof(short), which might not hold on typical platforms these days.)

CPUs/CPU vendors might have int16 vector comparison intrinsics (there I only see int32, but maybe there is int16 somewhere?).

And if there are only intrinsics available on some platforms, we can fall back to some 'slow' version with std::tie.

Does signedness matter? (In your code you essentially treat them as unsigned short.)
If not, then pretending this is an array of three uint32_t might work?

@mithro
Contributor

mithro commented Mar 21, 2020

I would suggest you get out https://godbolt.org/ and see why the std::tie version and the faster proposed method don't end up with a similar set of assembly instructions.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Oh yes, godbolt is fun.
I suspect the std::tie and initializer-list versions create all kinds of branches, while the data-assembly version does better in this regard, even though it has to slurp in the whole data vs. only up to the first mismatch in the other case.

@xuqinziyue
Contributor Author

@hzeller
Thanks for pointing out the signedness issue. My code above only works if the two numbers have the same sign (both positive or both negative), which means in some cases the implementation won't work. Also thanks for your other suggestions; I will do some measurements to see which option works better.
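The pitfall can be demonstrated in a few lines: reinterpreting a negative short as an unsigned value, as the packing code does, flips its ordering relative to a positive one:

```cpp
#include <cassert>

// Signed vs. unsigned ordering of the same bit patterns.
bool signed_less(short a, short b) {
    return a < b;  // ordinary signed comparison
}

bool unsigned_less(short a, short b) {
    // The cast used when packing fields into an unsigned key:
    // -1 becomes 65535, so negative values sort above positive ones.
    return (unsigned short)a < (unsigned short)b;
}
```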

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 21, 2020

@hzeller
I tried some simple solutions like this:

      short left[6]  = {lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y};
      short right[6] = {rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y};
      unsigned int i = 0;
      while (i < 6) {
          if (left[i] == right[i]) {
              i++;
          } else if (left[i] < right[i]) {
              return true;
          } else {
              return false;
          }
      }
      return false;

With an implementation like this, the place time of picosoc is reduced from around 38s to around 36s, and gprof also reports around a 5% reduction in placement time. Do you think a solution like this is acceptable? I can also try more complicated solutions if necessary.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Let's try an even simpler solution first (the simpler, the easier to maintain in the future).

Can you performance compare your manual loop (while leaving the array assignment the same) with a call to

std::lexicographical_compare(left, left+6, right, right+6)

? It should do the same thing, but that way we give the standard library an opportunity to employ optimized versions that might be available on that system. So I suspect it will not be slower than the hand-written compare (but possibly faster), and it is at least shorter to write (and read).

Another thing I'd try: we don't even have to copy the elements out of the struct; we could just treat the struct as the array itself (if the struct is packed, so maybe compile-time assert that sizeof(t_override) == 6 * sizeof(short); if not, we actually want to make it a packed struct). We also assume that the compiler lays the fields out in exactly the sequence we give them; it is not strictly required to do so, but given that they are all the same type, it probably will.

   const short *left = reinterpret_cast<const short*>(&lhs);
   const short *right = reinterpret_cast<const short*>(&rhs);
   return std::lexicographical_compare(left, left+6, right, right+6);
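A sketch of the suggestion above with the compile-time layout guard in place (the t_override definition is a stand-in; the strict-aliasing implications of the cast are exactly the "trickery" under discussion, so this is illustrative rather than an endorsement):

```cpp
#include <algorithm>

// Stand-in struct; the real definition is in place_delay_model.h.
struct t_override {
    short from_type, to_type, from_class, to_class, delta_x, delta_y;
};

// Guard the layout assumption: no padding, so the short* view covers
// exactly the six fields in declaration order.
static_assert(sizeof(t_override) == 6 * sizeof(short),
              "t_override must be exactly six shorts for the cast below");

bool operator<(const t_override& lhs, const t_override& rhs) {
    const short* left = reinterpret_cast<const short*>(&lhs);
    const short* right = reinterpret_cast<const short*>(&rhs);
    return std::lexicographical_compare(left, left + 6, right, right + 6);
}
```

The static_assert turns a silent layout change (e.g. a new field of a different width) into a compile error instead of a subtly wrong comparison.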

@hzeller
Contributor

hzeller commented Mar 21, 2020

Looks like using std::lexicographical_compare vs. manual compare is very similar, in both cases the loop is unrolled with -O3.

And casting the struct to pretend it is an array of short is better still. So I'd go with reinterpret_cast<const short*>(&...) and std::lexicographical_compare()

https://godbolt.org/z/xVc53j

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 21, 2020

Based on https://godbolt.org/z/xVc53j it seems the results should be very similar, but when I actually run the test, the manually implemented code is ~2s faster than lexicographical_compare.

I have checked the assembly of vpr for the manual implementation and for lexicographical_compare, and it seems lexicographical_compare is not inlined properly even with the -O3 optimization flag.

I found a function call like this within the operator<:
callq 5455c0 <ZSt30__lexicographical_compare_implIPKsS1_N9__gnu_cxx5__ops15_Iter_less_iterEEbT_S5_T0_S6_T1.isra.96.lto_priv.4077>

and within lexicographical_compare_impl, the loop is not unrolled.

Using lexicographical_compare still reduces the place time for picosoc by around 2s, though, so I am not sure which will be preferred.

@hzeller
Contributor

hzeller commented Mar 21, 2020

Did you also try casting the address of the struct to const short*?

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 21, 2020

The code I dumped the assembly for is:

const short *left = reinterpret_cast<const short*>(&lhs);
const short *right = reinterpret_cast<const short*>(&rhs);
return std::lexicographical_compare(left, left+6, right, right+6);

and the place time for this version is around 36s (compared with 38s)

@hzeller
Contributor

hzeller commented Mar 21, 2020

So I'd say choose the fastest version you came across: if the manually written version is faster, use that (and add a comment that std::lexicographical_compare was considered but slightly slower). If const short* casting is better than copy-to-array, use that.

@xuqinziyue
Contributor Author

Based on my experiments, const short* casting is faster than copy-to-array, so the fastest version so far is const short* casting + manual comparison.

Thank you so much for your help! I will create a pull request and run some qor benchmark tests later today!

@mithro
Contributor

mithro commented Mar 21, 2020

@xuqinziyue

> I have checked the assembly file of vpr for the manually implementation and lexicographical_compare and it seems lexicographical_compare is not inlined properly even with -O3 optimization flag.

I think we should figure out why the operator is not being inlined and unrolled correctly. If it is not happening here, then there are probably other places where it should be happening that are getting missed.

My naive guess, with no evidence, is that it is likely something to do with compile units. BTW, are we using LTO?

@xuqinziyue
Contributor Author

> I think we should figure out why using the operator is not inlining and unrolling correctly. If it is not happening here then there are probably other places that it should be happening that are getting missed.
>
> My naive guess with no evidence is that it is likely to be something to do with compile units? BTW Are we using LTO?

I think LTO is currently used, since -flto is there, but I am not familiar with LTO, so here is the sample command for linking vpr:

cd /scratch/local/xuqinziy/vtr-verilog-to-routing/build/vpr && /tools/cmake/install/3.9.0/bin/cmake -E cmake_link_script CMakeFiles/vpr.dir/link.txt --verbose=1
/usr/bin/c++ -O3 -DNDEBUG -flto -fno-fat-lto-objects CMakeFiles/vpr.dir/src/main.cpp.o CMakeFiles/vpr.dir/resources.C.o -o vpr libvpr.a ../libs/libarchfpga/libarchfpga.a ../libs/libpugiutil/libpugiutil.a ../libs/EXTERNAL/libsdcparse/libsdcparse.a ../libs/EXTERNAL/libblifparse/libblifparse.a ../libs/EXTERNAL/libtatum/libtatum/libtatum.a ../libs/EXTERNAL/libargparse/libargparse.a ../libs/EXTERNAL/libpugixml/libpugixml.a ../libs/EXTERNAL/libezgl/libezgl.a -lgtk-3 -lgdk-3 -latk-1.0 -lgio-2.0 -lpangocairo-1.0 -lgdk_pixbuf-2.0 -lcairo-gobject -lpango-1.0 -lcairo -lgobject-2.0 -lglib-2.0 -lX11 ../libs/libvtrcapnproto/liblibvtrcapnproto.a ../libs/libvtrutil/libvtrutil.a ../libs/liblog/liblog.a ../libs/EXTERNAL/capnproto/c++/src/capnp/libcapnp.a ../libs/EXTERNAL/capnproto/c++/src/kj/libkj.a -lpthread

@mithro
Contributor

mithro commented Mar 22, 2020

@xuqinziyue If the comparison method ends up in a separate compile unit from where it is used, then the compiler will be unable to inline things.

@xuqinziyue
Contributor Author

@mithro
The function that is not inlined is std::lexicographical_compare, and I am not sure what rules apply to a standard library function like this.

@mithro
Contributor

mithro commented Mar 22, 2020

Just FYI - that function call is the following according to c++filt
<bool std::__lexicographical_compare_impl<short const*, short const*, __gnu_cxx::__ops::_Iter_less_iter>(short const*, short const*, short const*, short const*, __gnu_cxx::__ops::_Iter_less_iter) [clone .isra.96] [clone .lto_priv.4077]>

@xuqinziyue
Contributor Author

Just for more clarification: I don't think there is a way to force the compiler to always inline a function from a library. I haven't found anything that suggests std::lexicographical_compare is defined to be an inline function, and even if a function is declared inline within the translation unit, the compiler can still choose not to inline it.
So isn't it the compiler's choice whether to inline the function or not?

@mithro
Contributor

mithro commented Mar 22, 2020

/* Prototype.  */
inline void foo (const char) __attribute__((always_inline));

@vaughnbetz
Contributor

Easton: we were just discussing this in a meeting.
There is still confusion about why this isn't inlined properly by the compiler; ideally the fix would be to get it inlined. Maybe there is a bigger gain to be had if we figure out why inlining isn't happening (if it is a general issue)?

  1. Does Tim's always_inline work?
  2. Try make build_type = release_pgo to see if that also captures the cpu gains.
  3. Is this also an issue on bigger designs (VTR, Titan)? We need QoR data for them.

@xuqinziyue
Contributor Author

I have done some experiment with always_inline:

When operator< is implemented as follows:

 friend inline bool operator<(const t_override& lhs, const t_override& rhs) __attribute__((always_inline)) {
     const short* left = reinterpret_cast<const short*>(&lhs);
     const short* right = reinterpret_cast<const short*>(&rhs);
     return std::lexicographical_compare(left, left + 6, right, right + 6);
 }

The operator< and the lexicographical_compare seem to be inlined properly by the compiler, and the performance is similar to the manual implementation.

However when I change the implementation to

 friend inline bool operator<(const t_override& lhs, const t_override& rhs) __attribute__((always_inline)) {
     return std::tie(lhs.from_type, lhs.to_type, lhs.from_class, lhs.to_class, lhs.delta_x, lhs.delta_y)
          < std::tie(rhs.from_type, rhs.to_type, rhs.from_class, rhs.to_class, rhs.delta_x, rhs.delta_y);
 }

The operator< is not inlined properly again, and there is no reduction in execution time.

I have collected QoR for the manual implementation and it looks like this:

| Metric | original_results.txt | modified_results.txt |
|---|---|---|
| vtr_flow_elapsed_time | 1 | 0.997897361 |
| odin_synth_time | 1 | 0.939841239 |
| abc_depth | 1 | 1 |
| abc_synth_time | 1 | 1.058311689 |
| num_clb | 1 | 1 |
| num_memories | 1 | 1 |
| num_mult | 1 | 1 |
| max_vpr_mem | 1 | 1.001470174 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 0.981313116 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.950503869 |
| placed_CPD_est | 1 | 1 |
| min_chan_width | 1 | 1 |
| routed_wirelength | 1 | 1 |
| min_chan_width_route_time | 1 | 1.01539243 |
| crit_path_routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 0.991334014 |

I can collect QoR and Titan results for the inline change and try the build types to see how it goes.

@vaughnbetz
Contributor

Thanks Easton. That code is significantly easier to read (it would still need a clear comment explaining what it does and why it is coded that way). No idea why it doesn't inline as a friend function, but as long as it inlines with one option we hopefully get the CPU improvement. Getting full QoR data makes sense.

@xuqinziyue
Contributor Author

xuqinziyue commented Mar 31, 2020

Updated QoR results for the inline change with the vtr master branch as of March 18. It seems the modified code generally runs faster this time, maybe due to the machine (there should be no change in pack time and route time), but the change in place time is larger, which indicates the effect of function inlining.

| Metric | original_results.txt | modified_results.txt |
|---|---|---|
| vtr_flow_elapsed_time | 1 | 0.98708214 |
| odin_synth_time | 1 | 1.078524821 |
| abc_depth | 1 | 1 |
| abc_synth_time | 1 | 0.986797326 |
| num_clb | 1 | 1 |
| num_memories | 1 | 1 |
| num_mult | 1 | 1 |
| max_vpr_mem | 1 | 1.000129628 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 0.974220108 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.94762915 |
| placed_CPD_est | 1 | 1 |
| min_chan_width | 1 | 1 |
| routed_wirelength | 1 | 1 |
| min_chan_width_route_time | 1 | 1.01647566 |
| crit_path_routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 0.969925067 |

@xuqinziyue
Contributor Author

An update for Titan results for the inline change:

The following 3 tests did not run due to running out of disk space; the data collected is based on the other tests:
segmentation_stratixiv_arch_timing.blif
SLAM_spheric_stratixiv_arch_timing.blif
des90_stratixiv_arch_timing.blif

| Metric | titan_original_results.txt | titan_modified_results.txt |
|---|---|---|
| vtr_flow_elapsed_time | 1 | 0.975597952 |
| num_LAB | 1 | 1 |
| num_DSP | 1 | 1 |
| num_M9K | 1 | 1 |
| num_M144K | 1 | 1 |
| max_vpr_mem | 1 | 1.000010433 |
| num_pre_packed_blocks | 1 | 1 |
| num_post_packed_blocks | 1 | 1 |
| device_grid_tiles | 1 | 1 |
| pack_time | 1 | 1.01225225 |
| placed_wirelength_est | 1 | 1 |
| place_time | 1 | 0.954590719 |
| placed_CPD_est | 1 | 1 |
| routed_wirelength | 1 | 1 |
| critical_path_delay | 1 | 1 |
| crit_path_route_time | 1 | 1.027508016 |

The Titan results also suggest an approximately 5% reduction in place time.

@vaughnbetz
Contributor

Looks good to me. I think you just have to write some nice comments and commit.

@mithro
Contributor

mithro commented Apr 4, 2020

Let's land your improvement.

I do think we should investigate further what is going on with the std::tie behavior you mentioned here -> #1225 (comment), but that is a task for someone in the future.

BTW, to use the always_inline attribute in a portable fashion, see the example in the Chromium source code at https://chromium.googlesource.com/chromium/src/base/+/refs/heads/master/compiler_specific.h#42

It looks like it should also be supported by modern clang -> https://clang.llvm.org/docs/LanguageExtensions.html#has-attribute
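A minimal sketch of such a portability shim, in the spirit of Chromium's compiler_specific.h (the macro name VTR_ALWAYS_INLINE is hypothetical, not Chromium's or VTR's actual name):

```cpp
// Portable always-inline macro: GCC/Clang use the attribute, MSVC uses
// __forceinline, and anything else falls back to plain inline.
// VTR_ALWAYS_INLINE is a made-up name for illustration.
#if defined(__clang__) || defined(__GNUC__)
#define VTR_ALWAYS_INLINE inline __attribute__((always_inline))
#elif defined(_MSC_VER)
#define VTR_ALWAYS_INLINE __forceinline
#else
#define VTR_ALWAYS_INLINE inline
#endif

// Example use: a small comparison helper the compiler must inline.
VTR_ALWAYS_INLINE bool less_short(short a, short b) {
    return a < b;
}
```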


github-actions bot commented May 2, 2025

This issue has been inactive for a year and has been marked as stale. It will be closed in 15 days if it continues to be stale. If you believe this is still an issue, please add a comment.

@vaughnbetz
Contributor

The always_inline solution was implemented and merged, so this issue is closed.
