Add PlacerSetupSlacks interface #1450

Bill-hbrhbr · 2020-07-23T23:53:34Z

Description

Add an interface for a 1-to-1 mapping between CLB pins and setup slacks from the timing analyzer. Refactored place.cpp (replaced recompute_criticalities with three new routines) so that setup slacks and criticalities can be updated together or separately.

Types of changes

Bug fix (change which fixes an issue)
New feature (change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My change requires a change to the documentation
I have updated the documentation accordingly
I have added tests to cover my changes
All new and existing tests passed

Bill-hbrhbr · 2020-07-23T23:57:41Z

The open question is whether the three new routines in place.cpp:

update_setup_slacks
update_criticalities
update_setup_slacks_and_criticalities

could be merged into a single routine controlled by boolean flags.

The tradeoff is between code duplication and branch prediction. So far, only update_criticalities (originally recompute_criticalities) is called, and only occasionally, so it should not be a particularly hot function.

I'm inclined to merge these three routines into one, but in the future we might call these update routines frequently during the placement quench stage.
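For comparison, the merged alternative can be sketched as one routine controlled by two flags. This is a minimal illustration, not code from the PR: TimingUpdateStats and update_timing_values() are hypothetical, and the counters stand in for the real work done with SetupTimingInfo, PlacerCriticalities and PlacerSetupSlacks.

```cpp
#include <cassert>

// Hypothetical counters standing in for the placer's timing state.
struct TimingUpdateStats {
    int graph_updates = 0;    // times the timing graph was re-analyzed
    int slack_refreshes = 0;  // times raw setup slacks were re-read
    int crit_refreshes = 0;   // times criticalities were recomputed
};

// Hypothetical merged routine: one entry point with two flags, replacing
// update_setup_slacks(), update_criticalities() and
// update_setup_slacks_and_criticalities().
void update_timing_values(TimingUpdateStats& stats, bool update_slacks, bool update_crits) {
    if (!update_slacks && !update_crits) {
        return;  // nothing requested: skip the expensive timing-graph update
    }
    ++stats.graph_updates;  // stands in for timing_info->update()
    if (update_slacks) {
        ++stats.slack_refreshes;  // stands in for refreshing the setup slacks
    }
    if (update_crits) {
        ++stats.crit_refreshes;  // stands in for refreshing the criticalities
    }
}
```

In this sketch the two flag tests run once per call rather than once per pin, so their branch cost should be small next to the timing-graph update itself; the trade-off is then mostly about readability.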

vaughnbetz · 2020-07-27T21:56:06Z

vpr/src/place/timing_place.h

+ * clustered netlist pins/connections which have had their setup slacks modified by 
+ * the last call to update_setup_slacks().
+ */
+class PlacerSetupSlacks {


Explain that these are raw slacks: they can be negative, and they are based on the timing constraint specified by the user or, if no user-specified constraint exists, on the very hard-to-meet timing constraints set automatically.

vaughnbetz · 2020-07-27T21:59:03Z

It is possible that this would be cleaner with one function that always updated criticalities and slacks. Once you've played with the algorithm some, you could test if always updating both criticalities and slacks in one routine has a significant cost or not.

…cks. The matrix update is incremental, driven by the pins with modified setup slacks returned from PlacerSetupSlacks. The outer loop routine now updates both setup slacks and criticalities, while the inner loop routine passes in variables that determine the strategies/cost functions used to evaluate the effectiveness of try_swap moves.
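A minimal sketch of the incremental matrix update described above, assuming a plain [net][pin] matrix and a callback that returns the analyzer's current slack for a pin. NetPin, SlackMatrix and refresh_modified_slacks() are hypothetical names, not VPR APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical (net, pin) handle; pin 0 on each net is the driver.
struct NetPin {
    std::size_t net;
    std::size_t pin;
};

// [net][pin] matrix of setup slacks cached by the placer.
using SlackMatrix = std::vector<std::vector<float>>;

// Incremental refresh: touch only the pins the analyzer reports as modified,
// instead of sweeping every sink pin in the netlist.
void refresh_modified_slacks(SlackMatrix& cached,
                             const std::vector<NetPin>& modified_pins,
                             const std::function<float(const NetPin&)>& analyzer_slack) {
    for (const NetPin& p : modified_pins) {
        assert(p.pin > 0 && "only sink pins carry connection slacks");
        cached[p.net][p.pin] = analyzer_slack(p);
    }
}
```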

…a series of successful moves made by try_swap. Right now the data structures representing the state variables are copied directly; however, the process could be optimized with incremental techniques. The snapshot routines are called in the placer's inner loop and should be used together with the VPR option quench_recompute_divider or, less optimally, inner_recompute_divider (the latter would be too time-consuming in practice).
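The full-copy snapshot approach can be sketched as follows. PlacerTimingState and PlacerSnapshot are hypothetical; the fields are chosen to mirror the connection delay, setup slack and timing cost data mentioned in this PR, not VPR's actual layout.

```cpp
#include <cassert>
#include <vector>

// Hypothetical bundle of the placer state that a quench move can change.
struct PlacerTimingState {
    std::vector<float> connection_delay;
    std::vector<float> connection_setup_slack;
    double timing_cost = 0.0;
};

// Full-copy snapshot: save() copies every structure, restore() copies it back.
// An incremental version would record only the entries touched since save().
class PlacerSnapshot {
  public:
    void save(const PlacerTimingState& state) {
        saved_ = state;
        valid_ = true;
    }
    void restore(PlacerTimingState& state) const {
        assert(valid_ && "restore() called before save()");
        state = saved_;
    }

  private:
    PlacerTimingState saved_;
    bool valid_ = false;
};
```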

Bill-hbrhbr · 2020-08-14T19:29:05Z

VTR benchmarks @ commit 112bde5: https://drive.google.com/file/d/1JRfVQWIme2yldLiOQBPWuD9R2-akIrl_/view?usp=sharing

vpr/src/place/place.cpp

vaughnbetz · 2020-08-21T00:36:29Z

vpr/src/place/place.cpp

-                                    ClusteredPinTimingInvalidator* pin_timing_invalidator,
-                                    SetupTimingInfo* timing_info,
-                                    t_placer_costs* costs) {
+static void initialize_timing_info(float crit_exponent,


Somewhere in this file (options listing? type or variable definition of slacks and criticalities?) you'll need to explain exactly what a criticality and a slack are, and how they differ: criticality is relative, between 0 and 1, while slack is absolute and can be positive or negative (unrelaxed, raw slack based on the timing constraint, I think).
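A small illustration of the distinction the reviewer asks for. raw_setup_slack() uses the usual definition (required time minus arrival time). illustrative_criticality() is a simplified normalization chosen for this sketch, not the formula Tatum or VPR uses; timing_weight() assumes the placer raises criticality to crit_exponent, as the crit_exponent parameter suggests. All three function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Raw (unrelaxed) setup slack: required time minus arrival time.
// Positive: the constraint is met with margin. Negative: it is violated.
float raw_setup_slack(float arrival_time, float required_time) {
    return required_time - arrival_time;
}

// Illustrative normalization only (NOT Tatum's exact formula): map a slack to
// a relative criticality clamped to [0, 1], where 1 is most critical.
float illustrative_criticality(float slack, float max_required_time) {
    assert(max_required_time > 0.f);
    float crit = 1.f - slack / max_required_time;
    return std::min(1.f, std::max(0.f, crit));
}

// Timing weight used by the cost function: criticality sharpened by crit_exponent.
float timing_weight(float criticality, float crit_exponent) {
    return std::pow(criticality, crit_exponent);
}
```

Note that two connections with slacks of -2 and -5 both map to criticality 1 here, while their raw slacks still tell them apart; that is the information a slack-based metric preserves.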

vaughnbetz · 2020-08-21T00:45:01Z

vpr/src/place/place.cpp

+}
+
+/* Update the connection_timing_cost values from the temporary *
+ * values for all connections that have changed.               */


Explain that this is part of the process of committing a move.
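A sketch of the commit step, assuming flat arrays indexed by a connection id (VPR uses per-net, per-pin matrices). TimingCostArrays and commit_timing_costs() are hypothetical; returning the cost delta is an addition for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat arrays indexed by a connection id.
struct TimingCostArrays {
    std::vector<double> connection_timing_cost;           // committed values
    std::vector<double> proposed_connection_timing_cost;  // values for the proposed move
};

// Commit step of an accepted move: copy the temporary (proposed) values into
// the committed array for the affected connections only, and return how much
// the total timing cost changed. A rejected move never calls this, so the
// committed array stays untouched.
double commit_timing_costs(TimingCostArrays& arrays,
                           const std::vector<std::size_t>& affected_connections) {
    double delta = 0.0;
    for (std::size_t c : affected_connections) {
        delta += arrays.proposed_connection_timing_cost[c] - arrays.connection_timing_cost[c];
        arrays.connection_timing_cost[c] = arrays.proposed_connection_timing_cost[c];
    }
    return delta;
}
```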

vaughnbetz · 2020-08-21T00:46:02Z

vpr/src/place/place.cpp

@@ -1840,31 +2163,21 @@ static void revert_td_cost(const t_pl_blocks_to_be_moved& blocks_affected) {
 //
 //Relies on proposed_connection_delay and connection_delay to detect
 //which connections have actually had their delay changed.


Explain the purpose of the routine (to figure out which connections have had their delays changed so that we can perform incremental timing analysis). Do we need to call this after every committed move or after every proposed move? Explain a bit more about its use here.
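A minimal sketch of the delay-change detection, assuming flat arrays indexed by connection id and a caller-supplied list of candidate connections (e.g., those touched by the moved blocks). The function name is hypothetical. Whether it should run per proposed or per committed move is the reviewer's open question; the sketch works the same either way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the connections whose proposed delay differs from the committed
// delay. Only these need their timing-graph edges invalidated before an
// incremental timing update; unchanged connections keep their old results.
std::vector<std::size_t> connections_with_changed_delay(
    const std::vector<float>& connection_delay,
    const std::vector<float>& proposed_connection_delay,
    const std::vector<std::size_t>& candidate_connections) {
    std::vector<std::size_t> changed;
    for (std::size_t c : candidate_connections) {
        // Exact comparison is intentional: we only ask whether a different value was written.
        if (proposed_connection_delay[c] != connection_delay[c]) {
            changed.push_back(c);
        }
    }
    return changed;
}
```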

vaughnbetz · 2020-08-21T00:48:37Z

vpr/src/place/place.cpp

@@ -2096,6 +2409,8 @@ static void alloc_and_load_placement_structs(float place_cost_exp,
        connection_delay = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, 0.f);
        proposed_connection_delay = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, 0.f);

+        connection_setup_slack = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, std::numeric_limits<float>::infinity());


Make sure there is a comment somewhere on what this data structure is and give an example of how to access it e.g. matrix of setup slacks on all connections indexed from [0..num_nets-1][1..num_pins_on_net-1]. (check if I got that right).
Can comment where the variable is declared or here, depending on what seems like a better spot.
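A simplified stand-in for the ragged matrix, assuming the [net][pin] layout the reviewer describes (pin 0 is the driver; sinks are pins 1..num_pins_on_net-1). NetPinsMatrix and count_known_slacks() are hypothetical, not the vtr container returned by make_net_pins_matrix(); the infinity initializer matches the diff above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One row per net, one entry per pin on that net. Pin 0 is the driver, so
// connection (sink) values live at [net][1..num_pins_on_net-1].
class NetPinsMatrix {
  public:
    NetPinsMatrix(const std::vector<std::size_t>& pins_per_net, float init) {
        rows_.reserve(pins_per_net.size());
        for (std::size_t n : pins_per_net) {
            rows_.emplace_back(n, init);  // n pins, all set to init
        }
    }
    std::size_t num_nets() const { return rows_.size(); }
    std::size_t num_pins(std::size_t net) const { return rows_[net].size(); }
    float& operator()(std::size_t net, std::size_t pin) { return rows_[net][pin]; }
    float operator()(std::size_t net, std::size_t pin) const { return rows_[net][pin]; }

  private:
    std::vector<std::vector<float>> rows_;
};

// Count sink connections whose slack has been written (entries still at
// +infinity were never updated). The driver pin 0 is skipped.
std::size_t count_known_slacks(const NetPinsMatrix& slack) {
    std::size_t known = 0;
    for (std::size_t net = 0; net < slack.num_nets(); ++net) {
        for (std::size_t pin = 1; pin < slack.num_pins(net); ++pin) {
            if (!std::isinf(slack(net, pin))) {
                ++known;
            }
        }
    }
    return known;
}
```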

vaughnbetz · 2020-08-21T00:51:10Z

vpr/src/timing/timing_util.cpp

@@ -579,6 +579,23 @@ float calculate_clb_net_pin_criticality(const SetupTimingInfo& timing_info, cons
    return clb_pin_crit;
 }

+//Return the setup slack of a net's pin in the CLB netlist


Explain that it assumes the timing analysis is up to date (has already been performed with an update call with the right delays).

vaughnbetz · 2020-08-21T00:52:03Z

vpr/src/place/timing_place.cpp

@@ -100,6 +107,77 @@ PlacerCriticalities::pin_range PlacerCriticalities::pins_with_modified_criticali
    return vtr::make_range(cluster_pins_with_modified_criticality_);
 }



Are some of these routines rewrites / replacements for routines in place.cpp? I haven't checked line by line but some seem similar.

vaughnbetz · 2020-08-21T00:55:16Z

Hi Bill,

The code overall looks clean. I've embedded a bunch of requests for comments and feedback, though.
Themes:

Comment each function and new data structure, explaining what it does and how it is to be used (e.g. must the delays and/or the slacks on the timing graph be up to date before calling specific routines?).
There are some long argument lists (which predate you, but are getting even longer). Could you factor them into a pointer to a higher-level structure to shorten them, and would that be useful? This is a bit of a judgement call, but please take a look.
It looks like you've moved some of your code to timing_place.cpp. If you can move as much as possible into timing_place.cpp (or a new file), that would be good: place.cpp is already too big, so we should avoid adding to it.
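The argument-list refactoring suggested above can be sketched like this. PlacerCosts, total_cost_before() and total_cost() are hypothetical, and the normalized weighted-sum cost is an illustrative assumption rather than the exact VPR cost function.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical "before": every cost term travels as its own argument, so call
// sites are long and two doubles can be swapped without a compiler error.
double total_cost_before(double bb_cost, double timing_cost,
                         double bb_cost_norm, double timing_cost_norm,
                         double timing_tradeoff) {
    return (1.0 - timing_tradeoff) * bb_cost * bb_cost_norm +
           timing_tradeoff * timing_cost * timing_cost_norm;
}

// Hypothetical "after": related values grouped into one structure.
struct PlacerCosts {
    double bb_cost = 0.0;
    double timing_cost = 0.0;
    double bb_cost_norm = 1.0;      // assumed: 1 / previous bb_cost
    double timing_cost_norm = 1.0;  // assumed: 1 / previous timing_cost
};

double total_cost(const PlacerCosts& costs, double timing_tradeoff) {
    return (1.0 - timing_tradeoff) * costs.bb_cost * costs.bb_cost_norm +
           timing_tradeoff * costs.timing_cost * costs.timing_cost_norm;
}
```

The grouped form also means a new cost term is added in one place rather than to every signature along the call chain.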

Bill-hbrhbr · 2020-08-24T07:15:15Z

Hi @vaughnbetz, I managed to reduce place.cpp from ~3300 lines to ~2800 lines by moving routines and structures to other files (mainly annealing state variables and timing update routines), and that's with new/enhanced documentation (also in Doxygen style) on many of these routines/data structures.
I also shortened the argument lists of the annealing routines by combining the annealing state arguments and cost structures. The file is definitely cleaner than before.
I will continue to update the documentation, notably the differences between slacks and criticalities and how to use them. For now, though, I will try not to touch try_swap() or placement_inner_loop(), since there are other PRs in progress that may affect them.

HackerFoo

I've added some comments, but could you rebase your PR to perform refactoring first in one or more commits, then to add features? It's hard to see what you've added vs. what is just moving code around.

HackerFoo · 2020-08-24T21:04:20Z

libs/libvtrutil/src/vtr_vec_id_set.h

@@ -2,6 +2,7 @@
 #define VTR_SET_H

 #include <vector>
+#include <algorithm>


Why was this added?

After I moved the code around, the compiler reported error: 'sort' is not a member of 'std' at line 79 of this file. That's why I added it.

HackerFoo · 2020-08-24T21:12:17Z

vpr/src/place/place.cpp

+ * place_global.h. These variables were originally local to the current file.  *
+ * However, they were moved so as to facilitate moving some of the routines    *
+ * in the current file into other source files.                                *
+ *******************************************************************************/


It seems like they were exported rather than moved. This seems to expose a lot of details; why not wrap these globals into a singleton (global) object? This will make it easier to duplicate these data structures e.g. to allow parallelism.

I can wrap them into a single global structure and provide accessor functions for them (so that I don't have to declare them as extern), but I don't think there's a need to duplicate them, as they all serve different purposes and are used by place.cpp only once per VPR flow. Also, I think the placement flow is rather sequential in nature, so I don't know whether TBB multithreading can be used to make the placer faster.

After giving it more thought, I think I'll first leave it as it is in the current PR, and then refactor it into something similar to vpr_context.h in a new PR (something like g_place_ctx or g_place_structs). That will reduce the amount of clutter in the code.
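A minimal sketch of the singleton-style wrapper discussed above, assuming a function-local static as the single instance. PlacerGlobalState, placer_state() and count_nets_to_update() are hypothetical; ints stand in for ClusterNetId, and only two of the globals from place_global.h are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for what are currently separate extern globals.
struct PlacerGlobalState {
    std::vector<double> net_timing_cost;  // per-net timing cost
    std::vector<int> ts_nets_to_update;   // nets touched by the current move
};

// One accessor replaces many extern declarations. The function-local static
// is constructed on first use (thread-safe initialization since C++11).
PlacerGlobalState& placer_state() {
    static PlacerGlobalState state;
    return state;
}

// Routines that take the state by reference, instead of reaching for the
// global, are what would later allow several independent copies to exist.
std::size_t count_nets_to_update(const PlacerGlobalState& state) {
    return state.ts_nets_to_update.size();
}
```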

HackerFoo · 2020-08-24T21:20:55Z

vpr/src/place/place_util.cpp

+
+    *rlim *= (1. - 0.44 + success_rat);
+    *rlim = std::max(std::min(*rlim, upper_lim), 1.f);
+}


Missing newline
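The range-limit update in the diff above can be exercised in isolation. The signature of update_rlim() is a simplified assumption (upper_lim is passed in directly); the two statements in the body are the ones from the diff. Reading the code, rlim is unchanged when success_rat equals 0.44, grows when more moves are accepted, shrinks when fewer are, and is always clamped to [1, upper_lim].

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Scale the range limit by (1 - 0.44 + success_rat), then clamp it to [1, upper_lim].
void update_rlim(float* rlim, float success_rat, float upper_lim) {
    *rlim *= (1. - 0.44 + success_rat);
    *rlim = std::max(std::min(*rlim, upper_lim), 1.f);
}
```

For example, starting from rlim = 10 with no accepted moves (success_rat = 0), the window shrinks to 5.6.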

HackerFoo · 2020-08-24T21:22:11Z

vpr/src/place/place_global.h

+extern vtr::vector<ClusterNetId, double> net_timing_cost;
+extern vtr::vector<ClusterNetId, t_bb> bb_coords, bb_num_on_edges;
+extern vtr::vector<ClusterNetId, t_bb> ts_bb_coord_new, ts_bb_edge_new;
+extern std::vector<ClusterNetId> ts_nets_to_update;


Missing newline

Bill-hbrhbr · 2020-08-24T22:01:25Z

Hi @HackerFoo, this PR got quite big because I cleaned up place.cpp over the past few days. All of the new features I've added are contained in:

the try_swap() routine,
the new PlacerSetupSlacks class in timing_place.h/cpp,
and update_setup_slacks_and_criticalities() in place_timing_update.h/cpp.

My steps before refactoring place.cpp (also writing this down for future reference):

1. PlacerSetupSlacks imitates the PlacerCriticalities class and is used to get setup slacks from Kevin's Tatum timing analyzer.
2. I changed the original recompute_criticalities() to update_setup_slacks_and_criticalities() so that both kinds of values can be updated.
3. This creates a problem. Originally, each time the timing graph was updated by calling timing_info->update(), the criticalities were updated as well. These updates are incremental: they only check the modified atom pins returned by the timing analyzer. Now, however, the timing graph might be updated only for the setup slacks, without the criticalities being updated. On the next criticality update, the incremental technique can no longer be used, since more pins will have been modified than the ones the timing analyzer currently returns. In this case, the criticalities have to be recomputed from scratch (going through every sink pin).
4. To resolve this issue, I created the t_timing_update_mode class to keep track of the slack/criticality update status. This structure resides in place_timing_update.h. It already has very long and detailed documentation, so I won't paste it here.
5. With the timing infrastructure done, I needed to implement setup slack analysis during the placement quench stage. So I added a VPR option, --place_quench_metric, that lets the user choose whether to turn on the new analysis technique in the quench.
6. I modified the try_swap() routine so that, when using slack analysis, it performs a timing update to get the slacks, analyzes them, and then decides whether to accept the move based on the slack improvement. If the move is accepted, the timing update is kept; if it is rejected, I revert the timing update so that the slacks in the timing graph return to their original values.
7. Note that these timing updates do not involve changing criticalities (hence the considerations in points 3 and 4), and I've done a lot of testing to make sure that the timing update reversion works.
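The decision described above, whether the next criticality update can be incremental or must start from scratch, can be sketched with a counter of timing-graph updates. CritUpdateTracker is hypothetical and much simpler than the documented t_timing_update_mode; it only assumes that the analyzer's modified-pin list covers the most recent graph update.

```cpp
#include <cassert>

enum class CritUpdateStrategy { INCREMENTAL, FROM_SCRATCH };

// Hypothetical tracker. An incremental criticality refresh is valid only if
// at most one timing-graph update happened since the last refresh.
class CritUpdateTracker {
  public:
    // Call after every timing_info->update(), including slack-only updates
    // and the updates that revert a rejected move.
    void on_timing_graph_update() { ++graph_updates_since_refresh_; }

    // Decide how the next criticality refresh must be done, then reset.
    CritUpdateStrategy begin_criticality_refresh() {
        CritUpdateStrategy strategy =
            (never_refreshed_ || graph_updates_since_refresh_ > 1)
                ? CritUpdateStrategy::FROM_SCRATCH
                : CritUpdateStrategy::INCREMENTAL;
        never_refreshed_ = false;
        graph_updates_since_refresh_ = 0;
        return strategy;
    }

  private:
    bool never_refreshed_ = true;  // the first refresh must visit every sink pin
    int graph_updates_since_refresh_ = 0;
};
```

A slack-only update followed by its reversion counts as two graph updates, so the next criticality refresh in this sketch falls back to a full recomputation.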

I would also like to refactor before adding changes, but right now many of my team members are changing place.cpp, so getting the refactoring into master first might complicate their work. However, I'm familiar enough with place.cpp that I should be able to resolve merge conflicts fairly easily after their PRs are merged.

If it is still inconvenient for you to review and comment, I can open a draft PR at one of my intermediate commits (before the refactoring) so that you can see only the new features being added. The downside is that many of my comments had not yet been written or enhanced at that point, so it might be harder to interpret, though what I've written here may help.

Bill-hbrhbr · 2020-08-26T22:23:36Z

VTR benchmarks @ commit 74d279c: https://drive.google.com/file/d/1n2Txadj3pXm-CGq6kbEPbdT6hh7RzRb4/view?usp=sharing
Titan benchmarks @ commit 74d279c: https://drive.google.com/file/d/1igBYhXExdT2Uq4JyFeIbWXojWf5NZlS8/view?usp=sharing

The results have degraded slightly. I will rerun the benchmarks (with -j # parallel) to see whether this recurs, since I doubt that it should happen.

On another note, this PR is failing the golden results of benchmarks/blif/4/apex4.blif and benchmarks/blif/4/tseng.blif for arch/bidir/k4_n4_v7_bidir.xml. However, it doesn't fail -check_golden when I run it on Wintermute, which is strange.

vaughnbetz · 2020-08-27T03:45:32Z

The QoR results (vtr and titan benchmarks) you link above look like a tie to me (a few metrics marginally up a few marginally down) except perhaps Titan placement time which is up ~10% while pack time is up 4% (so machine load may be a factor, but placement slowed down by more than packing). vtr benchmark placement time looks OK but they're much smaller.
Given the code cleanup and new feature I'd probably take the small placement slowdown for now and look for it later if necessary. If you're able to find something with profiling or other means and claw it back that would be great of course!

The QoR failure on apex4 and tseng is likely just CAD noise. It can fail even if the results are better (but by what we consider an unusual amount). I suggest taking a look at the qor compare and see if it looks out of whack, but on those small circuits if we got a change that's bigger than expected it is probably just noise and we can update the golden when you commit. Not failing on wintermute likely means the qor failure comes and goes with slightly different floating point round-off (due to some library or compiler or OS difference) making a different decision somewhere in the flow.
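To illustrate why a better result can still fail: a golden-result check that accepts a metric only when result/golden lies inside a two-sided range. The function is hypothetical, and the ratio form is an assumption; the actual per-metric tolerances are set by the VTR flow's pass requirements.

```cpp
#include <cassert>

// Hypothetical QoR check: pass only if result / golden lies within
// [min_ratio, max_ratio]. Because the range is two-sided, a result that is
// much *better* than golden fails too.
bool within_golden_range(double golden, double result, double min_ratio, double max_ratio) {
    if (golden == 0.0) {
        return result == 0.0;  // avoid dividing by zero
    }
    const double ratio = result / golden;
    return ratio >= min_ratio && ratio <= max_ratio;
}
```

For example, with a range of [0.9, 1.1], a result of 80 against a golden value of 100 fails even though it is 20% better.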

Bill-hbrhbr · 2020-08-28T02:32:30Z

VTR benchmarks @ commit 9f18666: https://drive.google.com/file/d/1GZHxZpBSBssbBNsPnb0aOdoBGRvmk3ia/view?usp=sharing
Titan benchmarks @ commit 9f18666: https://drive.google.com/file/d/1kmeJmhJ1BDdeA1GV8pDCk7j9FfTAYvHn/view?usp=sharing

I ran this set of benchmarks in a more controlled environment, and now the results make more sense: the pack time and the route time have no degradations.

Placement time is slower, as expected, possibly because:

There is more branching in the try_swap() routine.
The crucial place.cpp data structures are no longer file-scope.

vaughnbetz

A few comments so far ... (need to read further).

vaughnbetz · 2020-08-28T05:18:22Z

vpr/src/base/vpr_types.h

@@ -851,7 +851,8 @@ struct t_annealing_sched {
 * doPlacement: true if placement is supposed to be done in the CAD flow, false otherwise */
 enum e_place_algorithm {


Comment what each enum constant chooses.
I also think we should rename PATH_DRIVEN_TIMING_PLACE to CRITICALITY_TIMING_PLACE (it would be clearer; PATH_DRIVEN is no longer very relevant, as all our timing data comes from timing paths).

vaughnbetz · 2020-08-28T05:19:28Z

vpr/src/base/vpr_types.h

@@ -889,6 +890,12 @@ enum class e_place_delta_delay_algorithm {
    DIJKSTRA_EXPANSION,
 };

+enum class e_place_quench_metric {


Comment what each enum value does. Instead of TIMING_COST I think we should call the first one TIMING_CRITICALITY (or TIMING_CRITICALITY_COST if you prefer).

vaughnbetz · 2020-08-28T05:20:04Z

vpr/src/base/vpr_types.h

+    SETUP_SLACK,
+    AUTO
+};
+
 struct t_placer_opts {
    enum e_place_algorithm place_algorithm;


It would be good to add comments for each data member here too.
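One way to answer the request for per-value comments, sketched with the three option values named in this PR (auto, timing_cost, setup_slack). The meanings in the comments are inferred from this discussion (in particular, what AUTO selects is left to the implementation), and parse_place_quench_metric() is a hypothetical helper, not VPR's option parser.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Sketch of per-value documentation for the quench metric.
enum class e_place_quench_metric {
    TIMING_COST,  // evaluate quench moves with the criticality-weighted timing cost
                  // (the reviewer suggests renaming this to TIMING_CRITICALITY)
    SETUP_SLACK,  // evaluate quench moves with raw setup slacks (PlacerSetupSlacks)
    AUTO          // let the placer pick the metric
};

// Hypothetical parser for the --place_quench_metric values named in this PR.
e_place_quench_metric parse_place_quench_metric(const std::string& value) {
    if (value == "auto") return e_place_quench_metric::AUTO;
    if (value == "timing_cost") return e_place_quench_metric::TIMING_COST;
    if (value == "setup_slack") return e_place_quench_metric::SETUP_SLACK;
    throw std::invalid_argument("unknown --place_quench_metric value: " + value);
}
```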

Bill-hbrhbr · 2020-08-30T08:18:35Z

Closing this PR so that it is archived.

Bill-hbrhbr added 3 commits July 23, 2020 18:24

Added interface for mapping between CLB pins and setup slacks. Refact…

f4ea4a1

…ored PlacerCriticalities, and created PlacerSetupSlacks, so that they can choose between doing incremental V.S. from scratch updates.

Refactored criticalities update in place.cpp and added setup slacks u…

c024603

…pdate. Added checks to see if the updates need to be done from scratch or can be done incrementally

Fixe up format and compilation errors

cb6e9a6

Bill-hbrhbr requested review from kmurray and vaughnbetz July 23, 2020 23:53

Bill-hbrhbr self-assigned this Jul 23, 2020

probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jul 23, 2020

vaughnbetz reviewed Jul 27, 2020

Bill-hbrhbr added 12 commits July 31, 2020 00:41

Merged 3 update routines into 1 single routine

63db2e1

Resolve merge conflicts

aa7b233

Resolve more merge conflicts

e8f73c6

Changed crit_exponent to first_crit_exponent/state.crit_exponent

d80de58

Provided more complete explanation for the record_setup_slacks routine.

d329911

Merge the master branch into PlacerSetupSlacks (updating vtr flow

9056301

script)

Implemented do_setup_slack_cost_analysis: softmax of negative slacks

0e01ed7

Added single move reversion for setup slack analysis(rather than taki…

2e212dc

…ng placement snapshots). Currently experiencing consistency failures. Also updated slack analysis cost function: comparing the worse slack change across all modified clb pins
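The worst-slack-change comparison mentioned in this commit can be sketched as below. SlackChange and both functions are hypothetical, and the acceptance rule (reject if any modified pin loses slack) is a simplified greedy assumption; the rule in try_swap() may weigh other costs as well.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical per-pin record: setup slack before and after a proposed move.
struct SlackChange {
    float before;
    float after;
};

// Worst (most negative) slack change over all modified pins; +infinity if
// no pin was modified.
float worst_slack_change(const std::vector<SlackChange>& changes) {
    float worst = std::numeric_limits<float>::infinity();
    for (const SlackChange& c : changes) {
        worst = std::min(worst, c.after - c.before);
    }
    return worst;
}

// Simplified greedy (quench-style) rule: accept only if no modified pin lost slack.
bool accept_move_by_slack(const std::vector<SlackChange>& changes) {
    return worst_slack_change(changes) >= 0.0f;
}
```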

Corrected the timing update and reversion of setup slack analysis dur…

96e65ba

…ing the placement quench stage. Made commit_td_cost method incremental by only going through sink pins affected by the moved blocks.

Moved four boolean global variables controlling the timing update int…

112bde5

…o a new local structure called t_placer_timing_update_mode to tidy up the code.

Added vpr option --place_quench_metric to turn on/off setup slack ana…

29b55a3

…lysis during placement quench. Possible options are: auto, timing_cost, setup_slack.

vaughnbetz requested a review from HackerFoo August 21, 2020 00:30

vaughnbetz requested changes Aug 21, 2020

Bill-hbrhbr added 2 commits August 21, 2020 19:18

Merged t_placer_costs and t_placer_prev_inverse_costs and added

da55abf

corresponding documentation.

Reduced down the argument list for starting_t, placement_inner_loop, …

92c416a

…and try_swap by passing variables using t_annealing_state. Also moved first move_lim determination process to a separate routine

Bill-hbrhbr added 3 commits August 21, 2020 22:20

Moved t_placer_costs and t_annealing_state and related routines to pl…

870eca6

…acer_util.* files.

Changed major place.cpp data structures from file scope to global sco…

a2685c7

…pe. Moved delay routines to place_delay_model.*. Moved annealing update routines to place_util.*. Enhanced documentations.

Moved timing update routines from place.cpp to place_timing_update.*.…

cc4488e

… Enhanced documentation.

probot-autolabeler bot added the libvtrutil label Aug 24, 2020

HackerFoo suggested changes Aug 24, 2020

Bill-hbrhbr mentioned this pull request Aug 24, 2020

Add raw setup slack analysis to placement quench #1501

Merged

Bill-hbrhbr added 2 commits August 25, 2020 03:46

Enchanced documentation for timing_place.*. Moved chanx, chany 2d arr…

38f25cc

…ays to the placement global file. ALso fixed a bug with in class static constexpr variable compilation issue.

Added documentation for the timing driven routines used in try_swap()…

74d279c

… in place.cpp.

Merge branch 'master' into PlacerSetupSlacks

9f18666

vaughnbetz reviewed Aug 30, 2020

Bill-hbrhbr closed this Aug 30, 2020
