-
Notifications
You must be signed in to change notification settings - Fork 415
Add PlacerSetupSlacks interface #1450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PlacerSetupSlacks interface #1450
Conversation
…ored PlacerCriticalities, and created PlacerSetupSlacks, so that they can choose between doing incremental V.S. from scratch updates.
…pdate. Added checks to see if the updates need to be done from scratch or can be done incrementally
The current question is that the three new routines in place.cpp:
can be merged into a single routine by utilizing some boolean flags. The tradeoff is between code duplication and branch prediction. So far, only update_criticalities (originally recompute_criticalities) is called, and it is called once in a while (so it shouldn't be that hot of a function). I'm inclined to merge these three routines into a single one, but in the future, we might frequently call these update routines in the placement quench part of the code. |
* clustered netlist pins/connections which have had their setup slacks modified by | ||
* the last call to update_setup_slacks(). | ||
*/ | ||
class PlacerSetupSlacks { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain that these are raw slacks (can be negative, based on the timing constraint specified by the user or the very difficult automatically set timing constraints if no user-specified one exists).
It is possible that this would be cleaner with one function that always updated criticalities and slacks. Once you've played with the algorithm some, you could test if always updating both criticalities and slacks in one routine has a significant cost or not. |
…cks. The matrix update is incremental according to the pins with modified setup slacks returned from PlacerSetupSlacks. Outer loop routine now updates both setup slacks and criticalities, while the inner loop routine passes in variables that determine the strategies/cost functions used to evaluate the effectiveness of try_swap moves.
…a series of successful moves done by try_swap. Right now the data structures representing the state variables are directly being copied, however the process can possibly be optimized with incremental techniques. The snapshot routines are called in the placement's inner loop, and should be used together with VPR options quench_recompute_divider and less optimally inner_recompute_divider. The latter would be too time consuming in practice.
…ng placement snapshots). Currently experiencing consistency failures. Also updated slack analysis cost function: comparing the worse slack change across all modified clb pins
…ing the placement quench stage. Made commit_td_cost method incremental by only going through sink pins affected by the moved blocks.
…o a new local structure called t_placer_timing_update_mode to tidy up the code.
VTR benchmarks @ commit 112bde5: https://drive.google.com/file/d/1JRfVQWIme2yldLiOQBPWuD9R2-akIrl_/view?usp=sharing |
…lysis during placement quench. Possible options are: auto, timing_cost, setup_slack.
vpr/src/place/place.cpp
Outdated
ClusteredPinTimingInvalidator* pin_timing_invalidator, | ||
SetupTimingInfo* timing_info, | ||
t_placer_costs* costs) { | ||
static void initialize_timing_info(float crit_exponent, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somewhere in this file (options listing? type or variable definition of slacks and criticalities?) you'll need to detail exactly what a criticality and a slack are, and the difference (criticality is relative between 0 and 1, slack is absolute and can be +ve or negative (unrelaxed, raw slack based on timing constraint I think).
vpr/src/place/place.cpp
Outdated
} | ||
|
||
/* Update the connection_timing_cost values from the temporary * | ||
* values for all connections that have changed. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain this is part of the process of commiting a move.
vpr/src/place/place.cpp
Outdated
@@ -1840,31 +2163,21 @@ static void revert_td_cost(const t_pl_blocks_to_be_moved& blocks_affected) { | |||
// | |||
//Relies on proposed_connection_delay and connection_delay to detect | |||
//which connections have actually had their delay changed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain purpose of the routine (to figure out which connections have had their delays changed so we can incrementally timing analyze). Do we need to call this after every committed move or every proposed move? Explain a bit more about its use here.
@@ -2096,6 +2409,8 @@ static void alloc_and_load_placement_structs(float place_cost_exp, | |||
connection_delay = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, 0.f); | |||
proposed_connection_delay = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, 0.f); | |||
|
|||
connection_setup_slack = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, std::numeric_limits<float>::infinity()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure there is a comment somewhere on what this data structure is and give an example of how to access it e.g. matrix of setup slacks on all connections indexed from [0..num_nets-1][1..num_pins_on_net-1]. (check if I got that right).
Can comment where the variable is declared or here, depending on what seems like a better spot.
@@ -579,6 +579,23 @@ float calculate_clb_net_pin_criticality(const SetupTimingInfo& timing_info, cons | |||
return clb_pin_crit; | |||
} | |||
|
|||
//Return the setup slack of a net's pin in the CLB netlist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain that it assumes the timing analysis is up to date (has already been performed with an update call with the right delays).
@@ -100,6 +107,77 @@ PlacerCriticalities::pin_range PlacerCriticalities::pins_with_modified_criticali | |||
return vtr::make_range(cluster_pins_with_modified_criticality_); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are some of these routines rewrites / replacements for routines in place.cpp? I haven't checked line by line but some seem similar.
Hi Bill, The code overall looks clean. I've embedded a bunch of requests for comments and feedbacks though.
|
corresponding documentation.
…and try_swap by passing variables using t_annealing_state. Also moved first move_lim determination process to a separate routine
…acer_util.* files.
…pe. Moved delay routines to place_delay_model.*. Moved annealing update routines to place_util.*. Enhanced documentations.
… Enhanced documentation.
Hi @vaughnbetz, I managed to reduce down place.cpp from ~3300 lines to ~2800 lines by moving routines and structures to other files (mainly annealing state variables and timing update routines), and that's with new/enhanced documentation (also in Doxygen style) on a lot of these routines/data structures. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added some comments, but could you rebase your PR to perform refactoring first in one or more commits, then to add features? It's hard to see what you've added vs. what is just moving code around.
@@ -2,6 +2,7 @@ | |||
#define VTR_SET_H | |||
|
|||
#include <vector> | |||
#include <algorithm> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After moving around the code, the compiler gave me an error saying error: 'sort' is not a member of 'std'
in L79 of this file. That's why I added it.
* place_global.h. These variables were originally local to the current file. * | ||
* However, they were moved so as to facilitate moving some of the routines * | ||
* in the current file into other source files. * | ||
*******************************************************************************/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like they were exported rather than moved. This seems to expose a lot of details; why not wrap these globals into a singleton (global) object? This will make it easier to duplicate these data structures e.g. to allow parallelism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can wrap them into a single global structure and provide accessor functions for them (so that I don't have to declare them as extern), but I don't think there's a need to duplicate them, as they all serve different purposes and are only used by ```place.cpp`` once per VPR flow. Also, I think the placement flow is rather sequential in nature, so I don't know if tbb multithreading can be used to make the placer go faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After giving it more thought, I think I'll first leave it as it is in the current PR, and then refactor it into something similar to vpr_context.h in a new PR (something like g_place_ctx or g_place_structs). That will reduce the amount of clutter in the code.
vpr/src/place/place_util.cpp
Outdated
|
||
*rlim *= (1. - 0.44 + success_rat); | ||
*rlim = std::max(std::min(*rlim, upper_lim), 1.f); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing newline
vpr/src/place/place_global.h
Outdated
extern vtr::vector<ClusterNetId, double> net_timing_cost; | ||
extern vtr::vector<ClusterNetId, t_bb> bb_coords, bb_num_on_edges; | ||
extern vtr::vector<ClusterNetId, t_bb> ts_bb_coord_new, ts_bb_edge_new; | ||
extern std::vector<ClusterNetId> ts_nets_to_update; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing newline
Hi @HackerFoo, this PR got quite big since I did a clean-up of
My steps before refactoring
I would also like to refactor before adding changes, but right now a lot of my team members are trying to change If it is still inconvenient for you to check and make comments, I can open a draft PR at one of my intermediate commits (before the refactorization) so that you can see only the new features being added. The downside to that is that I haven't written/enhanced a lot of my comments by then. Might be harder to interpret, but might not with what I've written down here. |
…ays to the placement global file. ALso fixed a bug with in class static constexpr variable compilation issue.
VTR benchmarks @ commit 74d279c: https://drive.google.com/file/d/1n2Txadj3pXm-CGq6kbEPbdT6hh7RzRb4/view?usp=sharing The results have slightly degraded somehow. I will run the benchmarks again (with -j # parallel) to see if this is a recurring thing since I highly doubt that it should happen. On another note, this PR is failing the golden results of |
The QoR results (vtr and titan benchmarks) you link above look like a tie to me (a few metrics marginally up a few marginally down) except perhaps Titan placement time which is up ~10% while pack time is up 4% (so machine load may be a factor, but placement slowed down by more than packing). vtr benchmark placement time looks OK but they're much smaller. The QoR failure on apex4 and tseng is likely just CAD noise. It can fail even if the results are better (but by what we consider an unusual amount). I suggest taking a look at the qor compare and see if it looks out of whack, but on those small circuits if we got a change that's bigger than expected it is probably just noise and we can update the golden when you commit. Not failing on wintermute likely means the qor failure comes and goes with slightly different floating point round-off (due to some library or compiler or OS difference) making a different decision somewhere in the flow. |
VTR benchmarks @ commit 9f18666: https://drive.google.com/file/d/1GZHxZpBSBssbBNsPnb0aOdoBGRvmk3ia/view?usp=sharing I ran this set of benchmarks in a more controlled environment, and now the results make more sense: the pack time and the route time have no degradations. The place time is slower, as expected. Possibly since:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments so far ... (need to read further).
@@ -851,7 +851,8 @@ struct t_annealing_sched { | |||
* doPlacement: true if placement is supposed to be done in the CAD flow, false otherwise */ | |||
enum e_place_algorithm { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment what each enum constant chooses.
I also think we should PATH_DRIVEN_TIMING_PLACE to CRITICALITY_TIMING_PLACE (would be more clear; PATH_DRIVEN is no longer very relevant as all our timing data comes from timing paths).
@@ -889,6 +890,12 @@ enum class e_place_delta_delay_algorithm { | |||
DIJKSTRA_EXPANSION, | |||
}; | |||
|
|||
enum class e_place_quench_metric { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment what each enum value does. Instead of TIMING_COST I think we should call the first one TIMING_CRITICALITY (or TIMING_CRITICALITY_COST if you prefer).
SETUP_SLACK, | ||
AUTO | ||
}; | ||
|
||
struct t_placer_opts { | ||
enum e_place_algorithm place_algorithm; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add comments for each data member here too.
Close this PR so that it becomes archived. |
Description
Add an interface for a 1-to-1 mapping between CLB pins and setup slacks from the timing analyzer. Refactored place.cpp (replaced recompute_criticalities with three new routines) so that setup slacks and criticalities can be updated together or separately.
Types of changes
Checklist: