Skip to content

Add PlacerSetupSlacks interface #1450

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

Bill-hbrhbr
Copy link
Contributor

Description

Add an interface for a 1-to-1 mapping between CLB pins and setup slacks from the timing analyzer. Refactored place.cpp (replaced recompute_criticalities with three new routines) so that setup slacks and criticalities can be updated together or separately.

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

…ored PlacerCriticalities, and created PlacerSetupSlacks, so that they can choose between doing incremental V.S. from scratch updates.
…pdate. Added checks to see if the updates need to be done from scratch or can be done incrementally
@Bill-hbrhbr Bill-hbrhbr self-assigned this Jul 23, 2020
@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jul 23, 2020
@Bill-hbrhbr
Copy link
Contributor Author

Bill-hbrhbr commented Jul 23, 2020

The current question is that the three new routines in place.cpp:

  1. update_setup_slacks
  2. update_criticalities
  3. update_setup_slacks_and_criticalities

can be merged into a single routine by utilizing some boolean flags.

The tradeoff is between code duplication and branch prediction. So far, only update_criticalities (originally recompute_criticalities) is called, and it is called once in a while (so it shouldn't be that hot of a function).

I'm inclined to merge these three routines into a single one, but in the future, we might frequently call these update routines in the placement quench part of the code.

* clustered netlist pins/connections which have had their setup slacks modified by
* the last call to update_setup_slacks().
*/
class PlacerSetupSlacks {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain that these are raw slacks (can be negative, based on the timing constraint specified by the user or the very difficult automatically set timing constraints if no user-specified one exists).

@vaughnbetz
Copy link
Contributor

It is possible that this would be cleaner with one function that always updated criticalities and slacks. Once you've played with the algorithm some, you could test if always updating both criticalities and slacks in one routine has a significant cost or not.

…cks. The matrix update is incremental according to the pins with modified setup slacks returned from PlacerSetupSlacks. Outer loop routine now updates both setup slacks and criticalities, while the inner loop routine passes in variables that determine the strategies/cost functions used to evaluate the effectiveness of try_swap moves.
…a series of successful moves done by try_swap. Right now the data structures representing the state variables are directly being copied, however the process can possibly be optimized with incremental techniques. The snapshot routines are called in the placement's inner loop, and should be used together with VPR options quench_recompute_divider and less optimally inner_recompute_divider. The latter would be too time consuming in practice.
…ng placement snapshots). Currently experiencing consistency failures. Also updated slack analysis cost function: comparing the worse slack change across all modified clb pins
…ing the placement quench stage. Made commit_td_cost method incremental by only going through sink pins affected by the moved blocks.
…o a new local structure called t_placer_timing_update_mode to tidy up the code.
@Bill-hbrhbr
Copy link
Contributor Author

…lysis during placement quench. Possible options are: auto, timing_cost, setup_slack.
@vaughnbetz vaughnbetz requested a review from HackerFoo August 21, 2020 00:30
ClusteredPinTimingInvalidator* pin_timing_invalidator,
SetupTimingInfo* timing_info,
t_placer_costs* costs) {
static void initialize_timing_info(float crit_exponent,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhere in this file (options listing? type or variable definition of slacks and criticalities?) you'll need to detail exactly what a criticality and a slack are, and the difference (criticality is relative between 0 and 1, slack is absolute and can be +ve or negative (unrelaxed, raw slack based on timing constraint I think).

}

/* Update the connection_timing_cost values from the temporary *
* values for all connections that have changed. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain this is part of the process of commiting a move.

@@ -1840,31 +2163,21 @@ static void revert_td_cost(const t_pl_blocks_to_be_moved& blocks_affected) {
//
//Relies on proposed_connection_delay and connection_delay to detect
//which connections have actually had their delay changed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain purpose of the routine (to figure out which connections have had their delays changed so we can incrementally timing analyze). Do we need to call this after every committed move or every proposed move? Explain a bit more about its use here.

@@ -2096,6 +2409,8 @@ static void alloc_and_load_placement_structs(float place_cost_exp,
connection_delay = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, 0.f);
proposed_connection_delay = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, 0.f);

connection_setup_slack = make_net_pins_matrix<float>(cluster_ctx.clb_nlist, std::numeric_limits<float>::infinity());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure there is a comment somewhere on what this data structure is and give an example of how to access it e.g. matrix of setup slacks on all connections indexed from [0..num_nets-1][1..num_pins_on_net-1]. (check if I got that right).
Can comment where the variable is declared or here, depending on what seems like a better spot.

@@ -579,6 +579,23 @@ float calculate_clb_net_pin_criticality(const SetupTimingInfo& timing_info, cons
return clb_pin_crit;
}

//Return the setup slack of a net's pin in the CLB netlist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain that it assumes the timing analysis is up to date (has already been performed with an update call with the right delays).

@@ -100,6 +107,77 @@ PlacerCriticalities::pin_range PlacerCriticalities::pins_with_modified_criticali
return vtr::make_range(cluster_pins_with_modified_criticality_);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are some of these routines rewrites / replacements for routines in place.cpp? I haven't checked line by line but some seem similar.

@vaughnbetz
Copy link
Contributor

Hi Bill,

The code overall looks clean. I've embedded a bunch of requests for comments and feedbacks though.
Themes:

  • Comment each function and new data structure explaining what it does, and how it is to be used (e.g. do delays and/or the slacks on the timing graph have to be up to date before calling specific routines)?
  • There are some long argument lists (which predate you, but are getting even longer). Can you factor into a pointer to a higher-level structure to shorten, and would that be useful? This is a bit of a judgement call, but please take a look.
  • It looks like you've moved some of your code to timing_place.cpp. If you can move as much as possible into timing_place.cpp (or a new file) it would be good, as place.cpp is already too big so avoiding adding to it would be good.

…and try_swap by passing variables using t_annealing_state. Also moved first move_lim determination process to a separate routine
…pe. Moved delay routines to place_delay_model.*. Moved annealing update routines to place_util.*. Enhanced documentations.
@Bill-hbrhbr
Copy link
Contributor Author

Hi @vaughnbetz, I managed to reduce down place.cpp from ~3300 lines to ~2800 lines by moving routines and structures to other files (mainly annealing state variables and timing update routines), and that's with new/enhanced documentation (also in Doxygen style) on a lot of these routines/data structures.
I also shortened the argument list of annealing routines by combining annealing state arguments and costs structures. The file is definitely cleaner than before.
I will continue to update documentation, notably the differences between slacks and criticalities and how to use them. But I will, for now, try not to mess around with try_swap() or placement_inner_loop(), since they are other PRs going on that may affect them.

Copy link
Contributor

@HackerFoo HackerFoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments, but could you rebase your PR to perform refactoring first in one or more commits, then to add features? It's hard to see what you've added vs. what is just moving code around.

@@ -2,6 +2,7 @@
#define VTR_SET_H

#include <vector>
#include <algorithm>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After moving around the code, the compiler gave me an error saying error: 'sort' is not a member of 'std' in L79 of this file. That's why I added it.

* place_global.h. These variables were originally local to the current file. *
* However, they were moved so as to facilitate moving some of the routines *
* in the current file into other source files. *
*******************************************************************************/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like they were exported rather than moved. This seems to expose a lot of details; why not wrap these globals into a singleton (global) object? This will make it easier to duplicate these data structures e.g. to allow parallelism.

Copy link
Contributor Author

@Bill-hbrhbr Bill-hbrhbr Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can wrap them into a single global structure and provide accessor functions for them (so that I don't have to declare them as extern), but I don't think there's a need to duplicate them, as they all serve different purposes and are only used by ```place.cpp`` once per VPR flow. Also, I think the placement flow is rather sequential in nature, so I don't know if tbb multithreading can be used to make the placer go faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After giving it more thought, I think I'll first leave it as it is in the current PR, and then refactor it into something similar to vpr_context.h in a new PR (something like g_place_ctx or g_place_structs). That will reduce the amount of clutter in the code.


*rlim *= (1. - 0.44 + success_rat);
*rlim = std::max(std::min(*rlim, upper_lim), 1.f);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline

extern vtr::vector<ClusterNetId, double> net_timing_cost;
extern vtr::vector<ClusterNetId, t_bb> bb_coords, bb_num_on_edges;
extern vtr::vector<ClusterNetId, t_bb> ts_bb_coord_new, ts_bb_edge_new;
extern std::vector<ClusterNetId> ts_nets_to_update;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline

@Bill-hbrhbr
Copy link
Contributor Author

Hi @HackerFoo, this PR got quite big since I did a clean-up of place.cpp this past few days. All of the new features I've added are contained in the

  1. try_swap() routine.
  2. The new PlacerSetupSlacks class in timing_place.h/cpp
  3. and the update_setup_slacks_and_criticalities() in place_timing_update.h/cpp.

My steps before refactoring place.cpp (also writing this down for future reference):

  1. PlacerSetupSlacks is an imitation of the class PlacerCriticalities, and is utilized to get setup slacks from Kevin's Tatum timing analyzer.

  2. I changed the original recompute_criticalities() to update_setup_slacks_and_criticalities() so that both categories of values can be updated.

  3. One problem now occurs. Originally, each time the timing graph is updated by calling timing_info->update(), the criticalities are updated as well. These updates use incremental techniques so that they only checkout modified atom pins returned by the timing analyzer. However, now the timing graph might be updated only for the setup slacks (without having criticalities updated). On the next iteration of criticalities update, it cannot use the incremental technique anymore, since more pins will have been modified than the ones the timing analyzer currently returns. In this case, the criticalities have to be recomputed from scratch (going through each sink pin).

  4. To resolve this issue, I created the t_timing_update_mode class to keep track of the slack/criticality update status. This structure resides in place_timing_update.h. It has a very long and detailed documentation already, so I guess I will not paste it here.

  5. Now with the timing stuff down, I need to implement the setup slack analysis during the placement quench stage. So I made a vpr option --place_quench_metric to ask the user whether or not to turn on the new analysis technique in the quench.

  6. I modified the try_swap() routine so that, if using the slack analysis, it will do a timing_update to get slacks, analyze, and then decide whether or not to accept the move based on the slack improvement. If accepted, the timing update will be kept; if rejected, I revert the timing update so that the slacks in the timing graph returns to their original values.
    Note that these timing updates will not involve changing criticalities (hence I had the consideration for point ODIN II: dual_port_ram memory depth not bounded #3 and Area calculation error in routing_stats() #4), and I've made a lot of testings to make sure that the timing update reversion works.

I would also like to refactor before adding changes, but right now a lot of my team members are trying to change place.cpp, so getting that into the master might make the work more complicated for others. However, I believe that I am fairly familiar with place.cpp so that I will be able to resolve merge conflicts rather easily after they get their PRs in.

If it is still inconvenient for you to check and make comments, I can open a draft PR at one of my intermediate commits (before the refactorization) so that you can see only the new features being added. The downside to that is that I haven't written/enhanced a lot of my comments by then. Might be harder to interpret, but might not with what I've written down here.

…ays to the placement global file. ALso fixed a bug with in class static constexpr variable compilation issue.
@Bill-hbrhbr
Copy link
Contributor Author

Bill-hbrhbr commented Aug 26, 2020

VTR benchmarks @ commit 74d279c: https://drive.google.com/file/d/1n2Txadj3pXm-CGq6kbEPbdT6hh7RzRb4/view?usp=sharing
Titan benchmarks @ commit 74d279c: https://drive.google.com/file/d/1igBYhXExdT2Uq4JyFeIbWXojWf5NZlS8/view?usp=sharing

The results have slightly degraded somehow. I will run the benchmarks again (with -j # parallel) to see if this is a recurring thing since I highly doubt that it should happen.

On another note, this PR is failing the golden results of benchmarks/blif/4/apex4.blif and benchmarks/blif/4/tseng.blif for arch/bidir/k4_n4_v7_bidir.xml. However, it doesn't fail -check_golden when I run it on Wintermute. So it's kind of strange.

@vaughnbetz
Copy link
Contributor

The QoR results (vtr and titan benchmarks) you link above look like a tie to me (a few metrics marginally up a few marginally down) except perhaps Titan placement time which is up ~10% while pack time is up 4% (so machine load may be a factor, but placement slowed down by more than packing). vtr benchmark placement time looks OK but they're much smaller.
Given the code cleanup and new feature I'd probably take the small placement slowdown for now and look for it later if necessary. If you're able to find something with profiling or other means and claw it back that would be great of course!

The QoR failure on apex4 and tseng is likely just CAD noise. It can fail even if the results are better (but by what we consider an unusual amount). I suggest taking a look at the qor compare and see if it looks out of whack, but on those small circuits if we got a change that's bigger than expected it is probably just noise and we can update the golden when you commit. Not failing on wintermute likely means the qor failure comes and goes with slightly different floating point round-off (due to some library or compiler or OS difference) making a different decision somewhere in the flow.

@Bill-hbrhbr
Copy link
Contributor Author

VTR benchmarks @ commit 9f18666: https://drive.google.com/file/d/1GZHxZpBSBssbBNsPnb0aOdoBGRvmk3ia/view?usp=sharing
Titan benchmarks @ commit 9f18666: https://drive.google.com/file/d/1kmeJmhJ1BDdeA1GV8pDCk7j9FfTAYvHn/view?usp=sharing

I ran this set of benchmarks in a more controlled environment, and now the results make more sense: the pack time and the route time have no degradations.

The place time is slower, as expected. Possibly since:

  1. There are more branchings in the try_swap() routine.
  2. The place.cpp crucial data structures are no longer file-scope.

Copy link
Contributor

@vaughnbetz vaughnbetz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments so far ... (need to read further).

@@ -851,7 +851,8 @@ struct t_annealing_sched {
* doPlacement: true if placement is supposed to be done in the CAD flow, false otherwise */
enum e_place_algorithm {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment what each enum constant chooses.
I also think we should PATH_DRIVEN_TIMING_PLACE to CRITICALITY_TIMING_PLACE (would be more clear; PATH_DRIVEN is no longer very relevant as all our timing data comes from timing paths).

@@ -889,6 +890,12 @@ enum class e_place_delta_delay_algorithm {
DIJKSTRA_EXPANSION,
};

enum class e_place_quench_metric {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment what each enum value does. Instead of TIMING_COST I think we should call the first one TIMING_CRITICALITY (or TIMING_CRITICALITY_COST if you prefer).

SETUP_SLACK,
AUTO
};

struct t_placer_opts {
enum e_place_algorithm place_algorithm;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add comments for each data member here too.

@Bill-hbrhbr
Copy link
Contributor Author

Close this PR so that it becomes archived.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang-cpp C/C++ code libvtrutil VPR VPR FPGA Placement & Routing Tool
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants