Add raw setup slack analysis to placement quench #1501

Merged
merged 28 commits into verilog-to-routing:master
Sep 9, 2020

Conversation

@Bill-hbrhbr Bill-hbrhbr commented Aug 24, 2020

Related Issue

A simplified version of #1450 for easier code review and merging.

Copy of the note in the original PR

Hi @HackerFoo. Please check out this draft PR to see if it's easier for you to review.
New features:

  1. The try_swap() routine, along with the analyze_setup_slack_cost() routine.
  2. The new PlacerSetupSlacks class in timing_place.h/cpp
  3. Three new timing-update routines in place.cpp: update_timing_classes(), update_timing_cost(), and perform_full_timing_update().
  4. VPR option to turn on setup slack analysis during placement quench: --place_quench_metric {auto, timing_cost, setup_slack}

My steps for this PR:

  1. PlacerSetupSlacks is modeled on the PlacerCriticalities class, and is used to obtain setup slacks from Kevin's Tatum timing analyzer.

  2. I changed the original recompute_criticalities() to update_setup_slacks_and_criticalities() so that both categories of values can be updated.

  3. One problem arises here. Originally, each time the timing graph was updated by calling timing_info->update(), the criticalities were updated as well. These updates are incremental: they only visit the modified atom pins returned by the timing analyzer. Now, however, the timing graph might be updated only for the setup slacks (without the criticalities being updated). The next criticalities update then cannot use the incremental technique, since more pins will have been modified than the ones the timing analyzer currently returns. In this case, the criticalities have to be recomputed from scratch (going through each sink pin).

  4. To resolve this issue, I created the t_timing_update_mode class to keep track of the slack/criticality update status (see the sketch after this list). There is no documentation on this structure at this commit.

  5. With the timing infrastructure in place, I implemented the setup slack analysis for the placement quench stage. So I added a VPR option --place_quench_metric that asks the user whether or not to turn on the new analysis technique in the quench.

  6. I modified the try_swap() routine so that, when slack analysis is in use, it performs a timing update to get the slacks, analyzes them, and then decides whether or not to accept the move based on the slack improvement. If the move is accepted, the timing update is kept; if rejected, I revert the timing update so that the slacks in the timing graph return to their original values.
    Note that these timing updates do not involve changing criticalities (hence the considerations in points 3 and 4 above), and I've done a lot of testing to make sure that the timing update reversion works.
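For reference, here is a minimal sketch of the update-mode idea from step 4. The field and function names here are illustrative assumptions, not necessarily those in the PR:

//Hypothetical tracker for whether criticalities/setup slacks can be updated
//incrementally. The timing analyzer only reports the pins modified by the
//*last* update, so any value category that skips an update loses its
//incrementality and must later be recomputed from scratch.
struct t_timing_update_mode {
    bool update_criticalities = true;     //update criticalities on the next timing update?
    bool update_setup_slacks = true;      //update setup slacks on the next timing update?
    bool recompute_criticalities = false; //is a from-scratch recompute needed?
    bool recompute_setup_slacks = false;  //is a from-scratch recompute needed?
};

void on_timing_graph_update(t_timing_update_mode& mode) {
    //If a category skipped this update, the analyzer's modified-pin list
    //no longer covers all of its stale pins, so flag a full recompute.
    mode.recompute_criticalities = !mode.update_criticalities;
    mode.recompute_setup_slacks = !mode.update_setup_slacks;
}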

…ored PlacerCriticalities, and created PlacerSetupSlacks, so that they can choose between doing incremental V.S. from scratch updates.
…pdate. Added checks to see if the updates need to be done from scratch or can be done incrementally
…cks. The matrix update is incremental according to the pins with modified setup slacks returned from PlacerSetupSlacks. Outer loop routine now updates both setup slacks and criticalities, while the inner loop routine passes in variables that determine the strategies/cost functions used to evaluate the effectiveness of try_swap moves.
…a series of successful moves done by try_swap. Right now the data structures representing the state variables are directly being copied, however the process can possibly be optimized with incremental techniques. The snapshot routines are called in the placement's inner loop, and should be used together with VPR options quench_recompute_divider and less optimally inner_recompute_divider. The latter would be too time consuming in practice.
…ng placement snapshots). Currently experiencing consistency failures. Also updated slack analysis cost function: comparing the worse slack change across all modified clb pins
…ing the placement quench stage. Made commit_td_cost method incremental by only going through sink pins affected by the moved blocks.
…o a new local structure called t_placer_timing_update_mode to tidy up the code.
…lysis during placement quench. Possible options are: auto, timing_cost, setup_slack.
@Bill-hbrhbr Bill-hbrhbr requested a review from HackerFoo August 24, 2020 22:26
@Bill-hbrhbr Bill-hbrhbr self-assigned this Aug 24, 2020
@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Aug 24, 2020
std::sort(proposed_setup_slacks.begin(), proposed_setup_slacks.end());

//Check the first pair of slack values that are different
//If found, return their difference
Contributor

What is the justification for this? Why not the difference between the sum of the original and proposed slacks? This needs explanation.

Contributor Author

@Bill-hbrhbr Bill-hbrhbr Aug 28, 2020

Added function level description.

We only check whether the worst value among all the modified values has gotten better or worse, since we only care about the critical path.

Of course, this is a very simple cost formulation. If you have a better idea, I'll be happy to implement it.
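For comparison, the reviewer's sum-based suggestion would look roughly like the following. This is purely illustrative (the function name and signature are assumptions), not something the PR implements:

#include <numeric>
#include <vector>

//Hypothetical alternative cost: compare the *sums* of the original and
//proposed slacks rather than the first differing sorted pair. A negative
//return value means the total slack improved (move accepted).
float sum_based_slack_cost(const std::vector<float>& original_slacks,
                           const std::vector<float>& proposed_slacks) {
    float original_sum = std::accumulate(original_slacks.begin(), original_slacks.end(), 0.0f);
    float proposed_sum = std::accumulate(proposed_slacks.begin(), proposed_slacks.end(), 0.0f);
    return original_sum - proposed_sum;
}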

… PlacerCriticalities and cleaned up related code in placer routines. Enchanced documentation requested by PR comments.
@probot-autolabeler probot-autolabeler bot added the docs Documentation label Aug 28, 2020
Contributor Author

@Bill-hbrhbr Bill-hbrhbr left a comment

Hi @HackerFoo and @vaughnbetz, first of all, thank you for your detailed and valuable reviews! I've implemented what you've suggested and highlighted the crucial routines in this self-review. Please go over these functions, especially their documentation.
After I gather your approvals, as well as satisfactory benchmark results, I will merge this PR into master and start adding more documentation and refactoring place.cpp.

Edit: will also add a reg test.

Comment on lines +1085 to +1126
/**
* @brief Update timing information based on the current block positions.
*
* Run STA to update the timing info class.
*
* Update the values stored in PlacerCriticalities and PlacerSetupSlacks
* if they are enabled to update. To enable updating, call their respective
* enable_update() method. See their documentation for more detailed info.
*
* If criticalities are updated, the timing driven costs should be updated
* as well by calling update_timing_cost(). Calling this routine to update
* timing_cost will produce round-off error in the long run due to its
* incremental nature, so the timing cost value will be recomputed once in
* a while, via other timing driven routines.
*
* If setup slacks are updated, then normally they should be committed to
* `connection_setup_slack` via the commit_setup_slacks() routine. However,
* sometimes new setup slack values are not committed immediately if we
* expect to revert the current timing update in the near future, or if
* we wish to compare the new slack values to the original ones.
*
* All the pins with changed connection delays have already been added into
* the ClusteredPinTimingInvalidator to allow incremental STA update. These
* changed connection delays are a direct result of moved blocks in try_swap().
*/
static void update_timing_classes(float crit_exponent,
                                  SetupTimingInfo* timing_info,
                                  PlacerCriticalities* criticalities,
                                  PlacerSetupSlacks* setup_slacks,
                                  ClusteredPinTimingInvalidator* pin_timing_invalidator) {
    /* Run STA to update slacks and adjusted/relaxed criticalities. */
    timing_info->update();

    /* Update the placer's criticalities (e.g. sharpen with crit_exponent). */
    criticalities->update_criticalities(timing_info, crit_exponent);

    /* Update the placer's raw setup slacks. */
    setup_slacks->update_setup_slacks(timing_info);

    /* Clear invalidation state. */
    pin_timing_invalidator->reset();
}
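A sketch of the intended calling protocol during the quench, assuming a disable_update() counterpart to the enable_update() method mentioned above (the exact call sites in place.cpp may differ):

//Refresh only the setup slacks, leaving criticalities untouched. The skipped
//criticalities will then need a from-scratch recompute on their next update.
criticalities->disable_update();
setup_slacks->enable_update();
update_timing_classes(crit_exponent, timing_info, criticalities,
                      setup_slacks, pin_timing_invalidator);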
Contributor Author

Important documentation for review

Comment on lines +1150 to 1182
/**
* @brief Updates every timing-related class, variable, and structure.
*
* This routine exists to reduce code duplication, as the placer routines
* often need to update all timing-related state at once.
*
* Updates: SetupTimingInfo, PlacerCriticalities, PlacerSetupSlacks,
* timing_cost, connection_setup_slack.
*/
static void perform_full_timing_update(float crit_exponent,
                                       const PlaceDelayModel* delay_model,
                                       PlacerCriticalities* criticalities,
                                       PlacerSetupSlacks* setup_slacks,
                                       ClusteredPinTimingInvalidator* pin_timing_invalidator,
                                       SetupTimingInfo* timing_info,
                                       t_placer_costs* costs) {
    /* Update all timing related classes. */
    criticalities->enable_update();
    setup_slacks->enable_update();
    update_timing_classes(crit_exponent,
                          timing_info,
                          criticalities,
                          setup_slacks,
                          pin_timing_invalidator);

    /* Update the timing cost with new connection criticalities. */
    update_timing_cost(delay_model,
                       criticalities,
                       &costs->timing_cost);

    /* Commit the setup slacks since they are updated. */
    commit_setup_slacks(setup_slacks);
}
Contributor Author

Important documentation for review

Comment on lines 1958 to 2009
/**
* @brief Check if the setup slack has gotten better or worse due to a block swap.
*
* Get all the modified slack values via the PlacerSetupSlacks class, and compare
* them with the original values at these connections. Sort both sets, compare them
* pair by pair, and return the difference of the first pair that differs.
*
* If the new slack value is larger (better), then return a negative value so that
* the move will be accepted. If the new slack value is smaller (worse), return a
* positive value so that the move will be rejected.
*
* If no slack values have changed, then return an arbitrary positive number. A
* move resulting in no change to the slack values is probably unnecessary.
*
* The sorting is needed to handle the unlikely circumstance where a bad slack
* value suddenly becomes very good due to the block move, while a good slack
* value becomes very bad, perhaps even worse than the original worst slack value.
*/
static float analyze_setup_slack_cost(const PlacerSetupSlacks* setup_slacks) {
    const auto& cluster_ctx = g_vpr_ctx.clustering();
    const auto& clb_nlist = cluster_ctx.clb_nlist;

    //Find the original/proposed setup slacks of pins with modified values
    std::vector<float> original_setup_slacks, proposed_setup_slacks;

    auto clb_pins_modified = setup_slacks->pins_with_modified_setup_slack();
    for (ClusterPinId clb_pin : clb_pins_modified) {
        ClusterNetId net_id = clb_nlist.pin_net(clb_pin);
        size_t ipin = clb_nlist.pin_net_index(clb_pin);

        original_setup_slacks.push_back(connection_setup_slack[net_id][ipin]);
        proposed_setup_slacks.push_back(setup_slacks->setup_slack(net_id, ipin));
    }

    //Sort in ascending order, from the worst slack value to the best
    std::sort(original_setup_slacks.begin(), original_setup_slacks.end());
    std::sort(proposed_setup_slacks.begin(), proposed_setup_slacks.end());

    //Check the first pair of slack values that are different
    //If found, return their difference
    for (size_t idiff = 0; idiff < original_setup_slacks.size(); ++idiff) {
        float slack_diff = original_setup_slacks[idiff] - proposed_setup_slacks[idiff];

        if (slack_diff != 0) {
            return slack_diff;
        }
    }

    //If all slack values are identical (or no modified slack values),
    //reject this move by returning an arbitrary positive number as cost.
    return 1;
}
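A small worked example of the sort-and-compare rule, using hypothetical slack values in seconds (the wrapper function exists only for illustration):

#include <algorithm>
#include <vector>

void worked_example() {
    std::vector<float> original = {1e-9f, -5e-9f, 3e-9f};
    std::vector<float> proposed = {3e-9f, 0.5e-9f, -4e-9f};
    std::sort(original.begin(), original.end()); //{-5e-9, 1e-9, 3e-9}
    std::sort(proposed.begin(), proposed.end()); //{-4e-9, 0.5e-9, 3e-9}
    //First differing (sorted) pair: -5e-9 vs -4e-9, so the returned cost is
    //-5e-9 - (-4e-9) = -1e-9 < 0: the worst modified slack improved,
    //and the move is accepted.
}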
Contributor Author

Important documentation for review

Comment on lines 2112 to 2140
/**
* @brief Commit all the setup slack values from the PlacerSetupSlacks
* class to `connection_setup_slack`.
*
* This routine is incremental since it relies on the pins_with_modified_setup_slack()
* to detect which pins need to be updated and which pins do not.
*
* Therefore, it is assumed that this routine is always called immediately after
* each time update_timing_classes() is called with setup slack update enabled.
* Otherwise, pins_with_modified_setup_slack() cannot accurately account for all
* the pins that have their setup slacks changed, making this routine incorrect.
*
* Currently, the only exception to the rule above is when setup slack analysis is used
* during the placement quench. The new setup slacks might be either accepted or
* rejected, so for efficiency reasons, this routine is not called if the slacks are
* rejected in the end. For more detailed info, see the try_swap() routine.
*/
static void commit_setup_slacks(const PlacerSetupSlacks* setup_slacks) {
    const auto& clb_nlist = g_vpr_ctx.clustering().clb_nlist;

    //Incremental: only go through sink pins with modified setup slack
    auto clb_pins_modified = setup_slacks->pins_with_modified_setup_slack();
    for (ClusterPinId pin_id : clb_pins_modified) {
        ClusterNetId net_id = clb_nlist.pin_net(pin_id);
        size_t pin_index_in_net = clb_nlist.pin_net_index(pin_id);

        connection_setup_slack[net_id][pin_index_in_net] = setup_slacks->setup_slack(net_id, pin_index_in_net);
    }
}
Contributor Author

Important documentation for review

Comment on lines +2142 to +2165
/**
* @brief Verify that the values in `connection_setup_slack` match those in PlacerSetupSlacks.
*
* Return true if all connection values are identical. Otherwise, return false.
*
* Currently, this routine is called to check if the timing update has been successfully
* reverted after a proposed move is rejected when applying setup slack analysis during
* the placement quench. If successful, the setup slacks in PlacerSetupSlacks should be
* the same as the values in `connection_setup_slack` without running commit_setup_slacks().
* For more detailed info, see the try_swap() routine.
*/
static bool verify_connection_setup_slacks(const PlacerSetupSlacks* setup_slacks) {
    const auto& clb_nlist = g_vpr_ctx.clustering().clb_nlist;

    //Go through every single sink pin to check that the slack values are the same
    for (ClusterNetId net_id : clb_nlist.nets()) {
        for (size_t ipin = 1; ipin < clb_nlist.net_pins(net_id).size(); ++ipin) {
            if (connection_setup_slack[net_id][ipin] != setup_slacks->setup_slack(net_id, ipin)) {
                return false;
            }
        }
    }
    return true;
}
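Putting these routines together, the quench-time accept/reject flow described in the comments might look roughly like this. It is a sketch under the assumption, stated in the PR discussion, that re-running the incremental update after the block positions and connection delays have been restored brings the slacks back to their original values; VTR_ASSERT_SAFE is VTR's safe-level assertion macro, and the real control flow lives in try_swap():

//Evaluate the proposed move with fresh setup slacks.
update_timing_classes(crit_exponent, timing_info, criticalities,
                      setup_slacks, pin_timing_invalidator);
float slack_cost = analyze_setup_slack_cost(setup_slacks);

if (slack_cost < 0) {
    /* Accepted: keep the new timing state and commit the new slacks. */
    commit_setup_slacks(setup_slacks);
} else {
    /* Rejected: after restoring block positions/connection delays and
     * re-invalidating the affected pins, re-run the incremental update so
     * the slacks in the timing graph return to their original values. */
    update_timing_classes(crit_exponent, timing_info, criticalities,
                          setup_slacks, pin_timing_invalidator);
    VTR_ASSERT_SAFE(verify_connection_setup_slacks(setup_slacks));
}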
Contributor Author

Important documentation for review

Comment on lines 159 to 176
/**
* @brief PlacerSetupSlacks returns the RAW setup slacks of clustered netlist connections.
*
* Usage
* =====
* PlacerSetupSlacks mimics a 2D array of connection setup slacks running from:
* [0..cluster_ctx.clb_nlist.nets().size()-1][1..num_pins-1]
*
* This class mirrors PlacerCriticalities in both its methods and its members. The only
* difference is that this class deals with RAW setup slacks returned by SetupTimingInfo
* rather than criticalities. See the documentation on PlacerCriticalities for more.
*
* RAW setup slacks are unlike criticalities: their values are not confined between
* 0 and 1, and they can be either positive or negative.
*
* This class also supports iterating over the clustered netlist connections/pins whose
* setup slacks were modified by the last call to update_setup_slacks(). This utility
* is mainly used for incrementally committing the setup slack values into the
* structure `connection_setup_slack` used by many placer routines.
*/
class PlacerSetupSlacks {
Contributor Author

Important documentation for review

Comment on lines 46 to 86

/**
* @brief PlacerCriticalities returns the clustered netlist connection criticalities
* used by the placer ('sharpened' by a criticality exponent).
*
* Usage
* =====
* This class also serves to map atom netlist level criticalities (i.e. on AtomPinIds)
* to the clustered netlist (i.e. ClusterPinIds) used during placement.
*
* Criticalities are updated by update_criticalities(), given that `update_enabled` is
* set to true. It will update criticalities based on the atom netlist connection
* criticalities provided by the passed in SetupTimingInfo.
*
* This process can be done incrementally, based on the modified connections/AtomPinIds
* returned by SetupTimingInfo. However, the set returned only reflects the connections
* changed by the last call to the timing info update.
*
* Therefore, if SetupTimingInfo is updated twice in succession without criticalities
* getting updated (update_enabled = false), the returned set cannot account for all
* the connections that have been modified, in which case a recomputation is required.
* Hence, each time update_setup_slacks_and_criticalities() is called, we assign
* `recompute_required` the opposite value of `update_enabled`.
*
* This class also maps/transforms the modified atom connections/pins returned by the
* timing info into modified clustered netlist connections/pins after calling
* update_criticalities(). The interface then enables users to iterate over this range
* via pins_with_modified_criticalities(). This is useful for incrementally re-calculating
* the timing costs.
*
* The criticalities of individual connections can then be queried by calling the
* criticality() member function.
*
* Implementation
* ==============
* To support incremental re-calculation, the class saves the last criticality exponent
* passed to PlacerCriticalities::update_criticalities(). If the next update uses the same
* exponent, criticalities can be incrementally updated. Otherwise, they must be re-calculated
* from scratch, since a change in exponent changes *all* criticalities.
*/
class PlacerCriticalities {
Contributor Author

Important documentation for review

Comment on lines +585 to +599
/**
* @brief Returns the raw setup slack of a net's pin in the CLB netlist.
* Assumes that the timing graph is correct and up to date.
*/
float calculate_clb_net_pin_setup_slack(const SetupTimingInfo& timing_info, const ClusteredPinAtomPinsLookup& pin_lookup, ClusterPinId clb_pin) {
    //There may be multiple atom netlist pins connected to this CLB pin
    float clb_pin_setup_slack = std::numeric_limits<float>::infinity();
    for (const auto atom_pin : pin_lookup.connected_atom_pins(clb_pin)) {
        //Take the worst/minimum of the atom pin slacks as the CLB pin slack
        clb_pin_setup_slack = std::min(clb_pin_setup_slack, timing_info.setup_pin_slack(atom_pin));
    }

    return clb_pin_setup_slack;
}

Contributor Author

Important documentation for review

@Bill-hbrhbr Bill-hbrhbr marked this pull request as ready for review August 28, 2020 08:52
Comment on lines +1128 to +1148
/**
* @brief Update the timing driven (td) costs.
*
* This routine either uses the incremental update_td_costs(), or updates
* from scratch using comp_td_costs(). By default, it is incremental,
* iterating over the set of clustered netlist connections/pins
* returned by PlacerCriticalities::pins_with_modified_criticality().
*
* Hence, this routine should always be called when PlacerCriticalities
* is enabled for updating in update_timing_classes(). Otherwise, the
* incremental method will no longer be correct.
*/
static void update_timing_cost(const PlaceDelayModel* delay_model,
                               const PlacerCriticalities* criticalities,
                               double* timing_cost) {
#ifdef INCR_COMP_TD_COSTS
    update_td_costs(delay_model, *criticalities, timing_cost);
#else
    comp_td_costs(delay_model, *criticalities, timing_cost);
#endif
}
Contributor Author

Important doc for review

@Bill-hbrhbr Bill-hbrhbr changed the title from "Setup slack analysis in placement quench draft" to "Add raw setup slack analysis to placement quench" Aug 28, 2020
…lacement algorithm to be used during quench.
…thm class to efficiently include SLACK_TIMING_PLACE in all branchings related to CRITICALITY_TIMING_PLACE.
…LACK_TIMING_PLACE during placement quench would succeed.
*
* Also supports assignments and comparisons between t_place_algorithm
* and e_place_algorithm so as not to break existing code.
*/
Contributor

What is the justification for this over the following?

bool is_timing_driven(e_place_algorithm algo) {
  return algo == CRITICALITY_TIMING_PLACE || algo == SLACK_TIMING_PLACE; 
}

What methods do you propose should be added to this class? Do they improve the current code? If so, please add them.

Contributor Author

  1. The timing driven check is performed in several places other than place.cpp, so I incorporated this check into placer_opts.
  2. There might be future categorizations of placement algorithms/strategies. Currently, there are only two categories (timing driven and non timing driven). Having this class will make it easier to describe and implement new categorizations in the future (see the sketch below). If you don't think this is necessary, I'll revert the change.
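A sketch of what such a wrapper might look like; the exact members of the class in this PR may differ:

//Hypothetical wrapper bundling the placement algorithm enum with category
//queries, while staying assignable/comparable against the raw enum.
class t_place_algorithm {
  public:
    t_place_algorithm() = default;
    t_place_algorithm(e_place_algorithm algo)
        : algo_(algo) {}

    t_place_algorithm& operator=(e_place_algorithm algo) {
        algo_ = algo;
        return *this;
    }
    bool operator==(e_place_algorithm algo) const { return algo_ == algo; }

    //Category query: both criticality- and slack-based placement are timing driven.
    bool is_timing_driven() const {
        return algo_ == CRITICALITY_TIMING_PLACE || algo_ == SLACK_TIMING_PLACE;
    }

    e_place_algorithm get() const { return algo_; }

  private:
    e_place_algorithm algo_ = CRITICALITY_TIMING_PLACE;
};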

MoveGenerator& move_generator,
t_pl_blocks_to_be_moved& blocks_affected,
SetupTimingInfo* timing_info,
const t_place_algorithm& place_algorithm) {
Contributor

This overrides placer_opts? It's not clear how this works.

Contributor Author

placement_inner_loop() is called for both the normal annealing stages and the quench. The normal annealing now uses placer_opts.place_algorithm, while the quench uses placer_opts.place_quench_algorithm. Instead of creating a separate function, I pass in a new parameter, place_algorithm, which overrides placer_opts.place_algorithm.

PlacerCriticalities* criticalities,
PlacerSetupSlacks* setup_slacks,
Contributor

criticalities and setup_slacks should be bundled together because they are used together.

Contributor Author

I want to do this in another PR, where I will refactor place.cpp. But if you want it done here, I will implement it. On a side note, how would you prefer to name this new structure?

…ocumentation for options --place_algorithm and --place_quench_algorithm (same as the documentation on developer's page.
@probot-autolabeler probot-autolabeler bot added tests VTR Flow VTR Design Flow (scripts/benchmarks/architectures) labels Sep 4, 2020
void update_criticalities(const SetupTimingInfo* timing_info, float criticality_exponent);

///@brief Override the criticality of a particular connection.
void set_criticality(ClusterNetId net, int ipin, float val);
Contributor

I'd change val to crit_val, and ipin to inet_pin_index (or add a comment on what ipin is -- I think it is an index on that net from 0 to fanout).

Contributor Author

@Bill-hbrhbr Bill-hbrhbr Sep 5, 2020

Added an assertion (level 3) check with an error message for ipin in timing_place.cpp.
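The added check is presumably along these lines, a sketch assuming VTR's leveled assertion macros (VTR_ASSERT_SAFE_MSG corresponds to assertion level 3); the actual bounds and message live in timing_place.cpp:

//ipin is an index into the net's pins: 0 is the driver, 1..fanout are sinks.
VTR_ASSERT_SAFE_MSG(ipin > 0 && ipin < int(clb_nlist.net_pins(net).size()),
                    "Pin index in net must correspond to a sink pin");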

@vaughnbetz
Contributor

Thanks Bill. The new code looks very clean and the commenting enhancements throughout are very clear. I have two more files to read (timing_place.cpp and place.cpp, so big ones). Will submit comments on those as soon as I get through them ...

@Bill-hbrhbr
Contributor Author

Hi @vaughnbetz, thank you for taking the time to look at the new changes. I will send you an e-mail soon regarding any leftover concerns for this PR.

///@brief Use an incremental approach to updating criticalities and setup slacks?
constexpr bool INCR_UPDATE_CRITICALITIES = true;
Contributor

Explain what this means. Does it control incremental update_timing calls, or recomputing only the criticalities and setup slacks for connections whose slack changed in the last timing analysis, or both?

Contributor Author

I think it is safe to remove these flags. We can use incremental updates whenever it's possible now.

void PlacerCriticalities::incr_update_criticalities(const SetupTimingInfo* timing_info) {
    cluster_pins_with_modified_criticality_.clear();

    for (AtomPinId atom_pin : timing_info->pins_with_modified_setup_criticality()) {
Contributor

This code and the code to find the pins_with_modified_setup_slack is almost the same, except one uses the timing_info->pins_with_modified_setup_criticality() and the other uses timing_info->pins_with_modified_setup_slack().

  1. Do the timing_info->pins_with_modified_setup_criticality() and pins_with_modified_setup_slack() differ? Why? I expect they are close to the same; could we just use whichever one is the bigger set (maybe criticality, due to slack shifting) to track both slack and criticality pins that need updates? Would simplify the code and likely make it faster if we are using both criticalities and slacks, if indeed the pin update lists are usually the same (less work and less memory footprint).

Contributor

If we can just use the bigger of pins_with_modified_criticality_ or pins_with_modified_setup_slack, some data members and some duplicated code can be removed.

Contributor Author

I don't recommend doing this, since they are pre-computed (in slack_evaluation.cpp) and are not necessarily the same. The effort that goes into merging these two sets is not worth it in my opinion, and it might make the code less readable. I agree with your feeling that a lot of code is duplicated, but I don't have a good way of resolving the issue without sacrificing code readability.

float rlim_escape_fraction,
const t_place_algorithm& place_algorithm,
Contributor

It's possible that adding arguments to try_swap and the indirection (const ref) to place_algorithm has slowed it down some. I wouldn't expect much impact, but try_swap is called hundreds of millions of times for big circuits. So that's one commit to explore as you binary search for the slowdown.

Contributor Author

The slowdown already occurred in the commit before I implemented this algorithm wrapper, so this is likely not the cause of the slowdown.

void recompute_setup_slacks();

///@brief Flag that turns on/off the update_setup_slacks() routine.
bool update_enabled = true;
Contributor

Does this get turned off by an explicit call in place.cpp if the place algorithm doesn't need setup slacks? If not, could be the cause of the CPU increase.

Contributor Author

@Bill-hbrhbr Bill-hbrhbr Sep 5, 2020

I turned it off during my runtime experiments, and the results showed that this isn't the cause of the runtime increase.

float rlim_escape_fraction,
const t_place_algorithm& place_algorithm,
float timing_tradeoff) {
/* Picks some block and moves it to another spot. If this spot is *
Contributor

This comment should go in front of the function header in the refactored code (you probably did that already).

Contributor Author

I will do this in the subsequent PR.

//Go through all the sink pins affected
for (ClusterPinId pin_id : blocks_affected.affected_pins) {
    ClusterNetId net_id = clb_nlist.pin_net(pin_id);
    int ipin = clb_nlist.pin_net_index(pin_id);
Contributor

This is one of the few code differences in the base (CRITICALITY_PLACE) algorithm in the new code vs. the old code. Code looks fine (cleaner), and affected_pins is pretty much the same pin list the other code would create dynamically. One difference: the old code did not update the timing cost if net_is_ignored(net_id) -- you could try putting that back in. Doesn't seem like it would make much difference but may be worth a try.

Contributor Author

@Bill-hbrhbr Bill-hbrhbr Sep 5, 2020

There's no difference, due to the way affected_pins is gathered.
This structure is populated in try_swap() by calling find_affected_nets_and_update_costs(), which calls update_td_delta_costs().
The ignored nets are skipped in find_affected_nets_and_update_costs().

@vaughnbetz
Contributor

The code looks good Bill. I have no major concerns; some detailed comments (mostly about small bits of additional commenting). I only identified one small difference between the original CRITICALITY_PLACE algorithm and the new code, in commit_td_cost() (ignored nets aren't ignored). You could try inserting the appropriate if to see if it speeds things up; seems unlikely, but easy to try.
If you can quickly go through my comments I think we can get this merged. After that, looking for the 7% slowdown would probably be best done with a binary search of commits; I don't see any obvious cause of that slowdown either.

@verilog-to-routing verilog-to-routing deleted a comment from vaughnbetz Sep 5, 2020
… description for three routines in place.cpp: find_affected_sink_pins(), commit_td_cost(), invalidate_affected_connection_delays().
@Bill-hbrhbr
Contributor Author

Hi @vaughnbetz, I fixed all the issues raised by your comments in the most recent commit. Some comments are left unresolved because they needed more explanation on my side. Also, I added function-level headers for find_affected_sink_pins(), commit_td_cost(), and invalidate_affected_connection_delays(). Please go through the comments/documentation and see if you have more concerns.

Note: the comment I deleted was not essential.

@Bill-hbrhbr
Contributor Author

Hi @vaughnbetz , I finally found the culprit. The runtime increase is due to splitting out the routine find_affected_sink_pins() from the original invalidate_affected_connection_delays(). However, the vector populated by find_affected_sink_pins() is useful if we wish to revert a timing analysis step. I'll think of a way to take care of this issue.

@vaughnbetz
Contributor

Excellent, thanks Bill. The changes look good. There is one build failure in Travis CI: there are a few warnings in timing_place.cpp. Can you get rid of those?
The nightly test failure is a QoR failure on one circuit (unrelated); Sarah already has the action to update the golden result for it, so it is not gating.
For the slowdown, do you prefer to merge this PR and then refactor to speed up, or wait for the speed up?

@Bill-hbrhbr
Contributor Author

Bill-hbrhbr commented Sep 7, 2020

Hi @vaughnbetz, I can resolve the slowdown in this PR. I think the runtime increase is due to the vector::push_back() calls in find_affected_sink_pins() requiring a lot of dynamic reallocations.

I'm currently trying out reserving/pre-allocating enough capacity for the vector to see if that alone makes the runtime increase go away. I don't really want to remove find_affected_sink_pins() yet, since it is useful for reverting a timing analysis if we are using slack analysis in the placement inner loop.

If the vector::reserve() does not work, I'll find a way to skip find_affected_sink_pins() if slack timing analysis is not used. But either way, the code change is not going to be drastic, so I'll do it in this PR. I'm currently running Titan benchmarks on the new code, and this PR will get finished by tomorrow for sure.
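The fix being tried is essentially the following; the capacity bound shown here is an assumed placeholder, not the PR's actual heuristic:

#include <vector>

//Pre-allocate once so the per-move push_back() calls in
//find_affected_sink_pins() stop triggering repeated reallocations
//across hundreds of millions of try_swap() invocations.
std::vector<ClusterPinId> affected_sink_pins;
affected_sink_pins.reserve(max_affected_sink_pins); //hypothetical upper bound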

@vaughnbetz
Contributor

Great, thanks Bill!

…rease. blocks_affected.affected_pins now only stores moved pins that have changed connection delays.
@vaughnbetz
Contributor

Thanks Bill. Can you launch a QoR run to make sure all is well, and the placer sped back up?
Also, please check that blocks_affected.affected_pins does not affect the wirelength calculation (update_bb), since if it did, we could get a bad wirelength cost estimate by skipping pins with unchanged delay.

@vaughnbetz
Contributor

Just checked the update_bb routines and they don't use .affected_pins so that is fine.

@Bill-hbrhbr
Contributor Author

Bill-hbrhbr commented Sep 8, 2020

Titan benchmark #run001 @commit 06c9cf6: https://drive.google.com/file/d/1kduvV2um0rjZRi7JVNZGHwCBIMwlATe5/view?usp=sharing

There is some fluctuation in place time so I'm running a second round of Titan to confirm. But it's pretty obvious that the source of the slowdown is gone (and it's completely gone in my multiple test runs on experimental code changes).

@Bill-hbrhbr
Contributor Author

Bill-hbrhbr commented Sep 8, 2020

Titan benchmark #run002 @commit 06c9cf6: https://drive.google.com/file/d/1nTu_tpMZdKigYwwJxhS5Go0eUXeudbnx/view?usp=sharing

There is still a tiny degradation, so I decided to look into the VPR output files for the bitcoin_miner benchmark. Specifically, I'm looking at the inner placement loop log.

Master: master.txt
New: PR#1501.txt

I see that there are multiple random spikes in placement iteration runtime stats in both files. Also, I (greedily) ran with the option -j8 (so there would be 16 concurrent threads since I always run two compared benchmarks simultaneously). Hence, I think the runtime fluctuation is mainly due to the load on Wintermute, rather than the code.

To be more clear, before removing find_affected_sink_pins(), I saw a consistent overhead in each placement iteration in the log file, but right now that is not the case anymore.

And even if the code is indeed a bit slower, here are my 4 arguments:

  1. It's a negligible and inconsistent overhead.
  2. It might be due to the way the code is compiled with the added branching in the method update_td_delta_cost(). The benefits brought by further micro-optimizations are not worth the time and effort put into them.
  3. blocks_affected.affected_pins should have fewer elements than before, which can potentially speed up other parts of the code, such as invalidate_affected_connections(). This is what I hoped for, but it didn't happen.
  4. Most importantly, the current code (with the modified blocks_affected.affected_pins) is the most elegant way I can think of to incorporate the slack timing analysis option and one-step timing analysis reversion.

@vaughnbetz
Contributor

Thanks for tracking this down Bill. I agree this is close enough (1.5% and 2% placer runtime impact in the two runs, which is getting pretty close to noise). The code is more capable and cleaner, so good to have.

@vaughnbetz vaughnbetz merged commit d32ea3e into verilog-to-routing:master Sep 9, 2020
@vaughnbetz
Contributor

(FYI, the one regtest failure is a known qor compare update issue; Sarah has updated it on the master already).
