Skip to content

Merging RCV branch into master #1395

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 94 commits into from
Sep 10, 2020
Merged

Merging RCV branch into master #1395

merged 94 commits into from
Sep 10, 2020

Conversation

dpbaines
Copy link
Contributor

@dpbaines dpbaines commented Jul 3, 2020

Description

This PR is a work in progress and still has a significant overhead on top of the regular VPR flow

Related Issue

Motivation and Context

How Has This Been Tested?

I tested this in a variety of scenarios on both the titan benchmark suite, as well as qor reg chain. Results of the data are located here:

UPDATED TITAN: https://docs.google.com/spreadsheets/d/10omM-uNtj-8o-3DnEfGh7C6fznX8va63Y85yTmocNqE/edit?usp=sharing
UPDATED QOR CHAIN: https://docs.google.com/spreadsheets/d/1pJdYBgbMMMudm0-0STI7htba0uHOeQ7VXBu1NcMzAME/edit?usp=sharing

As well I analyzed the results of which in a document here:
https://docs.google.com/document/d/1mZoHqO3H5mYvD44dt2-Ai-eIc4rlrHm4dhSMOlLKmqo/edit?usp=sharing

In summary, when RCV is turned off there should be minimal runtime impact due to the extra if statements in the core router code. I've tried to minimize this in a number of ways.

When RCV is turned on, there is a more significant impact on runtime, which is mostly caused by the route budgeting. However even in a lot of beyond worst case hold cases, it still is able to resolve intercluster hold violations mostly. There is still work to be done for larger circuits, as well as possibly dealing with intra cluster hold violations.

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code tests VPR VPR FPGA Placement & Routing Tool VTR Flow VTR Design Flow (scripts/benchmarks/architectures) labels Jul 3, 2020
@vaughnbetz
Copy link
Contributor

Thanks David.
Can you attach Th achieved results for the degraded architcture (VTR, 500 ps Th) with and without RCV? I see the cost of RCV, but not the benefit in the data so far since no Th is included.
QoR failures are all on bidir architectures (old), and are on circuit that got much faster in strong and one that got significantly slower in nightly. Seems OK.
The Titan and vtr qor results look OK so far. RCV is failing on larger circuits, so there is probably a problem in the algorithm somewhere but we can tune that after this PR.

/* Extra path data needed by RCV, seperated from t_heap struct for performance reasons
* Can be accessed by a pointer, won't be initialized unless needed
*
* path_rr: The entire partial path up until the route tree
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain what this is. Entire partial path until we reach either the net SOURCE or a part of the routing tree that has already been created for other connections for this net.

@@ -2009,7 +2013,7 @@ argparse::ArgumentParser create_arg_parser(std::string prog_name, t_options& arg
" * criticality: Sets the minimum budgets to 0 and the maximum budgets as a function of delay and criticality (net delay/ pin criticality) [EXPERIMENTAL]\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This help on options doesn't seem to match the option values below. Can you update it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing, a simple next PR could be to give RCV it's own distinct option in the router, but for now I'll update this.

@@ -1017,6 +1017,7 @@ enum e_routing_failure_predictor {
};
enum e_routing_budgets_algorithm {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment here on what this controls. Either describe the options or refer back to command option help.

PathManager();
~PathManager();

// Checks if the target node exists in the route tree, or current backwards path variable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a comment that explains how RCV can create loops, and that two tests are necessary to prevent loops (check the partial route created for other connections on the net (route_tree_set) and check the prior nodes on the path that is being expanded/explored for a given heap element (partial path) for the current connection. This latter part needs to be stored for every partial path (heap element).

If you don't have such an overview comment, you should add it, probably near the top of route_path_manager.h.

@vaughnbetz
Copy link
Contributor

Did a quick review of the code, which generally looks good. I added a few requests for comments / command line help above.
If you don't have high-level comment blocks on (1) why we need a path manager and how it works (mentioned above) and (2) how RCV works, you should add them. If they exist already, can you point me at them?

@dpbaines
Copy link
Contributor Author

dpbaines commented Sep 8, 2020

@vaughnbetz Hi, as per the Thold values for RCV on and off, they are in the spreadsheet under the Formatted Raw Data section. I included both total hold and worst case hold slacks for each test. Sorry that wasn't noted somewhere!

As per the failures on larger circuits, I can look into whats causing them for a future pull request. Thanks for looking over this, I'll make the changes you suggested right now.

MINIMAX,
YOYO,
MINIMAX, // Use MINIMAX-PERT algorithm to allocate budgets
YOYO, // Use MINIMAX as above, and enable RCV algorithm to resolve negative hold slack
SCALE_DELAY,
DISABLE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should comment DISABLE too as it's an important one: currently turns off RCV and delay budgets and uses default router.

* Having these in t_heap creates significant performance issues when RCV is disabled
* A t_heap_path pointer is instead stored in t_heap, which is selectively allocated only when RCV is enabled
*
* If the _is_enabled flag is true, alloc_path_struct allocates t_heap_path structures, otherwise will be a NOOP */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice comment -- thanks!

Copy link
Contributor

@vaughnbetz vaughnbetz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, new comments look good. One more request embedded above.
Do you know if the circuits that fail with RCV run out of memory or unroute? Good to describe that here, as well as the experiment conditions (e.g. I think the vtr circuits channel width was 300 or 400 with RCV, but am not sure of the value).

@dpbaines
Copy link
Contributor Author

dpbaines commented Sep 8, 2020

@vaughnbetz For the QOR reg tests, the two circuits that failed for RCV ON were LU8PEEng and LU32PEEng, which didn't have a memory issue. They failed because there was still negative hold slack, and RCV didn't detect that this was probably being caused by global/ignored nets. This is a tunable fix, because I do have a function that is supposed to detect this.

For the Titan tests the following circuits are reported as failed for the following reasons:

Gaussian blur: This failed because I stopped it early as it was taking all weekend, I did keep it running and it actually passed. I can update the doc to include this.

LU_Network: This failed the following assertion:

## Initializing router criticalities took 21.20 seconds (max_rss 9526.5 MiB, delta_rss +90.6 MiB)
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Iter   Time    pres  BBs    Heap  Re-Rtd  Re-Rtd Overused RR Nodes      Wirelength      CPD       sTNS       sWNS       hTNS       hWNS Est Succ
      (sec)     fac Updt    push    Nets   Conns                                       (ns)       (ns)       (ns)       (ns)       (ns)     Iter
---- ------ ------- ---- ------- ------- ------- ----------------- --------------- -------- ---------- ---------- ---------- ---------- --------
Warning 16817: 62356 timing endpoints were not constrained during timing analysis
   1   80.6     0.0    0 6.1e+08  404083  741539  432510 ( 2.904%) 5647923 (26.7%)    7.316 -2.113e+06     -7.316      0.000      0.000      N/A
/home/bainesda/Documents/vtr-verilog-to-routing/vpr/src/timing/PostClusterDelayCalculator.tpp:352 atom_net_delay: Assertion !std::isnan(edge_delay.value())' failed.
Command terminated by signal 6

I haven't encountered this before. This appears to fail during the route budgeting stage with a weird call to tatum. This seems like it's probably an easy fix... I will look into this right now

By the way I'm going to update the golden results for strong_bidir tests.

@vaughnbetz
Copy link
Contributor

Thanks David.

  1. For RCV on, degraded Th, there are four circuits with no results in the clean comparison to maser. But 2 of those (mcml and stereovision0) seem to have completed (but failed hold). Is this just a parsing issue? If it's easy to fill in the missing numbers it would be good to add them in the tables and the averages.

  2. RCV on with degraded hold doesn't resolve hold on 4 circuits (LU8, LU32, mcml and stereovision0). I don't want to try to tune this further at this point, but if you have any idea of why let me know (and add it to your overall document) as it provides a starting point for future work.

  3. Why do LU8 and LU32 not produce full output when Th isn't resolved? It would be better for them to still produce a final routing and statistics if they obtain a legal routing.

  4. LU_Network error: If you have time to look at that, I'd certainly appreciate it. If not, filing an issue would still be helpful. Running the sanitizer build on RCV would be a good idea if you haven't done so; maybe there's a memory fault.

@vaughnbetz
Copy link
Contributor

This looks ready to merge to me. I have some qor questions above, but none seem to be merge gating.

@dpbaines
Copy link
Contributor Author

dpbaines commented Sep 9, 2020

@vaughnbetz To address your 4 concerns:

  1. My bad I copied the document and it didn't auto complete MCML and stereovision circuits, I have filled these numbers in just now.

  2. I made a little comment on the spreadsheet as to why this happened, but adding it to the document is also a good idea! The issues for (MCML and stereovision0) stem from global/ignored nets with hold violations. RCV right now will tell the router to complete if it's detecting that only global/ignored nets have hold issues. These are unresolvable right now, but will still be reported in the hTNS and return a successful routing. This may be the incorrect way to approach it, in which case would be a good idea for a future PR.

  3. As for the other 2 failed circuits (LU8 and LU32), I need to do a bit more investigation. Right now if RCV is active, and the router reaches the maximum iteration count with a none zero hTNS it reports a fail. In which case we don't get a final report. This feature stops the router from finishing before RCV can resolve hold completely, but can result in these failed routings. This can be improved by making RCV smarter at detecting when it's finished, and when it can give up. I'm not sure why it struggles to resolve hold however. This is something I want to look into for the next PR too, or at the very least mention in the document.

  4. It doesn't appear to be due to a memory fault. It's a failed assertion check in tatum, which I think can be traced back to the perform_STA function in route_budgets.cpp. I'll dig into it however it will take some time as it is an outlier case, so it can be work for a future PR. I'll try and get it sorted soon.

@vaughnbetz
Copy link
Contributor

Thanks David. 3 is probably resolvable: basically if the routing is legal we should report success, whether Th is met or not. It's fine to have RCV keep trying to resolve hold, but the final condition of whether this routing is legal or not should understand it is legal and just move on to reporting. Hopefully the code for that is simple / clean.

I agree 4 sounds more involved; sounds like something that can wait.

@vaughnbetz
Copy link
Contributor

Should be clear: 3 and 4 should both be left to future PRs. I just think 3 is a lot easier than 4.

@dpbaines
Copy link
Contributor Author

dpbaines commented Sep 9, 2020

@vaughnbetz thanks for getting back to me, sounds good I'll note these two issues in the document if I don't get around to fixing them soon. I might move to patch 3 right after this.

@dpbaines
Copy link
Contributor Author

dpbaines commented Sep 9, 2020

I updated the golden results for the vtr_reg_nightly/vtr_bidir regression test because it was performing differently. Let me know if you if you disagree with this

@vaughnbetz
Copy link
Contributor

Updating that QoR test looks OK to me.

@vaughnbetz
Copy link
Contributor

Read the docs failed but I can't see how that can possibly be related as no documentation files were changed in this PR. Keith confirmed in the meeting today that he's happy with the new code, as am I. Nice work pulling this together David!

@vaughnbetz vaughnbetz merged commit 7ae49bf into master Sep 10, 2020
@vaughnbetz vaughnbetz deleted the router_rcv_2.0 branch September 10, 2020 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang-cpp C/C++ code tests VPR VPR FPGA Placement & Routing Tool VTR Flow VTR Design Flow (scripts/benchmarks/architectures)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants