
RR graph fix up #1448


Merged
merged 21 commits into master on Sep 16, 2020

Conversation

tangxifan
Contributor

@tangxifan tangxifan commented Jul 22, 2020

Fix a bug in the Routing Resource Graph (rr_graph) which occurs when an input/output pin is located on multiple sides.

Description

Modify the rr_graph builder to

  • Allocate only a single, unique rr_node for each input/output pin that is located on multiple sides.
  • Make the unique rr_node indexable through the fast look-up from any of its sides (sketched below).
  • Relax some constraints in check_rr_graph to support the fixed-up rr_graph.
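
To illustrate the idea, here is a minimal, self-contained C++ sketch of a side-aware fast look-up in which every side of a multi-sided pin resolves to one shared rr_node. The names here (SideLookup, PinSide, add_multi_sided_pin) are hypothetical simplifications, not VPR's actual data structures.

// Minimal sketch, assuming hypothetical types -- not VPR's actual API.
// One rr_node is allocated per multi-sided pin and registered in the
// fast look-up under every side on which the pin appears.
#include <array>
#include <cstdio>
#include <map>
#include <tuple>

enum class PinSide { TOP, RIGHT, BOTTOM, LEFT };
using RRNodeId = int;
// Key: (x, y, side, pin index) -> rr_node id
using SideLookup = std::map<std::tuple<int, int, PinSide, int>, RRNodeId>;

// Before the fix: a separate rr_node was allocated per side of the pin.
// After the fix: allocate once, then register the same id on every side.
RRNodeId add_multi_sided_pin(SideLookup& lookup, int x, int y, int pin,
                             const std::array<PinSide, 2>& sides,
                             RRNodeId& next_id) {
    RRNodeId node = next_id++;  // single allocation for the pin
    for (PinSide side : sides)
        lookup[{x, y, side, pin}] = node;  // same node, multiple entries
    return node;
}

int main() {
    SideLookup lookup;
    RRNodeId next_id = 0;
    // A pin appearing on both the TOP and RIGHT sides of the tile at (3, 5):
    add_multi_sided_pin(lookup, 3, 5, /*pin=*/7,
                        {PinSide::TOP, PinSide::RIGHT}, next_id);
    // Both side look-ups now resolve to the same rr_node:
    std::printf("TOP: %d, RIGHT: %d\n", lookup[{3, 5, PinSide::TOP, 7}],
                lookup[{3, 5, PinSide::RIGHT, 7}]);
    return 0;
}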

Related Issue

This pull request was split from #1355 to address the bug found in the rr_graph.

Motivation and Context

The current rr_graph builder allocates an individual rr_node for each side of an input/output pin that is located on multiple sides.
As a result, the router can map different nets to these rr_nodes, which all correspond to the same physical pin.
In such a case, the routing results are illegal.
This bug is observed mainly in the Stratix IV architecture, but it is not limited to it.

How Has This Been Tested?

Passed the regression tests with QoR failures. We expect to fix the architecture XML to meet the QoR requirements.

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jul 22, 2020
@tangxifan
Contributor Author

Hi @kmurray @vaughnbetz
I just created this pull request by splitting out the rr_graph fix-up commits here.
I believe this provides a clean start for us to finish the rr_graph patching.

@tangxifan
Contributor Author

tangxifan commented Jul 23, 2020

Hi all,
To help us understand which part of the architecture causes routing congestion, I have checked the log files of the four failed Titan benchmarks.
The intermediate files are typically very large, ranging from 0.9GB to 6.8GB; even after compression they exceed the 10MB upload limit, and the .net file takes most of the disk space.
Therefore, I attached the log files and intermediate files of two benchmarks, which I think are small enough yet representative.

  • neuron: the example benchmark in the Titan tutorial, which passed titan flow-run.

  • openCV: the smallest benchmark that fails the titan flow-run. In the log file (vpr_stdout.log) I see that 1 overused node remains after 400 routing iterations, and it persists across many iterations. Therefore I believe this is a good example to look into, through which we can spot where the congestion is coming from.

    From a quick look at the logs, I cannot spot which pin or rr_node causes the congestion. If you can provide more details, I can dig that out.

neuron_stratixiv_arch_timing.blif.tar.gz

openCV_stratixiv_arch_timing.blif.tar.gz

I also attached the vpr log files of the other 3 failed benchmarks here.

failed_titan_benchmarks_vpr_logs.zip

@vaughnbetz
Contributor

Thanks a lot Xifan.

I took a look through the logs, and it is interesting and encouraging.

circuit         overused_rr   re-routed nets   re-routed conn   routing iteration
opencv          1             6                14               400
bitcoinminer    10            27               63               400
directrf        2             21               125              400
gaussianblur    3394          8036             39607            43 (gave up)

The 3 new failures are clearly highly localized congestion. I imagine input or output pins are the problem (maybe OPINs, given the number of nets getting ripped up and rerouted).

  1. Xifan, if you can get data on what type (IPIN or OPIN, block type, pin #) of rr-nodes are overused at the end of those first three routes (opencv, bitcoinminer, directrf) that would be very useful in narrowing this further. The good news is that it looks like some IPINs or OPINs are too hard to reach (or have too many signals they need to drive) so it's a localized problem that hopefully we can fix in the .arch file.

  2. Xifan, if you can upload the vpr.out files for the 4 circuits above and neuron from the run on the master (no rr-graph change) that would also be very helpful, so we can directly compare router congestion resolution output.

@vaughnbetz
Contributor

There is more discussion of the rr-graph changes and their impact in #1355

@vaughnbetz
Contributor

Ideas:
#0: Add code to print out exactly which nodes are overused at the end of routing if we have fewer than N overused nodes (N defaults to 20, command-line option). (Bill; a sketch follows at the end of this comment.)
#1: Limit the cluster output count below 100% (say to 80%). Command-line option; could change the auto meaning if this helps. (Xifan?)
#2: Change from instance equivalence on the logic block outputs to full equivalence (overly flexible, but easier to understand). We do instance equivalence but without duplication right now, and we don't model the 2:1 muxes on the outputs. (Xifan?)
#3: Could change the Fc,out values (make them 2x higher) to model the output 2:1 muxes. (Xifan?)
#4: Add input pin equivalence on DSP and RAM blocks in groups of 36 pins (matches Stratix IV). Kevin's recommendation is to do this with a wrapper / extra level of hierarchy where we have groups of equivalent pins that go into a complete (crossbar) and then drive the current pins. (Mohamed)
#5: Successful changes from above should go into the Titan arch generation script, which is currently in the Titan repo (check with Helen). (Mohamed)

The DSP and RAM block model is believed to be overly restrictive right now; it was just easier and never caused problems before.
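
For idea #0, here is a minimal, self-contained sketch of what such an overused-node report could look like. The RRNode struct, the report_overused_nodes helper, and the sample data are hypothetical placeholders, not VPR's implementation:

#include <cstdio>
#include <vector>

struct RRNode {
    const char* type;  // e.g. "IPIN", "OPIN", "CHANX"
    int occupancy;     // number of nets currently using this node
    int capacity;      // legal maximum
};

// Print each overused rr_node, but only when there are fewer than
// max_to_report of them (N, default 20, from a command-line option).
void report_overused_nodes(const std::vector<RRNode>& rr_nodes,
                           int max_to_report = 20) {
    std::vector<int> overused;
    for (int i = 0; i < (int)rr_nodes.size(); ++i)
        if (rr_nodes[i].occupancy > rr_nodes[i].capacity)
            overused.push_back(i);
    if (overused.empty() || (int)overused.size() >= max_to_report) {
        std::printf("%d overused node(s); details suppressed\n",
                    (int)overused.size());
        return;
    }
    for (int i : overused)
        std::printf("node %d (%s): occupancy %d > capacity %d\n", i,
                    rr_nodes[i].type, rr_nodes[i].occupancy,
                    rr_nodes[i].capacity);
}

int main() {
    // Hypothetical post-routing state with one congested IPIN:
    report_overused_nodes({{"IPIN", 3, 1}, {"CHANX", 1, 1}});
    return 0;
}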

@tangxifan
Contributor Author

tangxifan commented Jul 24, 2020

Hi @vaughnbetz

  • I attached the vpr.out log files of the 4 failed titan benchmarks using the current master.

titan_master_vpr_out.zip

  • Regarding the overused-node print-out, there seems to be some overlap with Bill's work. Should I participate, or should I merge Bill's branch?
  • Regarding the cluster output utilization, do you expect me to modify the flow-run parameters only for the titan regression tests, or also for the strong regression tests which use the titan architecture?
  • I think using full equivalence for the LAB outputs is reasonable, as the crossbar is already defined in the architecture XML; in this case, the LAB outputs literally have full equivalence. But I am curious why it was defined as instance equivalence before. What is the difference between instance and full equivalence when building rr_graph SOURCE nodes? (Sorry, I have rarely used instance, but I am familiar with full.)

Here is the XML code of the titan arch related to the LAB output connection in the current master. It is a complete crossbar.

<!-- ALM Outputs: directly drive global routing -->
<complete input="alm[9:0].data_out[1:2] alm[9:0].data_out[4:5]" name="LAB_dataout" output="LAB.data_out"/>
  • As for the 2:1 muxes, I did not see their definition in the arch XML, and I think Stratix IV does not have any. Do you expect me to add these muxes? If we increase the Fc_out values, it means we move the local routing multiplexers into the switch blocks, which may increase the routing area.

  • I did not see any crossbars for the DSP inputs. Here is the XML code of the titan arch related to the DSP data input connection in the current master; we have only direct connections.

<direct input="DSP.data_in[287:144]" name="data_in_bot" output="half_DSP[1].data_in">
  <delay_constant max="239e-12" min="132e-12" in_port="DSP.data_in[287:144]" out_port="half_DSP[1].data_in"/>
</direct>
<direct input="DSP.data_in[143:0]" name="data_in_top" output="half_DSP[0].data_in">
  <delay_constant max="239e-12" min="132e-12" in_port="DSP.data_in[143:0]" out_port="half_DSP[0].data_in"/>
</direct>

Adding the input crossbars will improve routability, but should we then recalculate the DSP block area? These crossbars are very area-consuming. A similar question applies to the RAM blocks.

Regarding the methodology, do you expect me to apply all the modifications on my branch (in this pull request), or in separate branches that are then merged together?

@vaughnbetz
Contributor

@Bill-hbrhbr Adding Bill. Xifan, let's get Bill to make the overused rr-node print-out feature. I don't want to overload you too much!
@helen-23 Also adding Helen since she is likely to get involved with the Titan architecture changes.

  1. Experiment 1: instance equivalence is less flexible than full equivalence, since you can't have signals go to multiple outputs (at least not without logic duplication, which we don't support). Sounds like that's an easy one, so if you can try changing the output equivalence to "full" and run to get a comparison, that would be great, Xifan. (A sketch of the rr_graph implication follows this list.)
  2. Experiment 2: Stratix IV actually has 36:1 crossbars on the inputs to DSP blocks and RAM blocks. I talked to Kevin today, and it wasn't intentional to leave them out -- we just never put them in through forgetting / running out of time. It didn't seem to cause many problems before, but now I think it does. So the proper thing to do would be to group the inputs into 36:1 groups with input equivalence and crossbars, for both DSP and RAM blocks. If that is a lot of work, we could do a quick-and-dirty experiment with all the inputs to a RAM and DSP made equivalent.
  3. Experiment 3: You are right that increasing Fc,out will increase area, but Stratix IV does have 2:1 muxes between the LE outputs and the routing drivers. Since we already have a full crossbar on the logic block outputs, we shouldn't add any more flexibility there. But for the RAM and DSP blocks we should experiment with doubling Fc,out, as it would better approximate having a 2:1 mux between the outputs and the routing drivers. My recollection is that those 2:1 muxes are in those blocks as well, but I am not 100% sure.
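
To make the experiment-1 distinction concrete, here is a simplified sketch (hypothetical structures, not VPR's API) of the SOURCE-node construction for a class of equivalent output pins. This is one reading of the difference: the graph connectivity looks similar in both cases, but instance equivalence forbids one signal from occupying several pins of the class, while full equivalence allows it.

#include <vector>

struct Edge { int from, to; };

struct RRGraphSketch {
    int num_nodes = 0;
    std::vector<Edge> edges;
    int add_node() { return num_nodes++; }
};

// One SOURCE feeds every OPIN in the equivalence class, so the router
// may route a net out of any pin of the class. With instance
// equivalence the same edges exist, but a net may claim only one pin
// instance; with full equivalence a net may fan out through several
// pins of the class (the "signals go to multiple outputs" case above).
int build_equivalent_output_class(RRGraphSketch& g,
                                  const std::vector<int>& opin_nodes) {
    int source = g.add_node();  // capacity would equal opin_nodes.size()
    for (int opin : opin_nodes)
        g.edges.push_back({source, opin});  // SOURCE -> OPIN edge
    return source;
}

int main() {
    RRGraphSketch g;
    // Three OPINs in one logically-equivalent output class:
    std::vector<int> opins = {g.add_node(), g.add_node(), g.add_node()};
    build_equivalent_output_class(g, opins);
    return 0;
}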

Based on the results of those, we could decide which architecture changes to commit (or to combine some and do an experiment 4).

Methodology: Xifan, if you can make the architectures for experiments 1-3 and run them on your branch, and also pass them to Helen to run on the master (or, if it is easier, you can also run them on the master), that would be great. I'd like to get QoR data for all those items, and then we can decide what to change in the architecture. Helen has some new circuits that she's getting into the Titan flow, and at least one of those showed some strange routing to RAM blocks on the master, so I think the lack of input pin equivalence on RAM blocks may already be an issue (maybe the current Titan designs don't have that many RAM blocks?).

@vaughnbetz
Contributor

Also, experiment 0.5 (before experiment 1 above) is to play with various options for the cluster input and output pin utilization limits.
The --target_ext_pin_util option controls this. Right now we're using auto, which limits the logic block to 80% input usage when possible (with no other limits). Other interesting cases (a rough sketch of what these limits mean follows the list):

  • logic block has input and output limits of 0.8
  • all blocks have input and output limits of 0.8
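
As a rough illustration of what these limits mean, here is a tiny sketch, assuming the cap is simply floor(physical_pins * target). The max_external_pins helper and the pin counts are made-up examples, not VPR's packer code:

#include <cmath>
#include <cstdio>

// Maximum external pins the packer may use, given a utilization target.
int max_external_pins(int physical_pins, double target_util) {
    return (int)std::floor(physical_pins * target_util);
}

int main() {
    // Example (made-up) pin counts for a block at --target_ext_pin_util 0.8:
    std::printf("inputs:  %d of 52 usable\n", max_external_pins(52, 0.8));
    std::printf("outputs: %d of 20 usable\n", max_external_pins(20, 0.8));
    return 0;
}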

@vaughnbetz
Contributor

Filed the overused rr-node reporting issue to Bill: #1453

@tangxifan
Contributor Author

tangxifan commented Jul 28, 2020

Hi @vaughnbetz
I just finished experiment 0.5, where I set the LAB input and output utilization rates to 0.8 (--target_ext_pin_util LAB:0.8,0.8).
Two benchmarks, directrf and opencv, now pass routing, but a new benchmark, sparc_T2, failed in routing.
I attached a zip file which includes

  • A spreadsheet that compares the current QoR to the golden results on master
  • vpr.out of the newly passed and failed benchmarks
  • A summary of the QoR check failures from parse_vtr_task.pl

titan_LAB_io_util0.8.zip

@tangxifan
Contributor Author

Hi @vaughnbetz
I have finished another experiment 0.5, where I set all logic block input and output utilization rates to 0.8 (--target_ext_pin_util 0.8,0.8).
This time, the benchmark opencv passes routing, but a new benchmark, sparc_T2, failed in routing.
I attached a zip file which includes

  • A spreadsheet that compares the current QoR to the golden results on master
  • vpr.out of the newly passed and failed benchmarks
  • A summary of the QoR check failures from parse_vtr_task.pl

titan_util0.8.zip

@tangxifan
Contributor Author

Hi @vaughnbetz
Experiment 1 has been finished. After applying full equivalence to the LAB outputs, all the titan benchmarks passed flow-run, but with some QoR failures. A short summary of the QoR:

  • The QoR failures are actually due to improvements from the use of full pin equivalence.

  • Routing area is reduced by 5% on average.

  • There is almost no impact on critical path delay.

  • Runtime overhead is still there. For example, the number of routing iterations almost doubles.

    See details in the attached report.
    LAB_output_equivalent.zip

@vaughnbetz
Contributor

This looks great. Full equivalence on the outputs definitely looks like a good thing to turn on.

@vaughnbetz
Contributor

Just looked through the 0.8 input and 0.8 output pin utilization target experiment. It didn't change much: +2% cluster count, +2% placer-estimated critical path, +5% routed wirelength, +9% routed critical path delay, +446% route time (probably really +300%, since the machine load looks higher due to place time). All comparisons are to the master.

@vaughnbetz
Contributor

I've created a spreadsheet that summarizes the 4 runs so far. LAB output equivalence with the fixed rr-graph is very close to quality parity with the master (with no rr-graph fix). Some metrics are slightly worse, some slightly better. I'll keep this spreadsheet up to date with future results so we can track which combinations might be promising, but output equivalence definitely looks good.
titan_arch_xifan.xlsx

@tangxifan
Contributor Author

tangxifan commented Jul 31, 2020

Hi @vaughnbetz I am trying to re-run the titan tests using the latest master with the rr_graph fix-up (no arch modification, only the rr_graph modification), so that we can report overused nodes. But I detected a bug in the flow-run script, where max_route_iterations is reduced from 400 to 150. An issue has been created at #1458.
This is critical to my tests because, as you can see in the spreadsheet, the required number of routing iterations is more than doubled. As a result, most of the benchmarks failed this time. I would like your advice on how to proceed.

  • Should I wait for feedback on the bug?
  • Should I revert to an older master and perform experiments 3 and 4? In this case, we may not have overused-node reports.

…e Travis/Kokoro shows different QoR than the local machine. Should discuss if the patch is needed or not
@tangxifan
Contributor Author

@litghost I have fine-tuned the architecture k6_N10_mem32K_40nm_i_and_o.xml by

  • forcing 'EMPTY' tiles on the left and right sides
  • using 'clb' tiles to fill the empty locations in the columns of mem32K and DSP

This improves the routability of the architecture without conflicting with what you want to test.
Please let me know if this is fine for you.

FYI, I attached the grid layout (from VPR GUI) after architecture modification here.

  • Global view of the architecture:
    [screenshot]

  • A close view on the architecture:

    • You can see that CLBs fill the empty locations in the columns of mem32K and DSP.

    • CLBs start from x=1, while x=0 is now filled by EMPTY tiles.
      [screenshot]

    • The routing architecture surrounds the CLBs as in a regular FPGA fabric.
      [screenshot]

@vaughnbetz
Contributor

Hi Xifan,

Mixing clbs and RAMs in a column like this is hard to lay out, so it's not normally done in commercial FPGAs. I don't object to testing it in this architecture, but I think you should put a comment at the top of the file explaining what this architecture does differently and what it is intended to test. (Nobody did that for the dedicated input/output pads it appears to have, so it would be good to list those in that comment too.)

I don't think this architecture is used much in regtests, and it is not in any automated QoR tests (which is good in my opinion, as with these changes it will be a bit more of an unusual architecture). Is that right?

@tangxifan
Contributor Author

@vaughnbetz I agree on this. It would challenge the physical design a lot. This is indeed rare in commercial FPGAs as well as in any tape-out we have done. Usually, we size the grids properly so that there are no such EMPTY holes in the DSP/RAM columns; these holes are seen only during architecture exploration. Therefore, I have added comments at the head of the architecture XML to clarify these points. If I missed anything, feel free to let me know.

@vaughnbetz
Contributor

Thanks Xifan; the new comment is very clear.

@tangxifan
Contributor Author

Hi all, I have addressed all the comments so far. The regression tests now pass. Please let me know if it is good to merge.

@litghost
Collaborator

Arch change looks fine to me

@vaughnbetz
Contributor

Looks good.

We can merge this now, but

  1. I have a minor comment to add to the arch file (can go in a different PR).
  2. I think it would be good to run a final Titan QoR sweep of the current master vs. this new PR (which is about to become the master), so we have it written down how QoR changed, and hence know (roughly) how to adjust expectations if someone has data compared against the old rr-graph and master. I know you measured that a couple of months ago, Xifan, but I think it's important enough to measure again as we commit this.

I suspect we can merge in advance of 1 & 2, though we should do them ASAP after the merge. @tangxifan does that make sense?

@tangxifan
Contributor Author

@vaughnbetz

  • No problem with the comments on the arch file. I can do that now.
  • I will rerun the titan regression tests and collect the QoR. Do you want me to update the golden results for titan, or just submit a spreadsheet comparing the QoR differences in this PR?

It will take ~1 day to finish and I think we can do it before merging.

@vaughnbetz
Contributor

Thanks Xifan. I think it would be good to upload a spreadsheet of Titan QoR first, and then if all looks good, update the golden and merge.

@tangxifan
Contributor Author

tangxifan commented Sep 16, 2020

Hi @vaughnbetz
I have finished the regression tests for the titan benchmarks.
In the attached spreadsheet, I compare the QoR results of the rr_graph fix-up branch to the golden results of the current master.
You can see that there is zero difference in the average critical path delay and an average 5% improvement in area.
I also ran -check_golden, and it reported that the QoR checks passed.

titan_parse_results_RRG_fixup_vs_master.xlsx

@vaughnbetz
Contributor

Thanks Xifan! This looks great, and I don't think we need to update the golden results, as everything was close enough anyway. I think this is ready to merge. Let me know if you disagree; otherwise I'll merge it.

@tangxifan
Contributor Author

@vaughnbetz Thanks for the advice. I have nothing against it.

@vaughnbetz vaughnbetz merged commit 74c8dbe into verilog-to-routing:master Sep 16, 2020
@vaughnbetz
Contributor

Excellent -- glad to land this one!
