
RR graph fix up #1448


Merged
merged 21 commits into master on Sep 16, 2020

Conversation

tangxifan
Contributor

@tangxifan tangxifan commented Jul 22, 2020

Fix a bug in the Routing Resource Graph (rr_graph) which occurs when an input/output pin is located on multiple sides.

Description

Modify the rr_graph builder to

  • Allocate only a single, unique rr_node for each input/output pin that is located on multiple sides.
  • Make the unique rr_node indexable through the fast look-up from any of its sides (sketched below).
  • Relax some constraints in check_rr_graph to support the fixed-up rr_graph.
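
To illustrate the idea, here is a minimal, self-contained C++ sketch of a side-aware fast look-up in which every side of a multi-sided pin resolves to one shared rr_node. The names here (SideLookup, PinSide, add_multi_sided_pin) are hypothetical simplifications, not VPR's actual data structures.

// Minimal sketch, assuming hypothetical types -- not VPR's actual API.
// One rr_node is allocated per multi-sided pin and registered in the
// fast look-up under every side on which the pin appears.
#include <array>
#include <cstdio>
#include <map>
#include <tuple>

enum class PinSide { TOP, RIGHT, BOTTOM, LEFT };
using RRNodeId = int;
// Key: (x, y, side, pin index) -> rr_node id
using SideLookup = std::map<std::tuple<int, int, PinSide, int>, RRNodeId>;

// Before the fix: a separate rr_node was allocated per side of the pin.
// After the fix: allocate once, then register the same id on every side.
RRNodeId add_multi_sided_pin(SideLookup& lookup, int x, int y, int pin,
                             const std::array<PinSide, 2>& sides,
                             RRNodeId& next_id) {
    RRNodeId node = next_id++;  // single allocation for the pin
    for (PinSide side : sides)
        lookup[{x, y, side, pin}] = node;  // same node, multiple entries
    return node;
}

int main() {
    SideLookup lookup;
    RRNodeId next_id = 0;
    // A pin appearing on both the TOP and RIGHT sides of the tile at (3, 5):
    add_multi_sided_pin(lookup, 3, 5, /*pin=*/7,
                        {PinSide::TOP, PinSide::RIGHT}, next_id);
    // Both side look-ups now resolve to the same rr_node:
    std::printf("TOP: %d, RIGHT: %d\n", lookup[{3, 5, PinSide::TOP, 7}],
                lookup[{3, 5, PinSide::RIGHT, 7}]);
    return 0;
}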

Related Issue

This pull request was split from #1355 to address the bug found in the rr_graph.

Motivation and Context

The current rr_graph builder allocates an individual rr_node for each side of an input/output pin that is located on multiple sides.
As a result, the router can map different nets to these rr_nodes, which all correspond to the same physical pin.
In such a case, the routing results are illegal.
This bug is observed mainly in the Stratix IV architecture, but it is not limited to it.

How Has This Been Tested?

Passed the regression tests with QoR failures. We expect to fix the architecture XML to meet the QoR requirements.

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jul 22, 2020
@tangxifan
Contributor Author

Hi @kmurray @vaughnbetz
I just created this pull request by splitting out the rr_graph fix-up commits here.
I believe this provides a clean start for us to finish the rr_graph patching.

@tangxifan
Contributor Author

tangxifan commented Jul 23, 2020

Hi all,
To help us understand which part of the architecture causes routing congestion, I have checked the log files of the four failed Titan benchmarks.
The intermediate files are typically very large, ranging from 0.9GB to 6.8GB; even after compression they exceed the 10MB upload limit, and the .net file takes most of the disk space.
Therefore, I attached the log files and intermediate files of two benchmarks, which I think are small enough yet representative.

  • neuron: the example benchmark in the Titan tutorial, which passed titan flow-run.

  • openCV: the smallest benchmark that fails the titan flow-run. In the log file (vpr_stdout.log) I see that 1 overused node remains after 400 routing iterations, and it persists across many iterations. Therefore I believe this is a good example to look into, through which we can spot where the congestion is coming from.

    From a quick look at the logs, I cannot spot which pin or rr_node causes the congestion. If you can provide more details, I can dig that out.

neuron_stratixiv_arch_timing.blif.tar.gz

openCV_stratixiv_arch_timing.blif.tar.gz

I also attached the vpr log files of the other 3 failed benchmarks here.

failed_titan_benchmarks_vpr_logs.zip

@vaughnbetz
Contributor

Thanks a lot Xifan.

I took a look through the logs, and it is interesting and encouraging.

circuit         overused_rr   re-routed nets   re-routed conn   routing iteration
opencv          1             6                14               400
bitcoinminer    10            27               63               400
directrf        2             21               125              400
gaussianblur    3394          8036             39607            43 (gave up)

The 3 new failures are clearly highly localized congestion. I imagine input or output pins are the problem (maybe OPINs, given the number of nets getting ripped up and rerouted).

  1. Xifan, if you can get data on what type (IPIN or OPIN, block type, pin #) of rr-nodes are overused at the end of those first three routes (opencv, bitcoinminer, directrf) that would be very useful in narrowing this further. The good news is that it looks like some IPINs or OPINs are too hard to reach (or have too many signals they need to drive) so it's a localized problem that hopefully we can fix in the .arch file.

  2. Xifan, if you can upload the vpr.out files for the 4 circuits above and neuron from the run on the master (no rr-graph change) that would also be very helpful, so we can directly compare router congestion resolution output.

@vaughnbetz
Contributor

There is more discussion of the rr-graph changes and their impact in #1355

@vaughnbetz
Contributor

Ideas:
#0: Add code to print out exactly which nodes are overused at the end of routing if we have fewer than N overused nodes (N defaults to 20, command-line option). (Bill; a sketch follows at the end of this comment.)
#1: Limit the cluster output count below 100% (say to 80%). Command-line option; could change the auto meaning if this helps. (Xifan?)
#2: Change from instance equivalence on the logic block outputs to full equivalence (overly flexible, but easier to understand). We do instance equivalence but without duplication right now, and we don't model the 2:1 muxes on the outputs. (Xifan?)
#3: Could change the Fc,out values (make them 2x higher) to model the output 2:1 muxes. (Xifan?)
#4: Add input pin equivalence on DSP and RAM blocks in groups of 36 pins (matches Stratix IV). Kevin's recommendation is to do this with a wrapper / extra level of hierarchy where we have groups of equivalent pins that go into a complete (crossbar) and then drive the current pins. (Mohamed)
#5: Successful changes from above should go into the Titan arch generation script, which is currently in the Titan repo (check with Helen). (Mohamed)

The DSP and RAM block model is believed to be overly restrictive right now; it was just easier and never caused problems before.
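
For idea #0, here is a minimal, self-contained sketch of what such an overused-node report could look like. The RRNode struct, the report_overused_nodes helper, and the sample data are hypothetical placeholders, not VPR's implementation:

#include <cstdio>
#include <vector>

struct RRNode {
    const char* type;  // e.g. "IPIN", "OPIN", "CHANX"
    int occupancy;     // number of nets currently using this node
    int capacity;      // legal maximum
};

// Print each overused rr_node, but only when there are fewer than
// max_to_report of them (N, default 20, from a command-line option).
void report_overused_nodes(const std::vector<RRNode>& rr_nodes,
                           int max_to_report = 20) {
    std::vector<int> overused;
    for (int i = 0; i < (int)rr_nodes.size(); ++i)
        if (rr_nodes[i].occupancy > rr_nodes[i].capacity)
            overused.push_back(i);
    if (overused.empty() || (int)overused.size() >= max_to_report) {
        std::printf("%d overused node(s); details suppressed\n",
                    (int)overused.size());
        return;
    }
    for (int i : overused)
        std::printf("node %d (%s): occupancy %d > capacity %d\n", i,
                    rr_nodes[i].type, rr_nodes[i].occupancy,
                    rr_nodes[i].capacity);
}

int main() {
    // Hypothetical post-routing state with one congested IPIN:
    report_overused_nodes({{"IPIN", 3, 1}, {"CHANX", 1, 1}});
    return 0;
}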

@tangxifan
Contributor Author

tangxifan commented Jul 24, 2020

Hi @vaughnbetz

  • I attached the vpr.out log files of the 4 failed titan benchmarks using the current master.

titan_master_vpr_out.zip

  • Regarding the overused-node print-out, there seems to be some overlap with Bill's work. Should I participate, or should I merge Bill's branch?
  • Regarding the cluster output utilization, do you expect me to modify the flow-run parameters only for the titan regression tests, or also for the strong regression tests which use the titan architecture?
  • I think using full equivalence for the LAB outputs is reasonable, as the crossbar is already defined in the architecture XML; in this case, the LAB outputs literally have full equivalence. But I am curious why it was defined as instance equivalence before. What is the difference between instance and full equivalence when building rr_graph SOURCE nodes? (Sorry, I have rarely used instance, but I am familiar with full.)

Here is the XML code of the titan arch related to the LAB output connection in the current master. It is a complete crossbar.

<!-- ALM Outputs: directly drive global routing -->
<complete input="alm[9:0].data_out[1:2] alm[9:0].data_out[4:5]" name="LAB_dataout" output="LAB.data_out"/>
  • As for the 2:1 muxes, I did not see their definition in the arch XML, and I think Stratix IV does not have any. Do you expect me to add these muxes? If we increase the Fc_out values, it means we move the local routing multiplexers into the switch blocks, which may increase the routing area.

  • I did not see any crossbars for the DSP inputs. Here is the XML code of the titan arch related to the DSP data input connection in the current master; we have only direct connections.

<direct input="DSP.data_in[287:144]" name="data_in_bot" output="half_DSP[1].data_in">
  <delay_constant max="239e-12" min="132e-12" in_port="DSP.data_in[287:144]" out_port="half_DSP[1].data_in"/>
</direct>
<direct input="DSP.data_in[143:0]" name="data_in_top" output="half_DSP[0].data_in">
  <delay_constant max="239e-12" min="132e-12" in_port="DSP.data_in[143:0]" out_port="half_DSP[0].data_in"/>
</direct>

Adding the input crossbars will improve routability, but should we then recalculate the DSP block area? These crossbars are very area-consuming. A similar question applies to the RAM blocks.

Regarding the methodology, do you expect me to apply all the modifications on my branch (in this pull request), or in separate branches that are then merged together?

@vaughnbetz
Contributor

@Bill-hbrhbr Adding Bill. Xifan, let's get Bill to make the overused rr-node print-out feature. I don't want to overload you too much!
@helen-23 Also adding Helen since she is likely to get involved with the Titan architecture changes.

  1. Experiment 1: instance equivalence is less flexible than full equivalence, since you can't have signals go to multiple outputs (at least not without logic duplication, which we don't support). Sounds like that's an easy one, so if you can try changing the output equivalence to "full" and run to get a comparison, that would be great, Xifan. (A sketch of the rr_graph implication follows this list.)
  2. Experiment 2: Stratix IV actually has 36:1 crossbars on the inputs to DSP blocks and RAM blocks. I talked to Kevin today, and it wasn't intentional to leave them out -- we just never put them in through forgetting / running out of time. It didn't seem to cause many problems before, but now I think it does. So the proper thing to do would be to group the inputs into 36:1 groups with input equivalence and crossbars, for both DSP and RAM blocks. If that is a lot of work, we could do a quick-and-dirty experiment with all the inputs to a RAM and DSP made equivalent.
  3. Experiment 3: You are right that increasing Fc,out will increase area, but Stratix IV does have 2:1 muxes between the LE outputs and the routing drivers. Since we already have a full crossbar on the logic block outputs, we shouldn't add any more flexibility there. But for the RAM and DSP blocks we should experiment with doubling Fc,out, as it would better approximate having a 2:1 mux between the outputs and the routing drivers. My recollection is that those 2:1 muxes are in those blocks as well, but I am not 100% sure.
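
To make the experiment-1 distinction concrete, here is a simplified sketch (hypothetical structures, not VPR's API) of the SOURCE-node construction for a class of equivalent output pins. This is one reading of the difference: the graph connectivity looks similar in both cases, but instance equivalence forbids one signal from occupying several pins of the class, while full equivalence allows it.

#include <vector>

struct Edge { int from, to; };

struct RRGraphSketch {
    int num_nodes = 0;
    std::vector<Edge> edges;
    int add_node() { return num_nodes++; }
};

// One SOURCE feeds every OPIN in the equivalence class, so the router
// may route a net out of any pin of the class. With instance
// equivalence the same edges exist, but a net may claim only one pin
// instance; with full equivalence a net may fan out through several
// pins of the class (the "signals go to multiple outputs" case above).
int build_equivalent_output_class(RRGraphSketch& g,
                                  const std::vector<int>& opin_nodes) {
    int source = g.add_node();  // capacity would equal opin_nodes.size()
    for (int opin : opin_nodes)
        g.edges.push_back({source, opin});  // SOURCE -> OPIN edge
    return source;
}

int main() {
    RRGraphSketch g;
    // Three OPINs in one logically-equivalent output class:
    std::vector<int> opins = {g.add_node(), g.add_node(), g.add_node()};
    build_equivalent_output_class(g, opins);
    return 0;
}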

Based on the results of those, we could decide which architecture changes to commit (or to combine some and do an experiment 4).

Methodology: Xifan, if you can make the architectures for experiments 1-3 and run them on your branch, and also pass them to Helen to run on the master (or, if it is easier, you can also run them on the master), that would be great. I'd like to get QoR data for all those items, and then we can decide what to change in the architecture. Helen has some new circuits that she's getting into the Titan flow, and at least one of those showed some strange routing to RAM blocks on the master, so I think the lack of input pin equivalence on RAM blocks may already be an issue (maybe the current Titan designs don't have that many RAM blocks?).

@vaughnbetz
Contributor

Also, experiment 0.5 (before experiment 1 above) is to play with various options for the cluster input and output pin utilization limits.
The --target_ext_pin_util option controls this. Right now we're using auto, which limits the logic block to 80% input usage when possible (with no other limits). Other interesting cases (a rough sketch of what these limits mean follows the list):

  • logic block has input and output limits of 0.8
  • all blocks have input and output limits of 0.8
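
As a rough illustration of what these limits mean, here is a tiny sketch, assuming the cap is simply floor(physical_pins * target). The max_external_pins helper and the pin counts are made-up examples, not VPR's packer code:

#include <cmath>
#include <cstdio>

// Maximum external pins the packer may use, given a utilization target.
int max_external_pins(int physical_pins, double target_util) {
    return (int)std::floor(physical_pins * target_util);
}

int main() {
    // Example (made-up) pin counts for a block at --target_ext_pin_util 0.8:
    std::printf("inputs:  %d of 52 usable\n", max_external_pins(52, 0.8));
    std::printf("outputs: %d of 20 usable\n", max_external_pins(20, 0.8));
    return 0;
}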

@vaughnbetz
Contributor

Filed the overused rr-node reporting issue to Bill: #1453

@tangxifan
Contributor Author

tangxifan commented Jul 28, 2020

Hi @vaughnbetz
I just finished experiment 0.5, where I set the LAB input and output utilization rates to 0.8 (--target_ext_pin_util LAB:0.8,0.8).
Two benchmarks, directrf and opencv, now pass routing, but a new benchmark, sparc_T2, failed in routing.
I attached a zip file which includes

  • A spreadsheet that compares the current QoR to the golden results on master
  • vpr.out of the newly passed and failed benchmarks
  • A summary of the QoR check failures from parse_vtr_task.pl

titan_LAB_io_util0.8.zip

@tangxifan
Contributor Author

Hi @vaughnbetz
I have finished another experiment 0.5, where I set all logic block input and output utilization rates to 0.8 (--target_ext_pin_util 0.8,0.8).
This time, the benchmark opencv passes routing, but a new benchmark, sparc_T2, failed in routing.
I attached a zip file which includes

  • A spreadsheet that compares the current QoR to the golden results on master
  • vpr.out of the newly passed and failed benchmarks
  • A summary of the QoR check failures from parse_vtr_task.pl

titan_util0.8.zip

@tangxifan
Contributor Author

Hi @vaughnbetz
Experiment 1 has been finished. After applying full equivalence to the LAB outputs, all the titan benchmarks passed flow-run, but with some QoR failures. A short summary of the QoR:

  • The QoR failures are actually due to improvements from the use of full pin equivalence.

  • Routing area is reduced by 5% on average.

  • There is almost no impact on critical path delay.

  • Runtime overhead is still there. For example, the number of routing iterations almost doubles.

    See details in the attached report.
    LAB_output_equivalent.zip

@vaughnbetz
Contributor

This looks great. Full equivalence on the outputs definitely looks like a good thing to turn on.

@vaughnbetz
Contributor

Just looked through the 0.8 input and 0.8 output pin utilization target experiment. It didn't change much: +2% cluster count, +2% placer-estimated critical path, +5% routed wirelength, +9% routed critical path delay, +446% route time (probably really +300%, since the machine load looks higher due to place time). All comparisons are to the master.

@vaughnbetz
Contributor

I've created a spreadsheet that summarizes the 4 runs so far. LAB output equivalence with the fixed rr-graph is very close to quality parity with the master (with no rr-graph fix). Some metrics are slightly worse, some slightly better. I'll keep this spreadsheet up to date with future results so we can track which combinations might be promising, but output equivalence definitely looks good.
titan_arch_xifan.xlsx

@tangxifan
Contributor Author

tangxifan commented Jul 31, 2020

Hi @vaughnbetz I am trying to re-run the titan tests using the latest master with the rr_graph fix-up (no arch modification, only the rr_graph modification), so that we can report overused nodes. But I detected a bug in the flow-run script, where max_route_iterations is reduced from 400 to 150. An issue has been created at #1458.
This is critical to my tests because, as you can see in the spreadsheet, the required number of routing iterations is more than doubled. As a result, most of the benchmarks failed this time. I would like your advice on how to proceed.

  • Should I wait for feedback on the bug?
  • Should I revert to an older master and perform experiments 3 and 4? In this case, we may not have overused-node reports.

…e Travis/Kokoro shows different QoR than the local machine. Should discuss if the patch is needed or not
@tangxifan
Contributor Author

@litghost I have fine-tuned the architecture k6_N10_mem32K_40nm_i_and_o.xml by

  • forcing 'EMPTY' tiles on the left and right sides
  • using 'clb' tiles to fill the empty locations in the columns of mem32K and DSP

This improves the routability of the architecture without conflicting with what you want to test.
Please let me know if this is fine for you.

FYI, I attached the grid layout (from VPR GUI) after architecture modification here.

  • Global view of the architecture:
    [screenshot]

  • A close view on the architecture:

    • You can see that CLBs fill the empty locations in the columns of mem32K and DSP.

    • CLBs start from x=1, while x=0 is now filled by EMPTY tiles.
      [screenshot]

    • The routing architecture surrounds the CLBs as in a regular FPGA fabric.
      [screenshot]

@vaughnbetz
Contributor

Hi Xifan,

Mixing clbs and RAMs in a column like this is hard to lay out, so it's not normally done in commercial FPGAs. I don't object to testing it in this architecture, but I think you should put a comment at the top of the file explaining what this architecture does differently and what it is intended to test. (Nobody did that for the dedicated input/output pads it appears to have, so it would be good to list those in that comment too.)

I don't think this architecture is used much in regtests, and it is not in any automated QoR tests (which is good in my opinion, as with these changes it will be a bit more of an unusual architecture). Is that right?

@tangxifan
Contributor Author

@vaughnbetz I agree on this. It would challenge the physical design a lot. This is indeed rare in commercial FPGAs as well as in any tape-out we have done. Usually, we size the grids properly so that there are no such EMPTY holes in the DSP/RAM columns; these holes are seen only during architecture exploration. Therefore, I have added comments at the head of the architecture XML to clarify these points. If I missed anything, feel free to let me know.

@vaughnbetz
Contributor

Thanks Xifan; the new comment is very clear.

@tangxifan
Contributor Author

Hi all, I have addressed all the comments so far. The regression tests now pass. Please let me know if it is good to merge.

@litghost
Collaborator

Arch change looks fine to me

@vaughnbetz
Contributor

Looks good.

We can merge this now, but

  1. I have a minor comment to add to the arch file (can go in a different PR).
  2. I think it would be good to run a final Titan QoR sweep of the current master vs. this new PR (which is about to become the master), so we have it written down how QoR changed, and hence know (roughly) how to adjust expectations if someone has data compared against the old rr-graph and master. I know you measured that a couple of months ago, Xifan, but I think it's important enough to measure again as we commit this.

I suspect we can merge in advance of 1 & 2, though we should do them ASAP after the merge. @tangxifan does that make sense?

@tangxifan
Contributor Author

@vaughnbetz

  • No problem with the comments on the arch file. I can do that now.
  • I will rerun the titan regression tests and collect the QoR. Do you want me to update the golden results for titan, or just submit a spreadsheet comparing the QoR differences in this PR?

It will take ~1 day to finish and I think we can do it before merging.

@vaughnbetz
Contributor

Thanks Xifan. I think it would be good to upload a spreadsheet of Titan QoR first, and then if all looks good, update the golden and merge.

@tangxifan
Contributor Author

tangxifan commented Sep 16, 2020

Hi @vaughnbetz
I have finished the regression tests for the titan benchmarks.
In the attached spreadsheet, I compare the QoR results of the rr_graph fix-up branch to the golden results of the current master.
You can see that there is zero difference in the average critical path delay and an average 5% improvement in area.
I also ran -check_golden, and it reported that the QoR checks passed.

titan_parse_results_RRG_fixup_vs_master.xlsx

@vaughnbetz
Contributor

Thanks Xifan! This looks great, and I don't think we need to update the golden results, as everything was close enough anyway. I think this is ready to merge. Let me know if you disagree; otherwise I'll merge it.

@tangxifan
Contributor Author

@vaughnbetz Thanks for the advice. I have nothing against it.

@vaughnbetz vaughnbetz merged commit 74c8dbe into verilog-to-routing:master Sep 16, 2020
@vaughnbetz
Contributor

Excellent -- glad to land this one!
