
Post routing cluster block pin fix-up #1355


Merged
vaughnbetz merged 57 commits into verilog-to-routing:master from the pb_pin_fixup branch on Nov 8, 2020

Conversation


@tangxifan tangxifan commented Jun 10, 2020

Add code to fix up the net swapping caused by routing optimization when logically equivalent pins are defined in a pb_type.
The fix-up is called after routing finishes successfully.

Description

  • Add a post-routing cluster block pin fix-up. The results are stored in the ClusteringContext without modifying the original clb_nlist.
  • Add a post-routing net annotation to rr_nodes. The results are stored in the RoutingContext. This annotation is cheap to build for all nets and rr_nodes and is queried frequently by the post-routing fix-up function, so it avoids repeated searches of the rr_graph and ClusteringContext data structures (a minimal sketch follows this list).
  • Add utility functions to vpr_utils.cpp that find physical tile pins for a given capacity offset. This is needed because the old function get_physical_tile_pin() does not consider the z offset within a sub tile's capacity when finding the pin index, which causes a wrong pin index to be found when multiple sub tiles with capacity > 0 are defined under a physical tile.
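A minimal sketch of the rr_node-to-net annotation idea from the second bullet. The container shapes and names below are illustrative assumptions, not the data structures actually added by this PR:

```cpp
#include <cstddef>
#include <vector>

// Illustrative types only (not VPR's): nets and rr_nodes are plain integer ids,
// and each routed net owns the list of rr_node ids its routing uses.
using NetId = int;
using RRNodeId = int;

// Build a reverse map from rr_node to the net occupying it, so the pin fix-up
// can ask "which net drives this pin node?" in O(1) instead of searching the
// rr_graph or the clustered netlist for every query.
std::vector<NetId> annotate_rr_nodes_with_nets(
        const std::vector<std::vector<RRNodeId>>& net_rr_nodes,
        std::size_t num_rr_nodes) {
    std::vector<NetId> rr_node_net(num_rr_nodes, -1); // -1 marks an unused node
    for (NetId net = 0; net < static_cast<NetId>(net_rr_nodes.size()); ++net) {
        for (RRNodeId inode : net_rr_nodes[net]) {
            rr_node_net[inode] = net;
        }
    }
    return rr_node_net;
}
```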

Related Issue

To address the issue.

Motivation and Context

How Has This Been Tested?

  • This feature has passed existing regression tests.
  • New test cases should be added as planned in the issue.

Types of changes

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation
  • I have updated the documentation accordingly
  • I have added tests to cover my changes
  • All new and existing tests passed

@probot-autolabeler probot-autolabeler bot added lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool labels Jun 10, 2020
@tangxifan
Contributor Author

Hi all, I am going to develop test cases for this feature. I plan to output post-synthesis/routing Verilog netlists and run quick verifications. Is there any existing test case for a similar purpose? This can help me a lot. Thanks!

@kmurray
Contributor

kmurray commented Jun 11, 2020

Is there any existing test case for a similar purpose? This can help me a lot. Thanks!

Yep take a look at the tests under:

  • vtr_flow/tasks/regression_tests/vtr_reg_strong/strong_func_formal_flow/
  • vtr_flow/tasks/regression_tests/vtr_reg_strong/strong_func_formal_vpr/

And the -check_equivalent option to run_vtr_flow.pl.


@kmurray kmurray left a comment


Thanks @tangxifan for proposing this. I think it's important to get this handled correctly, as it matters for supporting logical equivalence when targeting real devices.

The main issues I see are:

  • How this information is used after it is recorded. Currently it seems like this PR just puts the remapping in a side data structure.
  • How this handles updating the cluster-internal routing (e.g. for logical equivalence due to partially depopulated crossbars). It seems like that isn't considered by this PR, but I think it is key to getting a general solution to this problem. Thoughts?

There are also some software engineering issues which I've noted below, particularly related to how/where we store this information as well as some other minor comments (see below).

@vaughnbetz
Contributor

Thanks Xifan.
As Kevin mentioned, getting the clusterer to reroute the cluster internals to match the cluster inputs/outputs chosen by the router is important to get a consistent representation and full solution.

That will require some more software engineering to get the clusterer to reroute. I'm not sure if it's best to get that going first, land this partial solution and then add that, or focus on the special case of when the LUT inputs are directly connected to the cluster inputs (LUT rotation). Any thoughts?

@tangxifan
Contributor Author

tangxifan commented Jun 11, 2020

Hi Vaughn and Kevin,
Thanks for the constructive advice.
I have not finished the coding yet. As you said, the post-synthesis netlist generation as well as the packing results have not been updated yet. I will do that in follow-up commits. I will start developing regression tests at the same time.

Currently, I think we can avoid the re-routing of clusters. What I did in OpenFPGA is just to adapt the pb_route results. Since we just swap the nets, the routing paths inside clusters will not be changed (this is what equivalent pins can guarantee). This is still valid for the case where LUT inputs are exposed on the PB inputs. We just need to ensure that the LUT truth tables are adapted accordingly. I will bring some code to fix that in follow-up commits if needed.
This should work fine in terms of functional correctness, but it may make timing worse. I agree that if we want the best QoR, we should reroute the clusters. Personally, I would do this in another pull request: recompute timing criticalities for each cluster and then reroute.

@vaughnbetz
Contributor

Hi Xifan,

"Currently, I think we can avoid the re-routing of clusters. What I did in OpenFPGA is just to adapt the pb_route results. Since we just swap the nets, the routing paths inside clusters will not be changed (this is what equivalent pins can guarantee). "

I haven't looked at this code in detail, but I think what you're saying is that local changes to the pb_route results will work, because equivalent pins means there is a crossbar or some such so small changes to the pb_route should direct the new pin used by the router to the proper location. That seems logical to me. I assume there are minor fix ups needed though, and that's what you mean when you say "adapt the pb_route results". Is that right?

@tangxifan
Contributor Author

Hi Xifan,

"Currently, I think we can avoid the re-routing of clusters. What I did in OpenFPGA is just to adapt the pb_route results. Since we just swap the nets, the routing paths inside clusters will not be changed (this is what equivalent pins can guarantee). "

I haven't looked at this code in detail, but I think what you're saying is that local changes to the pb_route results will work, because equivalent pins means there is a crossbar or some such so small changes to the pb_route should direct the new pin used by the router to the proper location. That seems logical to me. I assume there are minor fix ups needed though, and that's what you mean when you say "adapt the pb_route results". Is that right?

Exactly. The pb_route stores the routing trees of clusters, just as the trace does for global routing. We can simply swap the routing trees between these nets on an equivalent port (a conceptual sketch is below). This should work, as I have tested both fully connected crossbars and 50% depopulated crossbars in OpenFPGA.
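To make the swap idea concrete, here is a purely conceptual sketch; the types are simplified stand-ins, not VPR's pb_route data model:

```cpp
#include <map>
#include <utility>

// Simplified stand-in: the routing tree the packer built inside the cluster,
// rooted at one cluster input pin and ending at the atom pins of one net.
struct PbRouteTree {
    // ... cluster-internal routing tree data ...
};

// After routing, suppose the router brings net B into the pin the packer had
// assigned to net A, and net A into the pin assigned to net B (legal because
// the two pins are logically equivalent). Swapping the trees rooted at the two
// pins lets each net reach its original atom sinks through the pin the router
// actually chose, without rerouting the cluster internals.
void swap_equivalent_pin_routes(std::map<int, PbRouteTree>& pin_to_route,
                                int pin_a, int pin_b) {
    std::swap(pin_to_route.at(pin_a), pin_to_route.at(pin_b));
}
```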

@kmurray
Contributor

kmurray commented Jun 18, 2020

@tangxifan How does the timing analyzer break?

The post-clustering delay calculator caches some information about block internal connectivity (to avoid re-traversing the pb routes). Since we've historically not changed the pb_routes, it is possible that that caching may need to be invalidated when the pb routes are fixed up.
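A purely illustrative sketch of the caching concern (the real post-clustering delay calculator's interface is not shown here): if per-block internal delays are cached from the original pb_routes, the fix-up would need to invalidate the cached entries of every block whose pb_route it rewrites before the next timing analysis.

```cpp
#include <unordered_map>
#include <vector>

// Hypothetical cache of cluster-internal delays, keyed by cluster block id.
struct ClusterDelayCache {
    std::unordered_map<int, std::vector<float>> cached_delays;

    // Called by the fix-up for each block whose pb_route it changes, so the
    // next timing analysis re-traverses that block's (new) internal routing.
    void invalidate_block(int blk) { cached_delays.erase(blk); }
};
```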

@tangxifan
Contributor Author

tangxifan commented Jun 18, 2020

Hi Kevin and Vaughn,
Thanks for the advice. I am applying the fix-up to the routing traces of logic blocks.
As you may see, after updating the routing trace data structure pb_route, I broke the timing analyzer, which complained about inconsistent net ids. I am looking into how to solve that, and I think it should not be a roadblock. @kmurray I see the cached netlist and fast look-ups in the timing analyzer; I am updating my fix-up so that it will not complain.

However, as I dive into the details, I realize that a quick fix-up may not solve the whole problem. The major issue comes from the port equivalence mode called instance (see the documentation). It is a strong property that allows the router to swap logic block outputs even when the outputs are not equivalent. As the documentation says, we must repack (not only reroute) these logic blocks; simply swapping routing trees cannot resolve this problem. This feature is used heavily in the regression tests.

If we go for repacking, it will take longer to implement and the coding complexity will grow significantly, but it is a thorough and clean solution. If you want to keep it light, the current strategy should be OK for equivalent inputs.
Therefore, I would like to hear your advice on this.

 Best regards,
 Xifan Tang

@vaughnbetz
Contributor

I'd be OK with deferring instance equivalence; not supporting the fix-up in that case is fine with me. I think getting the simpler fix in now, and making sure we note that instance equivalence isn't supported (and that we skip the fix-up or otherwise don't crash), would be OK.

@kmurray
Contributor

kmurray commented Jun 29, 2020

I'd be OK with deferring instance equivalence; not supporting the fix-up in that case is fine with me. I think getting the simpler fix in now, and making sure we note that instance equivalence isn't supported (and that we skip the fix-up or otherwise don't crash), would be OK.

Yep, this sounds like the right approach for now.

In addition to noting this in the documentation, perhaps we should print out a warning in the VPR log file noting that this is not being fixed up.
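One possible form of the suggested warning; the wording and the variable names (block_name, port_name) are illustrative only, though VTR_LOG_WARN is the usual VPR logging macro:

```cpp
VTR_LOG_WARN(
    "Port '%s' of block '%s' uses instance equivalence, which the post-routing "
    "pin fix-up does not handle; leaving its pin mapping unchanged.\n",
    port_name, block_name);
```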

@tangxifan
Contributor Author

tangxifan commented Jul 8, 2020

Hi all,
My code for updating pb_route has passed most regression tests (basic and strong) but failed on the Stratix IV architecture.
During debugging, I found a critical concern about the architecture.
The problem is with the LAB data_in port, which is defined on multiple sides of the physical tile.
This can cause multiple drivers, which are not allowed in circuit designs.
Actually, I did find such a problem:
two different nets, styr_in_6_ and n_n262, are mapped to the same pin data_in[1].
This is validated in the styr.route file (see the starred lines in the following code block):

Net 12 (styr_in_6_)

Node:	685	SOURCE (1,4)  Pad: 145  Switch: 0
Node:	1024	  OPIN (1,4)  Pad: 145  Switch: 2
Node:	12635	 CHANX (1,4) to (4,4)  Track: 8  Switch: 1
Node:	6367	  IPIN (4,4)  Pin: 13   LAB.data_in[13] Switch: 0
Node:	6332	  SINK (4,4)  Class: 0  Switch: -1
Node:	12635	 CHANX (1,4) to (4,4)  Track: 8  Switch: 1
Node:	4404	  IPIN (3,4)  Pin: 12   LAB.data_in[12] Switch: 0
Node:	4371	  SINK (3,4)  Class: 0  Switch: -1
Node:	12635	 CHANX (1,4) to (4,4)  Track: 8  Switch: 1
Node:	2431	  IPIN (2,4)  Pin: 6   LAB.data_in[6] Switch: 0
Node:	2410	  SINK (2,4)  Class: 0  Switch: -1
Node:	12635	 CHANX (1,4) to (4,4)  Track: 8  Switch: 2
Node:	12863	 CHANY (4,5)  Track: 8  Switch: 2
Node:	12697	 CHANX (5,5) to (8,5)  Track: 2  Switch: 2
Node:	12952	 CHANY (7,3) to (7,5)  Track: 11  Switch: 2
Node:	12608	 CHANX (4,3) to (7,3)  Track: 5  Switch: 1
Node:	6070	  IPIN (4,3)  Pin: 0   LAB.data_in[0] Switch: 0
Node:	6061	  SINK (4,3)  Class: 0  Switch: -1
Node:	685	SOURCE (1,4)  Pad: 145  Switch: 0
Node:	1025	  OPIN (1,4)  Pad: 145  Switch: 2
Node:	12720	 CHANY (0,1) to (0,4)  Track: 9  Switch: 2
Node:	12587	 CHANX (1,3) to (3,3)  Track: 4  Switch: 1
Node:	2168	  IPIN (2,3)  Pin: 10   LAB.data_in[10] Switch: 0
Node:	2139	  SINK (2,3)  Class: 0  Switch: -1
Node:	12587	 CHANX (1,3) to (3,3)  Track: 4  Switch: 1
Node:	4131	  IPIN (3,3)  Pin: 11   LAB.data_in[11] Switch: 0
Node:	4100	  SINK (3,3)  Class: 0  Switch: -1
Node:	12720	 CHANY (0,1) to (0,4)  Track: 9  Switch: 2
Node:	12545	 CHANX (1,2) to (4,2)  Track: 4  Switch: 1
Node:	3858	  IPIN (3,2)  Pin: 10   LAB.data_in[10] Switch: 0
Node:	3829	  SINK (3,2)  Class: 0  Switch: -1
Node:	12545	 CHANX (1,2) to (4,2)  Track: 4  Switch: 1
**Node:	1879	  IPIN (2,2)  Pin: 1   LAB.data_in[1] Switch: 0**
Node:	1868	  SINK (2,2)  Class: 0  Switch: -1
Net 3 (n_n262)

Node:	4375	SOURCE (3,4)  Class: 4  Switch: 0
Node:	4615	  OPIN (3,4)  Pin: 118   LAB.data_out[29] Switch: 0
Node:	6360	  IPIN (4,4)  Pin: 9   LAB.data_in[9] Switch: 0
Node:	6332	  SINK (4,4)  Class: 0  Switch: -1
Node:	4375	SOURCE (3,4)  Class: 4  Switch: 0
Node:	4573	  OPIN (3,4)  Pin: 97   LAB.data_out[8] Switch: 0
Node:	2556	  IPIN (2,4)  Pin: 68   LAB.data_in[68] Switch: 0
Node:	2410	  SINK (2,4)  Class: 0  Switch: -1
Node:	4573	  OPIN (3,4)  Pin: 97   LAB.data_out[8] Switch: 2
Node:	12793	 CHANY (2,4) to (2,5)  Track: 2  Switch: 1
Node:	4401	  IPIN (3,4)  Pin: 10   LAB.data_in[10] Switch: 0
Node:	4371	  SINK (3,4)  Class: 0  Switch: -1
Node:	4573	  OPIN (3,4)  Pin: 97   LAB.data_out[8] Switch: 2
Node:	12776	 CHANY (2,1) to (2,4)  Track: 5  Switch: 1
Node:	3841	  IPIN (3,2)  Pin: 1   LAB.data_in[1] Switch: 0
Node:	3829	  SINK (3,2)  Class: 0  Switch: -1
Node:	12776	 CHANY (2,1) to (2,4)  Track: 5  Switch: 2
Node:	12603	 CHANX (3,3) to (6,3)  Track: 2  Switch: 2
Node:	12621	 CHANX (7,3) to (8,3)  Track: 10  Switch: 1
Node:	10066	  IPIN (7,3)  Pin: 5   LAB.data_in[5] Switch: 0
Node:	10047	  SINK (7,3)  Class: 0  Switch: -1
Node:	12603	 CHANX (3,3) to (6,3)  Track: 2  Switch: 1
Node:	6076	  IPIN (4,3)  Pin: 3   LAB.data_in[3] Switch: 0
Node:	6061	  SINK (4,3)  Class: 0  Switch: -1
Node:	12776	 CHANY (2,1) to (2,4)  Track: 5  Switch: 2
Node:	12594	 CHANX (1,3) to (2,3)  Track: 11  Switch: 1
Node:	2160	  IPIN (2,3)  Pin: 6   LAB.data_in[6] Switch: 0
Node:	2139	  SINK (2,3)  Class: 0  Switch: -1
Node:	12776	 CHANY (2,1) to (2,4)  Track: 5  Switch: 1
Node:	4114	  IPIN (3,3)  Pin: 2   LAB.data_in[2] Switch: 0
Node:	4100	  SINK (3,3)  Class: 0  Switch: -1
Node:	12776	 CHANY (2,1) to (2,4)  Track: 5  Switch: 2
Node:	12477	 CHANX (3,0) to (6,0)  Track: 4  Switch: 2
Node:	12807	 CHANY (3,1) to (3,4)  Track: 6  Switch: 1
Node:	5822	  IPIN (4,2)  Pin: 11   LAB.data_in[11] Switch: 0
Node:	5790	  SINK (4,2)  Class: 0  Switch: -1
Node:	12776	 CHANY (2,1) to (2,4)  Track: 5  Switch: 2
Node:	12470	 CHANX (1,0) to (2,0)  Track: 13  Switch: 2
Node:	12747	 CHANY (1,1) to (1,2)  Track: 6  Switch: 1
**Node:	1880	  IPIN (2,2)  Pin: 1   LAB.data_in[1] Switch: 0**
Node:	1868	  SINK (2,2)  Class: 0  Switch: -1

I would like to hear your advice on this point. I am close to finishing the pull request.

@vaughnbetz
Contributor

vaughnbetz commented Jul 8, 2020

Thanks Xifan. I think you've found a significant bug that should be fixed.
This routine shows that we look up the rr-index for each pin, on each side of a block, in order to connect it to other nodes:

for (int width_offset = 0; width_offset < type->width; ++width_offset) {

And this line shows that we set up those rr-indices with a unique index for each pin, on each side of a block:

That is incorrect; we should only increment the rr-index the first time we see a specific pin on a specific block; that will lead to one rr-index no matter how many sides of the block the pin appears on.
So this means in rr_graph2.cpp / load_block_rr_indices we should change the code so it increments *index only once for each pin on a block, regardless of how many sides it appears on. It should also be incremented only once no matter how many distinct y_offset and x_offset locations it appears at. I don't know if we have any architectures that put the same logical pin at multiple y_offset and x_offset locations, but we should fix that at the same time.

Xifan, can you fix that code and see if it (1) resolves this issue and (2) doesn't cause any issues in the regtests? I am hopeful this won't be too hard; it looks like just changing when *index is incremented would work: add an inode_allocated flag that is set to false at the start of the ipin loop and set to true on the first *index increment, preventing further increments for that ipin.
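A minimal sketch of the suggested change; the loop structure and names are simplified assumptions rather than the exact rr_graph2.cpp code, and the width/height offset loops are omitted (the same index reuse would apply to them):

```cpp
#include <vector>

enum Side { TOP, RIGHT, BOTTOM, LEFT, NUM_SIDES };

// *index is incremented only the first time a pin is seen; every other side
// on which that pin appears reuses the same rr-node id in the look-up, so the
// single node stays reachable from all of its sides.
void assign_pin_rr_indices(int num_pins,
                           const std::vector<std::vector<bool>>& pin_on_side, // [pin][side]
                           std::vector<std::vector<int>>& lookup,             // [side][pin] -> inode
                           int* index) {
    for (int ipin = 0; ipin < num_pins; ++ipin) {
        bool inode_allocated = false; // reset for each pin
        int pin_inode = -1;
        for (int side = TOP; side < NUM_SIDES; ++side) {
            if (!pin_on_side[ipin][side])
                continue;
            if (!inode_allocated) {
                pin_inode = (*index)++; // increment only on the first sighting
                inode_allocated = true;
            }
            lookup[side][ipin] = pin_inode; // all sides map to the same rr-node
        }
    }
}
```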

@tangxifan
Contributor Author

tangxifan commented Jul 8, 2020

Hi Vaughn,
Thanks for the help. It saves me a lot of time in bug fixing.
As you can see, I have patched the routing resource graph generator so that there is only a unique rr_node for any input pin (receiver) that may be located on multiple sides.
All regression tests pass the flow-run stage, but I see QoR failures on the Titan architecture and related benchmarks.
I think that is expected: as we uniquify the rr_node, the routing channel width and wirelength may increase, because previously multiple nets could map to the same pin and now we force legal routing (more routing resources are required).
If you are OK with that, I will just update the golden results.
I attached the report of the QoR failures here FYI.

   regression_tests/vtr_reg_strong/strong_place_delay_model...[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta routed_wirelength: golden = 702 result = 1058
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override min_chan_width: golden = 18 result = 26
regression_tests/vtr_reg_strong/strong_place_delay_calc_method...[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_--place_delta_delay_matrix_calculation_method_astar routed_wirelength: golden = 702 result = 1058
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override_--place_delta_delay_matrix_calculation_method_astar min_chan_width: golden = 18 result = 26
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_--place_delta_delay_matrix_calculation_method_dijkstra min_chan_width: golden = 16 result = 24

@vaughnbetz
Contributor

Thanks Xifan. Yes, some QoR loss is likely with this, but I hope and expect it will be small.

I agree that tests of the details of the stratixiv arch's timing predictions on small circuits are likely to have significantly different QoR, so those failures are likely not a problem. @kmurray: you know those tests better so please weigh in if you disagree.

Beyond the regtests (which look like small circuits failing) can you create a before and after average QoR comparison for the vtr benchmarks and the titan benchmarks from the nightly tests? That's the best check of QoR -- we want the change to be small on both those. The vtr benchmarks are run with an architecture with pins on only one side, so I would expect no QoR impact on them. Titan should take some hit, but we should see how much (crit_path_routing_time, crit_path, routed_wirelength will all probably change some; other metrics I would expect to be within noise). Can you attach a summary of that data when you have it?

@vaughnbetz
Contributor

One more thing to check: please run the graphics on the titan architecture and basic vtr circuits architecture (two runs) and make sure drawing the routing and drawing the routing graph works correctly.

@litghost
Collaborator

litghost commented Jul 9, 2020

Because this is a new behavior / bug fix relating to equivalent pins, I believe there should be a corresponding documentation update about what is happening here. I believe it will help future users of equivalent pins understand VPR's behavior here without requiring them to read the code.

@probot-autolabeler probot-autolabeler bot added the docs Documentation label Jul 9, 2020
@tangxifan
Contributor Author

tangxifan commented Jul 9, 2020

One more thing to check: please run the graphics on the titan architecture and basic vtr circuits architecture (two runs) and make sure drawing the routing and drawing the routing graph works correctly.

Hi Vaughn,
I have tried the GUI with both architectures used by the VTR and Titan benchmarks. They are working fine.
I have finished the regression tests regression_tests/vtr_reg_weekly/titan_task_list.txt and vtr_reg_weekly/vtr_reg_qor_chain_predictor_off. The results are as follows.

- VTR benchmark results:

  • Flow-run all passed
  • 1 QoR check failure
    • benchmark diffeq2.v
    • Minimum routing channel width reduced from 50 to 46 (Golden was 58, Local test of current master is 50)
  • To study what impacts the routing results, I applied the rr_graph modification on top of the latest master and saw the same QoR change; thus the rr_graph build-up change is what alters the QoR.

My understanding of the min W reduction is that the rr_graph now has fewer dangling nodes at the I/Os, which previously had nodes on 4 sides but with only 1 side connected to the rest of the rr_graph. I attached figures here to help you understand the difference. This helps the router resolve congestion and focus on routable paths during wavefront expansion.
I attached QoR results FYI.
vtr_parse_results.txt

I attached the GUI results to help you understand the difference made by the rr_graph fix-up:

  • Benchmark: diffeq2, Architecture: k6_frac_N10_frac_chain_mem32K_40nm.xml

    • Current master
      • Overview: [screenshot]
      • Zoomed view on I/O and CLB: I/Os on the borders have a lot of redundant nodes and connections. [screenshot]
      • Routing nets: [screenshot]

    • After fix-up
      • Overview: [screenshot]
      • Zoomed view on I/O and CLB: I/Os on the borders no longer have redundant nodes and connections. [screenshot]
      • Routing nets: [screenshot]

  • Benchmark: ucsb_152_tap_fir_stratixiv_arch_timing, Architecture: stratixiv_arch.timing.xml

    • Current master
      • Overview: [screenshot]
      • Zoomed view on I/O and LAB: the same input pin is allocated on different sides. [screenshot]
      • Routing nets: [screenshot]

    • After fix-up
      • Overview: [screenshot]
      • Zoomed view on I/O and LAB: each input pin is allocated on only one side. [screenshot]
      • Routing nets: [screenshot]

- For the Titan benchmarks:

  • Some benchmarks failed in flow-run because W=300 is no longer routable:
    • SparcT1_core
    • SparcT2_core
    • Bitcoin_miner
    • Gaussianblur

@tangxifan
Contributor Author

tangxifan commented Jul 9, 2020

Because this is a new behavior / bug fix relating to equivalent pins, I believe there should be a corresponding documentation update about what is happening here. I believe it will help future users of equivalent pins understand VPR's behavior here without requiring them to read the code.

Indeed. I have updated the docs to add notes on what not to do with the pins.
Personally, I suggest making this fix-up a mandatory step in the flow that is always called after routing finishes.
Users will not be bothered since it will modify nothing for non-equivalent pins.
As such, we can avoid exposing unnecessary details to users.

@vaughnbetz
Contributor

Thanks Xifan. For the Titan architecture we should see each physical input pin connecting to two channels: the top and either the left or the right (half go left, half go right). Output pins are the same. In the pictures above, it looks like after the change the pin is drawn on only one side of the block (which is OK -- I'd probably prefer it drawn on multiple sides as it'll draw more nicely, but if it draws on one it's not a huge deal), but it also now connects to only one channel (the top), which isn't right.

I went through the code again in more detail and I think there is a problem: we should just not increment inode for these multi-side/multi-loc pins, but instead you're skipping making those connections to the other sides altogether (which simply ignores the architecture file's requests for pins to connect to more than one side). I've put in comments about what I think the fix is, but you should verify it by looking at the edges going into a couple of rr-nodes for the Stratix IV arch: we should see the union of the edges that used to go into two rr-nodes (each representing the same pin on a different side) now going into one rr-node that represents that pin.

OPINs that connect to multiple sides will also have the same problem, so I've noted where I think you need a symmetric fix for them above.

@tangxifan
Contributor Author

Thanks Xifan. For the Titan architecture we should see each physical input pin connecting to two channels: the top and either the left or the right (half go left, half go right). Output pins are the same. In the pictures above, it looks like after the change the pin is drawn on only one side of the block (which is OK -- I'd probably prefer it drawn on multiple sides as it'll draw more nicely, but if it draws on one it's not a huge deal), but it also now connects to only one channel (the top), which isn't right.

I went through the code again in more detail and I think there is a problem: we should just not increment inode for these multi-side/multi-loc pins, but instead you're skipping making those connections to the other sides altogether (which simply ignores the architecture file's requests for pins to connect to more than one side). I've put in comments about what I think the fix is, but you should verify it by looking at the edges going into a couple of rr-nodes for the Stratix IV arch: we should see the union of the edges that used to go into two rr-nodes (each representing the same pin on a different side) now going into one rr-node that represents that pin.

OPINs that connect to multiple sides will also have the same problem, so I've noted where I think you need a symmetric fix for them above.

Hi Vaughn,
I think I got the wrong message initially and only considered the first rr_node among all the sides.
Now I see that what you propose is a unique rr_node that can still be indexed through multiple sides.
Then, routing channels on different sides can still reach the same node.
If I have not misread the message, I should be able to correct this problem very quickly.

@vaughnbetz
Contributor

Yes, that's exactly right Xifan.

As I mentioned above, drawing that one node in two positions would be ideal (and I think that may happen automatically if you fix this). But the key thing is to have one rr-node, which has the union of the switches to/from it that the (incorrectly) split rr-nodes placed on different sides used to have.

@tangxifan
Contributor Author

tangxifan commented Jul 10, 2020

Yes, that's exactly right Xifan.

As I mentioned above, drawing that one node in two positions would be ideal (and I think that may happen automatically if you fix this). But the key thing is to have one rr-node, which has the union of the switches to/from it that the (incorrectly) split rr-nodes placed on different sides used to have.

Hi Vaughn,
I have applied this bug fix, and I see the following QoR degradations (only on the Stratix IV arch) in the vtr_reg_strong tests.
In addition to the minimum routable channel width, I see some increase in the metrics crit_path_routing_area_total and crit_path_routing_area_per_tile, about which I currently have no clue.
If you have any insights, please let me know. I am trying to work out whether this is an expected side effect or not.

regression_tests/vtr_reg_strong/strong_custom_grid...[Fail] 
 non_column.xml/raygentop.v/common min_chan_width: golden = 42 result = 56
[Fail] 
 non_column.xml/raygentop.v/common crit_path_routing_area_total: golden = 3.33682e+06 result = 4.38776e+06
[Fail] 
 non_column.xml/raygentop.v/common crit_path_routing_area_per_tile: golden = 3064.11 result = 4029.17
regression_tests/vtr_reg_strong/strong_place_delay_model...[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override min_chan_width: golden = 18 result = 24
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override crit_path_routing_area_total: golden = 84868.6 result = 113235.
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override crit_path_routing_area_per_tile: golden = 1212.41 result = 1617.64
regression_tests/vtr_reg_strong/strong_place_delay_calc_method...[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override_--place_delta_delay_matrix_calculation_method_astar min_chan_width: golden = 18 result = 24
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override_--place_delta_delay_matrix_calculation_method_astar crit_path_routing_area_total: golden = 84868.6 result = 113235.
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_override_--place_delta_delay_matrix_calculation_method_astar crit_path_routing_area_per_tile: golden = 1212.41 result = 1617.64
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_--place_delta_delay_matrix_calculation_method_dijkstra min_chan_width: golden = 16 result = 22
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_--place_delta_delay_matrix_calculation_method_dijkstra min_chan_width_routing_area_total: golden = 60092.3 result = 81870.2
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_--place_delta_delay_matrix_calculation_method_dijkstra min_chan_width_routing_area_per_tile: golden = 858.461 result = 1169.57
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_--place_delta_delay_matrix_calculation_method_dijkstra crit_path_routing_area_total: golden = 74567.7 result = 98656.2
[Fail] 
 stratixiv_arch.timing.xml/styr.blif/common_--place_delay_model_delta_--place_delta_delay_matrix_calculation_method_dijkstra crit_path_routing_area_per_tile: golden = 1065.25 result = 1409.37

@vaughnbetz
Contributor

Hi Xifan,

That is basically 3 failures due to circuits having higher channel widths (the higher channel widths lead to more routing area and cause the other QoR metric failures). I'd expect some channel width increase on architectures with pins on multiple sides since we just made them (correctly) less flexible. Hopefully it is smaller on bigger circuits (which tend to have higher channel widths so the percent increase may be less). Getting a QoR spreadsheet comparing the key stats on the vtr and titan benchmarks should show that. On the vtr ones, we don't expect any degradation. On Titan we expect some, but hopefully it is not too big.

@tangxifan
Contributor Author

tangxifan commented Jul 10, 2020

Hi Vaughn,
I have collected the results of VTR benchmarks and also checked the GUI. I am waiting for Titan to finish but I can share preliminary results here.

Here is the summary:

  • VTR benchmarks passed with 1 QoR failure as shown before:
    • benchmark diffeq2.v

    • Minimum routing channel width reduced from 50 to 46 (Golden was 58, Local test of current master is 50)

    • I have put together a spreadsheet comparing the golden results to the parsed results (see attached). Most metrics show negligible degradation. What looks significant is the min W of the benchmarks diffeq2 and mkSMAAdapter, which I think is reasonable.
      vtr_parse_results.xlsx

    • The GUI all displayed without errors. The VTR architecture shows no difference from the previous screenshots. The Titan arch does have a different drawing; I show the key difference in the zoomed view of the LAB. You can see that only the edges from the SOURCE/SINK to the top-side IPIN/OPIN are drawn. This is because each rr_node can currently record only one side in its node-level attribute, even though it can be indexed through many sides in the fast look-up. When we build the rr_node, we always follow the sequence TOP->RIGHT->BOTTOM->LEFT, so the node-level attribute rr_node.side() is labeled as TOP only, and the GUI just draws that. If we want good GUI drawing, we have to extend the node-level attribute and also adapt the GUI code; this may impact more code (see the sketch after this list).

      • Benchmark: ucsb_152_tap_fir_stratixiv_arch_timing, Architecture: stratixiv_arch.timing.xml

image

  • Titan benchmarks: a few failed in flow-run because W=300 is no longer routable:
    • OpenCV
    • Bitcoin
    • Directrf
    • Gaussianblur is the only one still running, but I think it will fail as before.
      I will keep you posted when I have a similar spreadsheet comparing QoR.
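A sketch of the node-attribute extension mentioned in the GUI bullet above; this is a design assumption for illustration, not code from this PR:

```cpp
#include <bitset>
#include <cstddef>

enum e_side { TOP = 0, RIGHT = 1, BOTTOM = 2, LEFT = 3 };

// Record the set of sides a pin node appears on, instead of a single side,
// so the GUI can draw the node (and its edges) on every side through which
// the fast look-up can reach it.
struct PinNodeSides {
    std::bitset<4> sides; // one bit per TOP/RIGHT/BOTTOM/LEFT

    void add_side(e_side side) { sides.set(static_cast<std::size_t>(side)); }
    bool is_on_side(e_side side) const { return sides.test(static_cast<std::size_t>(side)); }
};
```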

@litghost litghost requested a review from acomodi October 28, 2020 18:41
@litghost
Collaborator

@acomodi Can you please review the changes to vpr_utils.h/.cpp, thanks


@litghost litghost left a comment


LGTM, except for the vpr_utils changes. I've asked @acomodi to please review those changes.


@acomodi acomodi left a comment


Comments in

t_logical_block_type_ptr logical_block,
int sub_tile_capacity,
int pin) {
int sub_tile_index = get_logical_block_physical_sub_tile_index(physical_tile, logical_block, sub_tile_capacity);
Collaborator

You should be able to get the sub_tile_index with the already present function get_sub_tile_index:

int get_sub_tile_index(ClusterBlockId blk) {
    auto& place_ctx = g_vpr_ctx.placement();
    auto& device_ctx = g_vpr_ctx.device();
    auto& cluster_ctx = g_vpr_ctx.clustering();

    auto logical_block = cluster_ctx.clb_nlist.block_type(blk);
    auto block_loc = place_ctx.block_locs[blk];
    auto loc = block_loc.loc;
    int sub_tile_coordinate = loc.sub_tile;

    auto type = device_ctx.grid[loc.x][loc.y].type;

    for (const auto& sub_tile : type->sub_tiles) {
        if (sub_tile.capacity.is_in_range(sub_tile_coordinate)) {
            auto result = std::find(sub_tile.equivalent_sites.begin(), sub_tile.equivalent_sites.end(), logical_block);
            if (result == sub_tile.equivalent_sites.end()) {
                VPR_THROW(VPR_ERROR_PLACE, "The Block Id %d has been placed in an incompatible sub tile location.\n", blk);
            }
            return sub_tile.index;
        }
    }

    VPR_THROW(VPR_ERROR_PLACE, "The Block Id %d has been placed in an impossible sub tile location.\n", blk);
}

This requires the ClusterBlockId to retrieve the placed location, but I think it is worth using this instead of adding a new overloaded get_logical_block_physical_sub_tile_index function.

Contributor Author

I agree. That function could simplify the problem, but it requires a ClusterBlockId, which is a mapping result and is not available in the input arguments. When I created this function, my principle was to obtain device-level information using device-level information only. If a ClusterBlockId is required, the call becomes a data query that depends on mapping results. That being said, I am open to using other functions as long as this rule is respected. Feel free to bug me.

Collaborator

OK, so if you don't have the ClusterBlockId at your disposal when calling this function, you cannot make use of the already existing helper function.

I think the best way forward (probably in a follow-up PR) is to encapsulate these helper functions, which need to rely only on the device information, within the physical tile type data model, as previously discussed. For the time being, having overloaded functions does not seem to be a big issue.

Once I get around to adding more description of how this pin mapping works, I will see if I can correctly encapsulate the functions.

Contributor Author

No worries. I think it would be helpful to clarify in the comments which APIs rely only on the device information and which need mapping information. That will better guide future developers when reworking these APIs. At least, in this pull request, I can add comments to clarify these points. We just need to agree on whether these functions should require only device information or not.

Comment on lines 2353 to 2354
return (sub_tile_capacity - physical_tile->sub_tiles[sub_tile_index].capacity.low) * logical_block->pb_type->num_pins
+ physical_tile->sub_tiles[sub_tile_index].sub_tile_to_tile_pin_indices[sub_tile_physical_pin];
Collaborator

I think this calculation might lead to issues in case the logical block is a subset of the sub tile. In other words, if we use an equivalent logical block which has fewer pins but can still fit in the sub tile, the returned pin would be invalid.

I think there should be a calculation here similar to this one:

int max_num_block_pins = sub_tile.num_phy_pins / sub_tile.capacity.total();
/* Logical location and physical location is offset by z * max_num_block_pins */
int rel_capacity = place_ctx.block_locs[iblk].loc.sub_tile - sub_tile.capacity.low;

for (auto pin : clb_nlist.block_pins(iblk)) {
    int logical_pin_index = clb_nlist.pin_logical_index(pin);
    int sub_tile_pin_index = get_sub_tile_physical_pin(sub_tile_index, physical_tile, logical_block, logical_pin_index);

    int new_physical_pin_index = sub_tile.sub_tile_to_tile_pin_indices[sub_tile_pin_index + rel_capacity * max_num_block_pins];

    auto result = place_ctx.physical_pins.find(pin);
    if (result != place_ctx.physical_pins.end()) {
        place_ctx.physical_pins[pin] = new_physical_pin_index;
    } else {
        place_ctx.physical_pins.insert(pin, new_physical_pin_index);
    }
}

Basically the needed variables are:

  • max_num_block_pins: which is the number of pins within one capacity instance of the sub_tile
  • relative_capacity: which is indeed sub_tile_capacity - physical_tile->sub_tiles[sub_tile_index].capacity.low

Those variables can be used to retrieve the correct physical pin from the sub_tile_to_tile_pin_indices map.
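A hedged sketch of that calculation, wrapped as a standalone function; the struct fields mirror the snippet above but are simplified stand-ins, not the exact VPR sub-tile API:

```cpp
#include <vector>

struct SubTileInfo {
    int num_phy_pins;   // total physical pins across all capacity instances
    int capacity_low;   // first z location covered by this sub tile
    int capacity_total; // number of capacity instances in this sub tile
    std::vector<int> sub_tile_to_tile_pin_indices;
};

// Map a pin index within one capacity instance to the tile-level pin index,
// using the instance's offset (z) within the sub tile.
int sub_tile_pin_to_tile_pin(const SubTileInfo& sub_tile,
                             int sub_tile_capacity,       // absolute z within the tile
                             int sub_tile_physical_pin) { // pin index within one instance
    int max_num_block_pins = sub_tile.num_phy_pins / sub_tile.capacity_total;
    int rel_capacity = sub_tile_capacity - sub_tile.capacity_low;
    return sub_tile.sub_tile_to_tile_pin_indices[sub_tile_physical_pin
                                                 + rel_capacity * max_num_block_pins];
}
```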

Contributor Author

Good point. I had not considered the different numbers of pins between equivalent sites. I believe we would hit a bug without any change. However, as I stated in reply to your previous comment, the function should not involve any mapping results, only device parameters. Since we know the sub_tile index in the function, I would suggest finding the maximum number of pins among the logical blocks in the equivalent sites of this sub tile. Is this OK for you? If there is an existing API for this, I would like to use it!

Collaborator

Looking at this PR code, I think that the modified get_physical_pin function gets called only once here:

int physical_pin = get_physical_pin(physical_tile, logical_block, sub_tile_z, pb_type_pin);

I see that there actually is a cluster block id at your disposal. IMO you could potentially use it, as its purpose within the pin mapping utility functions is to retrieve the location of that cluster block in the device.

Anyhow, I think that getting the max number of pins from the logical block might not be correct. Basically, you have all the information in the sub_tile, which you can use to get the maximum number of pins for each sub_tile location.

E.g.:

int max_num_block_pins = sub_tile.num_phy_pins / sub_tile.capacity.total();

The fact is that even the biggest logical block that fits in a sub tile may not occupy all of the physical pins that are available. If you get the maximum number of pins from the logical block, you might end up with the same issue of mapping the logical pin to a wrong physical pin location.

Collaborator

I see that there actually is a cluster block id at your disposal. IMO you could potentially use it, as its purpose within the pin mapping utility functions is to retrieve the location of that cluster block in the device.

I need to correct this. Actually, using the ClusterBlockId won't work, as placement has not been run yet when the post-routing cluster block fix-up runs (correct?). If I am wrong and the block has already been placed, then you could use the already existing utility functions.

Contributor Author

  • Thanks for the advice on finding the proper number of pins for the sub tile. I have adapted the code accordingly.
  • You are right regarding the use of ClusterBlockId. I could introduce it as an input argument, but I have another consideration here. This function is also used in OpenFPGA when building the Verilog netlists that model the FPGA fabric itself. As a fabric builder, there should be no dependency on mapping results; that is why I am trying to make the function require only device-level information. IMHO, the function is more general-purpose in that shape and can be used more widely. Feel free to let me know your vision here.

@tangxifan tangxifan requested a review from acomodi November 3, 2020 23:13
@vaughnbetz vaughnbetz merged commit 1983615 into verilog-to-routing:master Nov 8, 2020
@tangxifan tangxifan deleted the pb_pin_fixup branch November 12, 2020 21:11