Post routing cluster block pin fix-up #1355
Conversation
Hi all, I am going to develop test cases for this feature. I plan to output post-synthesis/routing Verilog netlists and run quick verifications. Is there any existing test case for a similar purpose? This can help me a lot. Thanks!
Yep, take a look at the tests under:
And the -check_equivalent option to run_vtr_flow.pl.
Thanks @tangxifan for proposing this. I think it's important to get this handled correctly, as it is key to supporting logical equivalence when targeting real devices.
The main issues I see are:
- How this information is used after it is recorded. Currently it seems like this PR just puts the remapping in a side data structure.
- How this handles updating the cluster internal routing (e.g. for logical equivalence due to partially depopulated crossbars). It seems like that isn't considered by this PR, but it is, I think, key to getting a general solution to this problem. Thoughts?
There are also some software engineering issues which I've noted below, particularly related to how/where we store this information as well as some other minor comments (see below).
Thanks Xifan. That will require some more software engineering to get the clusterer to reroute. I'm not sure if it's best to get that going first, land this partial solution and then add that, or focus on the special case of when the LUT inputs are directly connected to the cluster inputs (LUT rotation). Any thoughts?
Hi Vaughn and Kevin, Currently, I think we can avoid the re-routing of clusters. What I did in OpenFPGA is just to adapt the pb_route results. Since we just swap the nets, the routing paths inside clusters will not be changed (this is what equivalent pins can guarantee).
Hi Xifan, "Currently, I think we can avoid the re-routing of clusters. What I did in OpenFPGA is just to adapt the pb_route results. Since we just swap the nets, the routing paths inside clusters will not be changed (this is what equivalent pins can guarantee)." I haven't looked at this code in detail, but I think what you're saying is that local changes to the pb_route results will work, because equivalent pins means there is a crossbar or some such, so small changes to the pb_route should direct the new pin used by the router to the proper location. That seems logical to me. I assume there are minor fix-ups needed though, and that's what you mean when you say "adapt the pb_route results". Is that right?
Exactly.
@tangxifan How does the timing analyzer break? The post-clustering delay calculator caches some information about block internal connectivity (to avoid re-traversing the pb routes). Since we've historically not changed the pb_routes, it is possible that that caching may need to be invalidated when the pb routes are fixed up.
Hi Kevin and Vaughn, However, as I dive more into the details, I realize that a quick fix-up may not solve the whole problem. The major issue comes from the port equivalence mode called instance. If we go for repacking, it will take longer to implement and the coding complexity will grow significantly. It is a thorough and clean solution. If you want to keep it light, the current strategy should be o.k. for equivalent inputs.
I'd be OK with deferring the instance equivalence; if we don't support fix-up in that case, it would be OK with me. I think getting the simpler fix in now, and just making sure we note that instance equivalence isn't supported (and making sure we don't run the fix-up or otherwise crash), would be OK.
Yep, this sounds like the right approach for now. In addition to noting this in the documentation, perhaps we should print out a warning in the VPR log file noting that this is not being fixed up. |
Hi all,
I would like to hear your advice on this point. I am close to finishing the pull request.
Thanks Xifan. I think you've found a significant bug that should be fixed. See vtr-verilog-to-routing/vpr/src/route/rr_graph.cpp, line 1460 (commit 6af4a38).
And this line shows that when we set up those rr-indices, we use a unique index for each pin on each side of a block:
That is incorrect; we should only increment the rr-index the first time we see a specific pin on a specific block; that will lead to one rr-index no matter how many sides of the block the pin appears on. Xifan, can you fix that code and see if it (1) resolves this issue and (2) doesn't cause any issues in the regtests? I am hopeful this won't be too hard; it looks like just changing when *index is incremented would work: add an inode_allocated flag that is set to false at the start of the ipin loop and set to true on the first *index increment, preventing further increments for that ipin.
Hi Vaughn,
Thanks Xifan. Yes, some QoR loss is likely with this, but I hope it is small and expect it would be. I agree that tests of the details of the stratixiv arch's timing predictions on small circuits are likely to have significantly different QoR, so those failures are likely not a problem. @kmurray: you know those tests better so please weigh in if you disagree. Beyond the regtests (which look like small circuits failing) can you create a before and after average QoR comparison for the vtr benchmarks and the titan benchmarks from the nightly tests? That's the best check of QoR -- we want the change to be small on both those. The vtr benchmarks are run with an architecture with pins on only one side, so I would expect no QoR impact on them. Titan should take some hit, but we should see how much (crit_path_routing_time, crit_path, routed_wirelength will all probably change some; other metrics I would expect to be within noise). Can you attach a summary of that data when you have it?
One more thing to check: please run the graphics on the titan architecture and basic vtr circuits architecture (two runs) and make sure drawing the routing and drawing the routing graph works correctly. |
Because this a new behavior / bug fix relating to equivalent pins, I believe there should be a corresponding documentation update about what is happening here. I believe it will help future users of equivalent pins understand VPR's behavior here without requiring them to read the code. |
Hi Vaughn,
- VTR benchmark results:
My understanding of the min W reduction is that the rr_graph now has fewer dangling nodes in the I/Os, which used to have nodes on 4 sides while only 1 side was connected to the rest of the rr_graph. This helps the router resolve congestion and focus on routable paths during wavefront expansion. I attached the GUI results to help you see the difference from the rr_graph fix-up.
[Figure: Zoomed view on I/O and CLB. Before: I/Os on the borders have many redundant nodes and connections. After: they do not.]
[Figure: Zoomed view on I/O and LAB. Before: the same input pin is allocated on different sides. After: only one input pin is allocated on one side.]
- Titan benchmark results:
Indeed. I have updated the docs, adding notes on what not to do with the pins.
Thanks Xifan. For the Titan architecture we should see each physical input pin connecting to two channels: the top and either left or right (half go left, half go right). Output pins are the same. In the pictures above, it looks like after the change the pin is drawn on only one side of the block (which is OK -- I'd probably prefer it drawn on multiple sides as it'll draw more nicely, but if it draws on one it's not a huge deal), but it is also now connecting to only one channel (top), which isn't right. I went through the code again in more detail and I think there is a problem: we should just not increment inode for these multi-side/multi-loc pins, but instead you're skipping making those connections to other sides (which simply ignores requests in the architecture file for pins to connect to more than one side). I've put comments in of what I think the fix is, but you should verify it by looking at the edges going into a couple of rr-nodes for the Stratix IV arch: we should see the union of the edges that used to go into two rr-nodes (each representing the same pin on a different side) now going into one rr-node that represents that pin. OPINs that connect to multiple sides will also have the same problem, so I've noted above where I think you need a symmetric fix for them.
Hi Vaughn, |
Yes, that's exactly right Xifan. As I mentioned above, drawing that one node in two positions would be ideal (and I think that may happen automatically if you fix this). But the key thing is to have one rr-node, which has the union of switches to/from it as the (incorrectly) split rr-nodes that were put on different sides used to have. |
Hi Vaughn,
Hi Xifan, That is basically 3 failures due to circuits having higher channel widths (the higher channel widths lead to more routing area and cause the other QoR metric failures). I'd expect some channel width increase on architectures with pins on multiple sides since we just made them (correctly) less flexible. Hopefully it is smaller on bigger circuits (which tend to have higher channel widths so the percent increase may be less). Getting a QoR spreadsheet comparing the key stats on the vtr and titan benchmarks should show that. On the vtr ones, we don't expect any degradation. On Titan we expect some, but hopefully it is not too big. |
Hi Vaughn, Here is the summary:
@acomodi Can you please review the changes to vpr_utils?
LGTM, except for the vpr_utils changes. I've asked @acomodi to please review those changes.
Comments in vpr_utils.cpp:
t_logical_block_type_ptr logical_block,
                         int sub_tile_capacity,
                         int pin) {
    int sub_tile_index = get_logical_block_physical_sub_tile_index(physical_tile, logical_block, sub_tile_capacity);
You should be able to get the sub_tile_index with the already present function get_sub_tile_index (vtr-verilog-to-routing/vpr/src/util/vpr_utils.cpp, lines 721 to 745 at 37095c7):
int get_sub_tile_index(ClusterBlockId blk) {
    auto& place_ctx = g_vpr_ctx.placement();
    auto& device_ctx = g_vpr_ctx.device();
    auto& cluster_ctx = g_vpr_ctx.clustering();

    auto logical_block = cluster_ctx.clb_nlist.block_type(blk);
    auto block_loc = place_ctx.block_locs[blk];
    auto loc = block_loc.loc;
    int sub_tile_coordinate = loc.sub_tile;

    auto type = device_ctx.grid[loc.x][loc.y].type;
    for (const auto& sub_tile : type->sub_tiles) {
        if (sub_tile.capacity.is_in_range(sub_tile_coordinate)) {
            auto result = std::find(sub_tile.equivalent_sites.begin(), sub_tile.equivalent_sites.end(), logical_block);
            if (result == sub_tile.equivalent_sites.end()) {
                VPR_THROW(VPR_ERROR_PLACE, "The Block Id %d has been placed in an incompatible sub tile location.\n", blk);
            }
            return sub_tile.index;
        }
    }

    VPR_THROW(VPR_ERROR_PLACE, "The Block Id %d has been placed in an impossible sub tile location.\n", blk);
}
This requires the ClusterBlockId to retrieve the placed location, but I think it is worth using this instead of adding a new overloaded get_logical_block_physical_sub_tile_index function.
I agree. The function should be able to simplify the problem, but it requires the ClusterBlockId, which is a mapping result and is not available in the input arguments. When I created this function, my principle was to obtain device-level information using only device-level information. If the ClusterBlockId is required, it becomes a data query depending on mapping results. That being said, I am open to using other functions as long as this rule is respected. Feel free to bug me.
Ok, so, if you don't have the ClusterBlockId at your disposal when calling this function, you cannot make use of the already existing helper function.
I think that the best way forward (probably in a follow-up PR) is to encapsulate these helper functions, which need to rely only on the device information, within the physical tile type data model, as previously discussed. For the time being, having overloaded functions does not seem to be a big issue.
Once I get to adding some more description of how this pin mapping works, I will see if I can correctly encapsulate the functions.
No worries. I think it would be helpful to clarify in the comments which APIs rely only on the device information and which need mapping information. That will better guide future developers when reworking these APIs. At least, in this pull request, I can add comments to clarify these points. We just need to agree on whether these functions should require only device information or not.
vpr/src/util/vpr_utils.cpp (outdated):
return (sub_tile_capacity - physical_tile->sub_tiles[sub_tile_index].capacity.low) * logical_block->pb_type->num_pins
       + physical_tile->sub_tiles[sub_tile_index].sub_tile_to_tile_pin_indices[sub_tile_physical_pin];
I think this calculation might lead to issues in case the logical block is a subset of the sub tile. In other words, if we use an equivalent logical block which has fewer pins but can still fit in the sub tile, the returned pin would be invalid.
I think there should be a calculation similar to this one (vtr-verilog-to-routing/vpr/src/util/vpr_utils.cpp, lines 2166 to 2183 at 37095c7):
int max_num_block_pins = sub_tile.num_phy_pins / sub_tile.capacity.total();
/* Logical location and physical location is offset by z * max_num_block_pins */
int rel_capacity = place_ctx.block_locs[iblk].loc.sub_tile - sub_tile.capacity.low;

for (auto pin : clb_nlist.block_pins(iblk)) {
    int logical_pin_index = clb_nlist.pin_logical_index(pin);
    int sub_tile_pin_index = get_sub_tile_physical_pin(sub_tile_index, physical_tile, logical_block, logical_pin_index);

    int new_physical_pin_index = sub_tile.sub_tile_to_tile_pin_indices[sub_tile_pin_index + rel_capacity * max_num_block_pins];

    auto result = place_ctx.physical_pins.find(pin);
    if (result != place_ctx.physical_pins.end()) {
        place_ctx.physical_pins[pin] = new_physical_pin_index;
    } else {
        place_ctx.physical_pins.insert(pin, new_physical_pin_index);
    }
}
Basically the needed variables are:
- max_num_block_pins: the number of pins within one capacity instance of the sub_tile
- rel_capacity: which is indeed sub_tile_capacity - physical_tile->sub_tiles[sub_tile_index].capacity.low
These variables can be used to retrieve the correct physical pin from the sub_tile_to_tile_pin_indices map.
Good point. I had not considered the different number of pins between equivalent sites. I believe that we will hit a bug without any change. However, as I stated in your previous comment, the function should not involve any mapping results, only device parameters. Since we know the sub_tile index in the function, I would suggest finding the maximum number of pins among the logical_blocks in the equivalent sites of this sub tile. Is this o.k. for you? If you have any existing API, I would like to use it!
Looking at this PR's code, I think that the modified get_physical_pin function gets called only once, here:
    int physical_pin = get_physical_pin(physical_tile, logical_block, sub_tile_z, pb_type_pin);
I see that there actually is a cluster block id at your disposal. Basically, you could potentially use it IMO, as its purpose within the pin mapping utils function is to retrieve the location of that cluster block in the device.
Anyhow, I think that getting the max number of pins from the logical block might not be correct. Basically, you have all the information in the sub_tile, which you can use to get the maximum number of pins for each sub_tile location. E.g.:
    int max_num_block_pins = sub_tile.num_phy_pins / sub_tile.capacity.total();
The fact is that you can always have the biggest logical block that fits in a sub tile not occupy all of the physical pins that are available. If you get the maximum-number-of-pins information from the logical block, you might end up with the same issue of mapping the logical pin to a wrong physical pin location.
"I see that there actually is a cluster block id at disposal. Basically, you could potentially use it IMO, as its purpose within the pin mapping utils function is to retrieve the location of that cluster block in the device."
I need to correct this. Actually, using the ClusterBlockId won't work, as placement has not been run yet when the post-routing cluster block fixup runs (correct?). If I am wrong and the block has already been placed, then you could use the already existing utility functions.
- Thanks for the advice on finding the proper number of pins for the sub tile. I have adapted the code accordingly.
- You are right regarding the use of ClusterBlockId. I could introduce it as an input argument, but I have another consideration here. This function is also used in OpenFPGA when building the Verilog netlists that model an FPGA fabric itself. As a fabric builder, there should be no dependency on mapping results. That is why I am trying to make the function require only device-level information. IMHO, the function is more general-purpose in that shape, and it can be used more widely. Feel free to let me know your vision here.
Add code to fix up the net swapping due to routing optimization when logically equivalent pins are defined in pb_type. The fix-up is called when routing finishes successfully.
Description
- The fix-up results are annotated in the ClusterContext without modifying the original clb_nlist.
- Net mapping annotation on rr_nodes. The results are stored in the RoutingContext. This modification is required because it is fast to create such an annotation for all the nets and rr_nodes. This database is frequently queried by the post-routing fix-up function, which avoids frequent searches on the rr_graph and ClusteringContext data structures.
- Modifications in vpr_utils.cpp. These are required because the old function get_physical_tile_pin() does not consider the z_offset in any sub tile's capacity when finding the pin index. This causes a wrong pin index to be found when multiple sub tiles with capacity > 0 are defined under a physical tile.
Related Issue
To address the issue.