VPR Memory Error when input pin equivalence applied to DSPs and RAMs #1475
Hi Helen,
Checked CBRR, t_rr_node_storage, t_rr_node_route_inf, and t_edge_size. All of them can store at least short/uint16 values everywhere (pin numbers, edge counts, etc.), so there is no overflow in them even with large IPIN or OPIN equivalence classes of more than 256 members. So it doesn't look like overflow, unless it's happening in some temporary structures in the rr-graph building code (which is still possible; I'll look at those next).
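For reference, a minimal illustration (not VPR code) of the kind of narrow-type overflow being ruled out here: an 8-bit counter silently wraps once an equivalence class exceeds 255 members, while a 16-bit one does not.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // If pin or edge counts were stored in an 8-bit type, an equivalence
    // class with more than 255 members would silently wrap around.
    uint8_t  narrow_count = 0;
    uint16_t wide_count   = 0;
    for (int i = 0; i < 300; ++i) {
        ++narrow_count;  // ends up at 44 (300 mod 256)
        ++wide_count;    // ends up at 300, as expected
    }
    std::cout << int(narrow_count) << " vs " << wide_count << "\n";
    return 0;
}
```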
Checked the rr-graph building code and rr_node_indices, and again I don't see anything that uses char for storage of rr-related data. So I don't see an overflow issue in the rr-graph.
Also no problems that I can see in the routing data structures (t_rt_tree and t_trace): no chars in them, nothing that should overflow.
Hi @vaughnbetz
I attached my
Thanks Xifan. That's in the same routine: prune_route_tree_recurr. Suggested things to try:
Hi @vaughnbetz @tangxifan, I've tried running VPR with --router_high_fanout_threshold -1, but it didn't seem to turn the function off: the run still aborted in the same routine, prune_route_tree_recurr. Is there another option that turns it off?

In the previous run, DLA_BSC and DLA_ELT aborted when trying to access the 'child' attribute of 'node', while DLA_LRN aborted at an assertion error, the same one Xifan saw in his run. This time, however, all three designs ended at trying to access "edge->child".

I also took a look at the logic of the code. prune_route_tree_recurr() is called by prune_route_tree(), which is called from route_timing.cpp. Before each call to prune_route_tree() there is a function that checks whether the tree is valid, but that assertion only takes effect if the VTR assertion level is set to 3, so I'm thinking of trying that out. If the assertion is never triggered but VPR keeps failing in prune_route_tree(), then it's probably not the node pointers and edge pointers themselves going out of bounds, but rather (I suspect) something odd in the structure of the tree or in how the nodes and edges are deleted during pruning.
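For reference, a minimal sketch of the kind of parent/child consistency check being discussed (illustrative only, with simplified stand-in structs rather than VPR's actual route-tree types or its real validity routine):

```cpp
#include <cassert>
#include <unordered_set>

// Simplified stand-ins for VPR's route-tree node and edge types.
struct t_rt_node;
struct t_linked_rt_edge { t_rt_node* child; t_linked_rt_edge* next; };
struct t_rt_node { t_linked_rt_edge* child_list; t_rt_node* parent_node; };

// Depth-first check: every child's parent pointer must point back at the node
// we reached it from, and no node may be reached twice (i.e. it really is a tree).
void check_skeleton_tree(const t_rt_node* node,
                         std::unordered_set<const t_rt_node*>& seen) {
    assert(node != nullptr);
    bool newly_seen = seen.insert(node).second;
    assert(newly_seen && "node appears in the tree more than once");
    for (const t_linked_rt_edge* edge = node->child_list; edge; edge = edge->next) {
        assert(edge->child != nullptr);
        assert(edge->child->parent_node == node);
        check_skeleton_tree(edge->child, seen);
    }
}
```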
Also tried adding assertions to check the validity of the node tree at various locations in the code. (Wild guess) Is it possible that the input pin equivalence somehow introduced multiple edges between the same pair of nodes?
Nice detective work Helen!
Thanks, Vaughn, sounds like a plan.
Also trying just enabling M9K input pin equivalence on a smaller Titan benchmark.
Hi @vaughnbetz
Thanks Xifan and Helen. Xifan, Helen told me today that she has a candidate bug location identified: it looks like the route_tree code does not properly remove connections that look like: and so on.

I think this may never have been exercised in the code before, because logic blocks (when they have input equivalence on) build a clustered netlist that just brings a signal in once (connects to one input), so there wouldn't be a reason to construct a route tree like this. I suspect the general (RAM, DSP, ...) clustered netlist building code is instead building one connection per input pin to a RAM or DSP block, despite the pins being equivalent. That is legal, so the root bug is really in the router not properly handling this case (i.e. I'd rather fix the router than change the netlist builder, although this might be fixable in the clustered netlist builder as well). Helen is going to try fixing the route_tree bug, and hopefully that will fix this.
Thank you, Vaughn, for explaining it all! Another issue came up though. For the case that we've observed (as Vaughn mentioned above), here is an example of a tree structure that failed during delete_route_tree() (from print_node_tree() to the log file):

During deletion, a depth-first approach is used to delete each node from the leaves upwards. Once node 33371 is deleted from the first edge that reaches it, subsequent deletions of node 33371 from the other 10 edges result in a segfault. The same thing happens if the prune_route_tree() routine is called. I fixed delete_route_tree() so that it only removes the first instance of a child node if there are multiple copies of the same child node under one parent node, and it worked:

However, as the test went on further, another type of tree structure showed up, where a child node is associated with multiple parent nodes. This is not considered a valid tree structure, so an assertion error was reported by check_valid_skeleton_tree():
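A minimal sketch of the "free each node only once" idea described here (illustrative stand-in structs again; the actual fix lives in route_tree_timing.cpp):

```cpp
#include <unordered_set>
#include <vector>

struct t_rt_node;
struct t_linked_rt_edge { t_rt_node* child; t_linked_rt_edge* next; };
struct t_rt_node { t_linked_rt_edge* child_list; };

// First pass: collect each node exactly once, even though the same child can
// hang off several edges when a SINK is reached through equivalent input pins.
void collect_route_tree(t_rt_node* node,
                        std::unordered_set<t_rt_node*>& nodes,
                        std::vector<t_linked_rt_edge*>& edges) {
    if (!nodes.insert(node).second) return;  // already collected via another edge
    for (t_linked_rt_edge* e = node->child_list; e != nullptr; e = e->next) {
        edges.push_back(e);
        collect_route_tree(e->child, nodes, edges);
    }
}

// Second pass: free everything, so no node or edge is ever deleted twice.
void free_route_tree_once(t_rt_node* root) {
    std::unordered_set<t_rt_node*> nodes;
    std::vector<t_linked_rt_edge*> edges;
    collect_route_tree(root, nodes, edges);
    for (t_linked_rt_edge* e : edges) delete e;
    for (t_rt_node* n : nodes) delete n;
}
```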
Thanks Helen. I think that is another bug -- a SINK should be allowed to be part of more than one tree branch when it has capacity > 1. So I think relaxing the error check for that case is fine, and hopefully fixes this bug.
Thanks, Vaughn. Fixed delete_route_tree() and prune_route_tree() to only free a node once in the tree. One more concern though: an rt_node struct has a parent_node and a parent_switch member associated with it. If a node is allowed to be part of more than one tree branch, its parent_node member now only points to one of its parents. I had to reconstruct the collect_route_tree_connections() logic because it was making use of the node's parent_node member, which was pointing at the wrong node in some cases. That was an easy fix, but I wasn't sure how to deal with parent_switch. For now, I am assuming that a node's parent_switch is the same across all of its parent nodes. My small dot-product circuit test finally passed most of routing with input pin equivalence applied to the DSP blocks, but failed with the following error:

I'll run some other Titan circuits that use M9Ks, apply pin equivalence only to the M9K blocks, and see how that goes with my code modifications. I'm still worried that the parent node issue might cause problems in other parts of the logic. What would you recommend as the best way to deal with this?
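For context, a simplified sketch of the fields in question (not the exact declaration from route_tree_timing.h), showing why a node with more than one parent makes parent_node and parent_switch ambiguous:

```cpp
// Simplified stand-in for VPR's route-tree node. parent_node/parent_switch can
// only record ONE incoming edge, so if the same SINK node is attached under two
// different parents, whichever attachment is recorded last wins and the other
// parent's switch information is lost.
struct t_linked_rt_edge;
struct t_rt_node {
    t_linked_rt_edge* child_list;    // outgoing edges to child nodes
    t_rt_node*        parent_node;   // only one parent can be remembered here
    short             parent_switch; // switch used on that single recorded parent edge
    int               inode;         // rr-graph node this tree node represents
};
```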
Please see branch "allow_ram_dsp_input_equivalence".
Hi Helen, I can't see that branch. Maybe it isn't pushed to GitHub yet (i.e. it is only local on your machine)? For the non-tree errors: I think there are two basic ways we could attack it.

This mapping is only used in route_tree_timing.cpp to rapidly figure out where the rt_nodes that we are branching off of (for nets with fanout > 1) are, so we can quickly attach the routing of a new connection to the route_tree. We'll never branch off a SINK (it is the end of a connection), so storing only one of the rr_node -> rt_node mappings for a SINK (if the SINK is inserted multiple times) should be OK, but that should be commented in this file. If that weren't OK we could change the mapping to store a vector or list for each entry, but I think we don't need to (we can just update the comment). I think with this approach (#2) we can still use the parent pointers, parent_switch info, etc. in the rt_tree, so that should be OK.
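A minimal sketch of approach #2 as described (illustrative; the real lookup lives inside route_tree_timing.cpp and uses VPR's own types):

```cpp
#include <unordered_map>

struct t_rt_node;  // stand-in for the route-tree node type

// rr-graph node index -> route-tree node, used to find branch points quickly.
// A SINK can legitimately appear in the tree several times (capacity > 1, or a
// block with equivalent input pins), but we never branch off a SINK, so it is
// fine to remember only the first rt_node recorded for it -- as long as that
// choice is documented in a comment.
void record_rt_node(std::unordered_map<int, t_rt_node*>& rr_to_rt,
                    int inode, t_rt_node* rt_node, bool is_sink) {
    if (is_sink) {
        rr_to_rt.emplace(inode, rt_node);  // keep only the first mapping for a SINK
    } else {
        rr_to_rt[inode] = rt_node;         // non-SINK nodes are unique in the tree
    }
}
```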
I think there is a bug in check_sink. It should be marking netlist pins as done as sinks that could match them are found, but the code isn't correct if you can connect to the same sink more than once.
I've fixed the code below (I haven't compiled or tested it though, but hopefully it works). The code is also simpler :).
While you are at it, the code below looks old and out of date, and should be replaced with an
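A rough sketch of the set-based check being described (hypothetical stand-in types and names, not the actual VPR code):

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for a netlist sink pin (block it lands on + pin class).
struct NetPin { int id; int block; int pin_class; };

// For one routed sink (block + equivalence class), mark exactly one
// not-yet-matched pin of the net that it could implement. Reaching the same
// sink class more than once is legal (capacity > 1, equivalent pins); each
// extra hit simply consumes the next unmatched pin.
bool mark_one_matching_pin(const std::vector<NetPin>& net_sinks,
                           int sink_block, int sink_class,
                           std::set<int>& pin_done) {
    for (const NetPin& pin : net_sinks) {
        if (pin_done.count(pin.id)) continue;       // this pin was matched already
        if (pin.block != sink_block) continue;      // must land on the same block
        if (pin.pin_class != sink_class) continue;  // and on the same pin class
        pin_done.insert(pin.id);                    // each pin is matched at most once
        return true;
    }
    return false;  // caller flags a routing error: no remaining pin matches this sink
}

int main() {
    // Two equivalent input pins of the same RAM block on one net:
    std::vector<NetPin> sinks = {{1, /*block*/ 7, /*class*/ 3}, {2, 7, 3}};
    std::set<int> pin_done;
    assert(mark_one_matching_pin(sinks, 7, 3, pin_done));   // matches pin 1
    assert(mark_one_matching_pin(sinks, 7, 3, pin_done));   // matches pin 2
    assert(!mark_one_matching_pin(sinks, 7, 3, pin_done));  // nothing left to match
    return 0;
}
```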
Hi Vaughn, thank you for the detailed explanation! Yes, I can see how #2 would work best. I'm currently working on it and will report back on whether it works. I think I got the link to my branch wrong; it should be the following:

Now I'm noticing a bunch of regtest failures in this branch due to items like:

I didn't touch that function, so I'm not sure what's going on. I'll have to look into that later as well.
We have regtests that treat compiler warnings as errors so that somebody gets rid of the warnings. Not sure if the declaration for that function was accidentally deleted (which would trigger the warning)?
Hi Vaughn, here's part of my code:

In the caller function, I made pin_done a set and inserted the first net pin into the set, which follows the logic of the original code.

And for checking whether the net is connected to pins:
Thanks @vaughnbetz for the explanation. I understand better now where the bug comes from.
Thanks Xifan. That makes sense. In parallel, I'd like to get this bug fixed too, since I think it is still a bug (but I agree, modifying the architecture file would change the netlist so that this case doesn't happen for the architecture we're creating). Helen, it's odd that the code doesn't work -- it looked so good :). One alternative is to not check the driver at all, by changing the check loop to only go over sinks: But I am not sure, as I thought your code above would work too!
Hi Vaughn, the total number of sink pins on some nets is actually greater than the number of sink pins associated with the appropriate blocks, so after removing this loop I tested this hypothesis on the master branch (without any of the route_tree_timing.cpp changes) and printed out the total number of sink pins vs. the number of sink pins that were actually checked: When I removed the two lines, the same failures occurred in master. Bringing back the loop and the check, the following code actually worked for all regtests and all of the tests we use for pin equivalence:
@helen-23 I just checked the code and I think you are right.
is necessary.
Thank you @tangxifan. Ah I see. I've replaced
with the following:
I think another factor in the code above working is that it isn't really checking that we find a sink matching each pin in the net; it just checks that we find N sinks for a fanout-N net, and that each sink appears to connect to a block that is reasonable (matches some pin on this net). That's a weaker check than the one I was trying to code: match a sink to exactly one pin on the net, and then don't match a future sink to that same pin.

The ipin code in both the above and the original degenerates to a simple count of sinks, I believe, as it just fills in the next entry in pin_done each time. It would be better to replace pin_done and ipin with a set over pin_id, as that would give us the test that each sink implements a different net connection. Unfortunately, I think that's the code that's failing, for reasons I don't fully understand (maybe the iclass check is failing sometimes? That check is a bit less fundamental). So I suggest, before leaving this one:
Hi @vaughnbetz I've changed pin_done back to a set and it worked for the regtests and the two small test designs (dot product & Carpat).
OK, I like that code better, thanks :). So just removing those two lines makes the test fail? If so, baffling, but I'll take it :). Without those lines I think I understand everything; I still can't figure out why they are needed.
@vaughnbetz No worries. I agree that this is a bug that should be fixed as well. FYI, I have adapted the Titan architecture XML to grant DSP input and M144K output pin equivalence. You can find detailed results in pull request #1448.
Hi @vaughnbetz Yes, confirming (double-checked to make sure) that with the modified code, removing those two lines still makes the test fail.

With regards to the issue we discussed on Friday, where VPR failed to route the DLA circuits due to some OPIN-type nodes having a net delay of 0: the problem is actually not in the route tree net delay updates. In fact, when it is checking for SINK nodes that have a net delay of 0, OPIN nodes shouldn't even show up as part of the check. The vector remaining_targets is used to store the IDs of the remaining SINK nodes waiting to be reached during pruning. All node IDs in the vector are then converted into the corresponding SINK pin IDs through

It then loops through all remaining targets (for each target_pin in remaining_targets) and updates the rt_node_of_sink array through

The problem is that all repeated nodes have the same node ID, so when node IDs are converted into pin IDs, the same pin ID can appear multiple times in remaining_targets (e.g. remaining_targets = [1,1,3,3,5,5]). Consequently, when rt_node_of_sink[target_pin] is updated, the same element can be updated multiple times while others are not updated at all. As a result, the elements that are not updated could be left holding stale, effectively random entries. This is why we reached some random OPIN nodes when checking through all the SINK pins:

This means that although the smaller circuits passed, it's likely that they passed by chance (i.e. the elements that were not updated happened to hold only SINK nodes, so even though the data was wrong, it did not trigger the assertion). Currently trying out a fix for this. @tangxifan Thank you for the fix in the architecture! This is super helpful, as it would bypass any issues that could arise from multiple copies of the same nodes.
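A small toy illustration of the failure mode described above (made-up IDs and sizes, not the actual VPR routine):

```cpp
#include <iostream>
#include <vector>

int main() {
    // Three sinks of a net were each reached twice because the block's input
    // pins are equivalent, so after converting node IDs to pin IDs the list of
    // remaining targets contains duplicates:
    std::vector<int> remaining_targets = {1, 1, 3, 3, 5, 5};

    // One slot per sink pin (pins 1..6 here). Slots 2, 4 and 6 are never
    // written and keep whatever stale value they held before.
    std::vector<int> rt_node_of_sink(7, -1);
    for (int target_pin : remaining_targets) {
        rt_node_of_sink[target_pin] = 1000 + target_pin;  // written twice each
    }
    for (int pin = 1; pin <= 6; ++pin) {
        std::cout << "sink pin " << pin << " -> " << rt_node_of_sink[pin] << "\n";
    }
    return 0;
}
```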
Actually, the smaller circuits passed because they have very low fanout (less than min_incremental_reroute_fanout) and no pruning is required. In that case, remaining_targets is populated simply through:
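Roughly something like the following (an illustrative sketch, not the exact route_timing.cpp code):

```cpp
#include <vector>

// For a low-fanout net there is no pruning: every sink pin of the net is a
// target, indexed 1..num_sinks, and no node-to-pin conversion is involved.
std::vector<int> low_fanout_targets(int num_sinks) {
    std::vector<int> remaining_targets;
    for (int sink_pin = 1; sink_pin <= num_sinks; ++sink_pin) {
        remaining_targets.push_back(sink_pin);
    }
    return remaining_targets;
}
```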
No node-to-pin conversion is needed in this case, so there's no problem. For larger circuits like DLA, however, we want to prune the tree. During pruning, remaining_targets is first populated with node IDs and then converted into pin IDs; this is where the problem occurs.
Update: Helen is working on a fix where she stores the net_pin (connection id) of each connection on the route tree and traceback, so there is no ambiguity.
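A sketch of that idea (illustrative only; the real change touches the route tree and traceback structures): if each reached sink records which net connection it implements, duplicate SINK rr-nodes are no longer ambiguous.

```cpp
#include <vector>

// If every sink element of the traceback / route tree records the connection
// (net pin index) it implements, two occurrences of the same SINK rr-node map
// to different connections, and each slot of rt_node_of_sink is written exactly once.
struct SinkRecord {
    int inode;    // SINK rr-node index (may repeat across records)
    int net_pin;  // connection id: unique per record, so no ambiguity
};

void update_rt_node_of_sink(const std::vector<SinkRecord>& reached_sinks,
                            std::vector<int>& rt_node_of_sink) {
    for (const SinkRecord& s : reached_sinks) {
        rt_node_of_sink[s.net_pin] = s.inode;  // stand-in for storing the rt_node
    }
}
```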
I see a segmentation fault when routing starts, when I add a sparse 50%-populated input crossbar to a hard block in the architecture. This is the same observation that Helen and Xifan have made, as discussed in this issue.

There are 4 main blocks in the arch I have: CLB, DSP, BRAM, and matmul (a custom hard block I've added). The CLB already had the crossbar; I added a crossbar to the other 3. The crossbar on the DSP works -- I don't see the segmentation fault with that. But when I add a crossbar to either the BRAM or the matmul, I get the segmentation fault.

I don't understand a lot of the VPR internals that are discussed here, but having different cases that fail and pass may be helpful in the debug process, so I am attaching my arch files and the design. The key for the arch file names is: "_yes" means that block has a local crossbar specified, and "_no" means that block doesn't have a local crossbar specified. The arch "stratix10.matmul_no.dsp_yes.bram_no.xml" is the one that doesn't run into the segmentation fault; the other two do.
A couple more things. Since there's been some discussion of different ways of expressing the crossbar: if you want to see the crossbars in my arch files, the best thing to do is to diff stratix10.matmul_no.dsp_yes.bram_no.xml with the other two. And if you have a temp fix, please let me know; I can try it in my fork, build and run, and see how it goes.
I have a question about working around this issue until it is fixed. For now, I am specifying the connections using "direct" instead of "complete" and specifying the local crossbar delay (obtained from COFFE) in those "direct" interconnect statements. That feels like paying a double penalty (the router will have difficulty routing because there is no local crossbar, AND the delay specified is reasonably high). @vaughnbetz, is there a better way to model this until this issue is fixed, or is what I am doing okay?
Hi Aman, I think that's overly conservative; I would use either the lower delay and no pin equivalence, or the higher delay and pin equivalence. For large blocks we have found another issue though -- the router lookahead always heads to the lower-left corner of the block, making routing to the other pins suboptimal. Making pins at distinct locations on a large block (different x,y locations within the block) equivalent seems to exacerbate this, by giving the router more flexibility. Xifan is working on a fix for this, but it will take a while. In the interim, if you see poor routing to large blocks you can use a lower astar_fac (e.g. 0.75 or perhaps even 0.5), but it will slow the router down. See #1508 for some discussion of this (near the end).
Thanks, @vaughnbetz!
Coming back to this: I am still seeing a segmentation fault. I tried on the latest master today. Note that I am providing the full details of the crossbar. I am pasting here snapshots of the differences between two arch files. The left one has no local crossbar on the memory block; the right one has a local crossbar on the memory block. When I use the left one, things work; when I use the right one, I see a seg fault. Am I doing something incorrectly? In these arch files, I have a local crossbar on the DSP block specified in the same way, and that works fine.
I posted it on the VTR user group as well (https://groups.google.com/g/vtr-users/c/3Fcuo4r4wz4), in case someone else sees it. Please respond there if you can. Appreciate any help.
Adding Xifan in case he can share his partial crossbar for memory blocks, as that may help. From looking at the arch file it's not apparent to me what's causing the segfault. Alas, we need more user-friendly errors on arch file parsing for cases like this.
Best,
Vaughn
Thanks, @vaughnbetz @tangxifan, it'll be very helpful if you can share your XML arch with the crossbar description.
@aman26kbm
Thanks, Xifan. I looked at it and it is very similar to how I'm doing it. The weird thing is that it works for the DSP slice, but it doesn't work for memory (or for another hard block I am adding to the architecture). So somehow the crossbar on memory triggers some specific code path in VPR that has a bug. I am attaching a zip file that has an arch file and a design file. If you run them, you'll see the segmentation fault.
@vaughnbetz, regarding this point that you had mentioned: "I think that's overly conservative; I would use either the lower delay and no pin equivalence or the higher delay and pin equivalence."

I have some confusion here. I wanted to enumerate the options we have, to get some clarity.

Option #1: Express the crossbar (using the "complete" keyword) + specify the delay of the local mux from COFFE (using the "delay_constant" tag) + specify pin equivalence ("equivalent = full"). This is the right thing, but it is crashing.

Option #2: No crossbar specification (use the "direct" keyword) + specify the delay of the local mux from COFFE for these direct wires (using the "delay_constant" tag) + specify pin equivalence ("equivalent = full"). I think this is the quick and dirty thing mentioned above. Haven't tried this; it may crash or may work.

Option #3: No crossbar specification (use the "direct" keyword) + specify the delay of the local mux from COFFE + no pin equivalence. This is the overly conservative thing mentioned above.

Option #4: No crossbar specification (use the "direct" keyword) + 0 delay + no pin equivalence. A variation of option #4 (option 4.5) is: no crossbar specification (use the "direct" keyword) + some fraction of the local mux delay obtained from COFFE + no pin equivalence.

Do you have a recommendation of what the reasonable thing to do is, for now, until the crash issue is fixed?
I would do #2 if it works (it used to have a bug, but Helen fixed that, so hopefully it now works). It is almost the same as #1.

Note that for large blocks the lookahead is not great right now (it always tries to go to the lower-left corner of the block), so a lower astar_fac (0.75, maybe even 0.5 if the CPU time is not crazy) may be helpful.

If #2 and #1 don't work, #4 is next best, as it is self-consistent (but harder to route). With #4 you could also leave out the local crossbar area, since you are not using one (more self-consistent still).

Vaughn
Thanks. I'll try #2 tomorrow to see what happens.
When running the titan_new DLA benchmark variants with the stratixiv_arch.timing.xml architecture under the Titan flow, VPR succeeded. However, after input pin equivalence was applied to the M9K block in stratixiv_arch.timing.xml, VPR reported memory access failures during routing.
Expected Behaviour
VPR should pass when input pin equivalence is applied to the DSP and RAM blocks in stratixiv_arch.timing.xml.
Current Behaviour
When input pin equivalence is applied to the M9K block in stratixiv_arch.timing.xml:
For the DLA_BSC and DLA_ELT benchmarks, VPR reported a segmentation fault in prune_route_tree_recurr() in route_tree_timing.cpp. This occurred at the beginning of routing. Please see the attached VPR log files and the screenshots of the command-line error messages for details.
For the DLA_LRN benchmark, VPR aborted at assert(node) because node was NULL in the same function mentioned above. Please see the attached VPR log file and screenshot of the command-line error messages for details.
Possible Solution
Maybe there are char-type variables in the router that are used to store sizes instead of uint16, so some values overflowed.
Steps to Reproduce
<titan_dir>/scripts/titan_flow.py \
-q DLA_BSC/quartus2_proj/DLA.qpf \
-a stratixiv_arch.timing.experiment2.xml \
--fit \
--gen_post_fit_netlist \
--titan_dir <titan_dir> \
--vqm2blif_dir <vtr_root_dir>/build/utils/vqm2blif \
--quartus_dir /tools/intel/install/fpga/18.1/standard/quartus/bin \
--vpr_dir <vtr_root_dir>/vpr
<vtr_root_dir>/vpr/vpr \
stratixiv_arch.timing.experiment2.xml \
DLA_stratixiv_post_fit.blif \
--sdc_file vpr.sdc \
--route_chan_width 300 \
--max_router_iterations 400 \
--timing_analysis on \
--timing_report_npaths 1000
Context
Trying out different architecture experiments to make VPR Fmax results more comparable to those of Quartus II for large circuits that are RAM-intensive.
Your Environment
Files
DLA_BSC_vpr_stdout.log.zip


DLA_ELT_vpr_stdout.log.zip
DLA_LRN_vpr_stdout.log.zip
DLA_BSC.zip
DLA_ELT.zip
DLA_LRN.zip
vpr.sdc.zip
stratixiv_arch.timing.experiment2.xml.zip