VPR Memory Error when input pin equivalence applied to DSPs and RAMs #1475

Closed
helen-23 opened this issue Aug 7, 2020 · 58 comments · Fixed by #1508
Labels
help wanted · VPR (VPR FPGA Placement & Routing Tool)

Comments

@helen-23
Contributor

helen-23 commented Aug 7, 2020

When running the titan_new DLA benchmark variants with the stratixiv_arch.timing.xml architecture under the Titan flow, VPR succeeded. However, after input pin equivalence was applied to the M9K block in stratixiv_arch.timing.xml, VPR reported memory access failures during routing.

Expected Behaviour

VPR should pass when input pin equivalence is applied to the DSP and RAM blocks in stratixiv_arch.timing.xml.

Current Behaviour

When input pin equivalence is applied to the M9K block in stratixiv_arch.timing.xml:

  1. For the DLA_BSC and DLA_ELT benchmarks, VPR reported a segmentation fault during prune_route_tree_recurr() in route_tree_timing.cpp. This occurred at the beginning of routing. Please see attached VPR log files and screenshot of command line error messages for details.

  2. For the DLA_LRN benchmark, VPR aborted at assert(node) due to node being NULL in the same function mentioned above. Please see attached VPR log file and screenshot of command line error messages for details.

Possible Solution

Maybe there are char-type variables in the router that are used to store sizes instead of uint16, so some values overflowed (wrapped around).
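For illustration, this is the kind of silent wrap-around that hypothesis describes (a generic C++ sketch, not code from the VPR router):

    #include <cstdint>
    #include <iostream>

    int main() {
        // An 8-bit counter silently wraps once a pin/edge count exceeds 255,
        // e.g. an equivalence class with more than 256 members.
        unsigned char count8 = 0;
        std::uint16_t count16 = 0;
        for (int i = 0; i < 300; ++i) {
            ++count8;
            ++count16;
        }
        std::cout << static_cast<int>(count8) << "\n"; // prints 44 (300 mod 256)
        std::cout << count16 << "\n";                  // prints 300
    }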

Steps to Reproduce

  1. Check out my branch, "vqm2bliff_one_lut_removal", which contains all the changes required to run the DLA variants.
  2. Unzip the attached DLA circuits (DLA_BSC, DLA_ELT, and DLA_LRN).
  3. Unzip the modified architecture file (stratixiv_arch.timing.experiment2.xml).
  4. Run the titan_flow script with the DLA circuits and the modified architecture file. Do NOT run titan_flow.py with the sanitizer build turned on, because there is currently an integer overflow in the hash function due to a multiply-add. Be sure to turn on the --fit and --gen_post_fit_netlist options, because the DLA circuits need the post-fit netlist for VPR. An example command looks like the following:

<titan_dir>/scripts/titan_flow.py \
-q DLA_BSC/quartus2_proj/DLA.qpf \
-a stratixiv_arch.timing.experiment2.xml \
--fit \
--gen_post_fit_netlist \
--titan_dir <titan_dir> \
--vqm2blif_dir <vtr_root_dir>/build/utils/vqm2blif \
--quartus_dir /tools/intel/install/fpga/18.1/standard/quartus/bin \
--vpr_dir <vtr_root_dir>/vpr

  5. Unzip vpr.sdc.
  6. Now, with the sanitizer build turned on, run VPR with the post-fit BLIF and vpr.sdc. An example command looks like the following:

<vtr_root_dir>/vpr/vpr \
stratixiv_arch.timing.experiment2.xml \
DLA_stratixiv_post_fit.blif \
--sdc_file vpr.sdc \
--route_chan_width 300 \
--max_router_iterations 400 \
--timing_analysis on \
--timing_report_npaths 1000

Context

Trying out different architecture experiments to make VPR Fmax results more comparable to those of Quartus II for large circuits that are RAM-intensive.

Your Environment

  • VTR revision used: 8.0
  • Operating System and version: Linux Ubuntu 18.04.4 LTS (Bionic Beaver)
  • Compiler version:

Files

DLA_BSC_vpr_stdout.log.zip
DLA_ELT_vpr_stdout.log.zip
DLA_BSC_and_DLA_ELT_seg_fault_error_message
DLA_LRN_vpr_stdout.log.zip
DLA_LRN_assertion_error_message
DLA_BSC.zip
DLA_ELT.zip
DLA_LRN.zip
vpr.sdc.zip
stratixiv_arch.timing.experiment2.xml.zip

@helen-23 helen-23 added the VPR (VPR FPGA Placement & Routing Tool) label Aug 7, 2020
@helen-23 helen-23 assigned vaughnbetz and helen-23 and unassigned vaughnbetz and helen-23 Aug 7, 2020
@vaughnbetz
Contributor

Hi Helen,
If you rerun vpr with
--router_high_fanout_threshold -1
it should turn off the code that uses that function. That will let us see if that function is the problem, or if it is just the first location that crashes when memory has been corrupted somewhere else. Can you run that test and post the results here?
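For example, re-using the VPR command from the reproduction steps above with the flag appended (remaining options as in the reproduction steps):

<vtr_root_dir>/vpr/vpr \
stratixiv_arch.timing.experiment2.xml \
DLA_stratixiv_post_fit.blif \
--sdc_file vpr.sdc \
--route_chan_width 300 \
--router_high_fanout_threshold -1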

@vaughnbetz
Contributor

Checked CBRR, t_rr_node_storage, t_rr_node_route_inf, and t_edge_size. All have the ability to store at least short/uint16 values in everything (pin numbers, edge counts, etc.), so there is no overflow in them with large IPIN or OPIN equivalence classes of more than 256 members. So it doesn't look like it's overflow, unless it's happening in some temporary structures in the rr-graph building code (which is still possible; I will look at them next).

@vaughnbetz
Contributor

Checked the rr-graph building code and rr_node_indices and again don't see anything that is using char for storage of rr-related data. So I don't see an issue with overflow in the rr-graph.

@vaughnbetz
Contributor

Also no problems I can see in the routing data structures (t_rt_tree and t_trace). No chars in them, nothing that should overflow.
@tangxifan : adding you in case you're seeing (or not seeing) a similar problem on your branch when you make the M9K or other blocks have equivalent inputs. I think doing M9Ks alone is probably a good first step; they are not as large as the other blocks, so there is one less thing changing (fewer code paths exercised).

@tangxifan
Contributor

Hi @vaughnbetz
I have tried it on my side and see the same router failure.
In my test, it is an assertion error at route_tree_timing.cpp:938, which says the node is invalid.

/research/ece/lnis/USERS/tang/github/vtr-verilog-to-routing/vpr/src/route/route_tree_timing.cpp:938 prune_route_tree_recurr: Assertion 'node' failed.

I attached my vpr.out FYI
neuron_vpr.zip

@vaughnbetz
Contributor

Thanks Xifan. That's in the same routine: prune_route_tree_recurr. Suggested things to try:

  1. rerun vpr with --router_high_fanout_threshold -1 (as I mentioned to Helen above) to turn off that code and see if it completes. If it does, we've narrowed it down.

  2. If that crashes too, then I think trying an experiment with only M9Ks having input pin equivalence (assuming that's not what you did) is a good idea, to narrow down the change. This case crashes for Helen, so I expect it to crash.

  3. Then look at the rr-graph created for the M9K and see if there is something strange in the IPIN connectivity (especially to the sink, and that there is one sink for the equivalent inputs) and that net_rr_terminals is correct (all connections ending at a single M9K RAM go to that same sink).

@helen-23
Contributor Author

helen-23 commented Aug 9, 2020

Hi @vaughnbetz @tangxifan ,

I've tried running VPR with --router_high_fanout_threshold -1, but it didn't seem to turn off the function. The run still aborted in the same routine: prune_route_tree_recurr. Is there another option that turns the function off?

In the previous run, DLA_BSC and DLA_ELT aborted when trying to access the 'child' attribute of 'node', while DLA_LRN aborted with an assertion error, the same as what Xifan saw in his run. This time, however, all three designs failed when trying to access "edge->child".

I also took a look at the logic of the code. prune_route_tree_recurr() is called by prune_route_tree(), which is called from route_timing.cpp. Before each call to prune_route_tree(), there is a function that checks whether the tree is valid:
check_tree_valid

But this assertion only takes effect if the VTR assertion level is set to 3. I'm thinking of trying that out. If the assertion is never triggered but VPR keeps failing in prune_route_tree, then it's probably not the node and edge pointers themselves going out of bounds, but rather (I suspect) something odd in the structure of the tree or in how the nodes and edges are deleted during pruning.

@helen-23
Contributor Author

helen-23 commented Aug 9, 2020

Found the option that turns route tree pruning off: --min_incremental_reroute_fanout <int>.
I set <int> to 10000 so that it is definitely higher than the largest fanout of any net.
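For reference, the re-run command looks like the earlier one with the extra flag appended (the exact value is arbitrary as long as it exceeds the largest fanout):

<vtr_root_dir>/vpr/vpr \
stratixiv_arch.timing.experiment2.xml \
DLA_stratixiv_post_fit.blif \
--sdc_file vpr.sdc \
--min_incremental_reroute_fanout 10000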

Turns out prune_route_tree_recurr was indeed the first location that crashed due to memory corruption elsewhere. After disabling that function, VPR failed at another location, free_route_tree(t_rt_node*), which recursively frees the nodes and edges in the route tree in a depth-first traversal. Different DLA variants failed for different reasons (same as our previous observations).

DLA_BSC (failed at trying to access attribute of NULL pointer node):
min_incre_reroute_1000

DLA_ELT (failed because the address of an edge is misaligned):
min_incre_reroute_1000_elt

@helen-23 helen-23 closed this as completed Aug 9, 2020
@helen-23 helen-23 reopened this Aug 9, 2020
@helen-23
Contributor Author

helen-23 commented Aug 10, 2020

Also tried adding assertions to check the validity of the route tree at various locations in the code:

VTR_LOG("Checking if routing skeleton tree is valid...");
VTR_ASSERT(is_valid_skeleton_tree(rt_root));

It turns out the tree was still valid before going into free_route_tree(rt_root), so there should not be any NULL or misaligned pointers in the original tree; otherwise, is_valid_skeleton_tree(rt_root) would have aborted during traversal.

(Wild guess) Is it possible that the input pin equivalence somehow introduced multiple edges between the same pair of nodes?

@vaughnbetz
Contributor

Nice detective work Helen!

  1. Yes, I think turning the VPR_ASSERT level up to the max and running is a very good idea.
  2. I think checking the rr-graph is a very good idea. Probably the best thing to do is to dump the rr_graph.echo file (see the developer guide for how) and check the generated rr-graph around the IPINs and SINKs of the M9K RAM (we should turn on input pin equivalence only for it). Use a small design/device so the file is not so big. Do we properly connect all the IPINs to a new SINK that represents all connections to an M9K RAM? Do we properly set that SINK in net_rr_terminals as the node we should route to when routing to an M9K input?

@helen-23
Contributor Author

Thanks, Vaughn, sounds like a plan.
I've tried it on a smaller circuit (the dot product circuit I had earlier to test the DLA designs), and it passed with M9K input pin equivalence enabled. This is probably because the design uses very few, if any, RAM blocks. I'm trying it again with DSP input pin equivalence enabled, and I'll update once that's done.

@helen-23
Contributor Author

helen-23 commented Aug 10, 2020

Also trying just enabling M9K input pin equivalence on a smaller Titan benchmark.

@tangxifan
Contributor

Hi @vaughnbetz

  • During my test, I only set the M9K input port data_addr_control_in to be equivalent. VPR failed in the same way that Helen reported.
  • I have checked the rr_graph through the GUI; as you can see, the IPIN-to-SINK connection is correct. There are 104 pins in the port data_addr_control_in, and the GUI does show 104 incoming edges to the sink node.
    image

@vaughnbetz
Contributor

Thanks Xifan and Helen. Xifan, Helen told me today that she has a candidate bug location identified: it looks like the route_tree code does not properly remove connections that look like:

IPIN -> SINK
     -> SINK
     -> SINK

and so on. I think this may never have been exercised in the code before because logic blocks (when they have input equivalence on) build a clustered netlist that just brings a signal in once (connects to one input). So there wouldn't be a reason to construct a route tree like this. I suspect the general (RAM, DSP, ...) clustered netlist building code is instead building one connection per input pin to a RAM or DSP block, despite the pins being equivalent. That is legal, so the root bug really is in the router not properly handling this case (i.e. I'd rather fix the router than change the netlist builder, although this might be fixable in the clustered netlist builder as well).

Helen is going to try fixing the route_tree bug and hopefully that will fix this.
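A minimal self-contained sketch of the suspected failure mode (toy types, not VPR's actual t_rt_node): a depth-first free that follows every edge visits a shared SINK child once per edge.

    #include <vector>

    struct Node {
        std::vector<Node*> children; // toy stand-in for rt_node edges
    };

    // Naive depth-first free: correct for a true tree, but if two edges
    // point at the same SINK node (as happens when equivalent pins route
    // the same net into one physical sink twice), the second visit frees
    // already-freed memory.
    void free_subtree(Node* n) {
        for (Node* c : n->children)
            free_subtree(c);
        delete n;
    }

    int main() {
        Node* sink = new Node{};
        Node* ipin = new Node{{sink, sink}}; // same child reached via two edges
        free_subtree(ipin);                  // undefined behaviour: sink freed twice
    }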

@helen-23
Contributor Author

helen-23 commented Aug 11, 2020

Thank you, Vaughn, for explaining it all!

Another issue came up though.

For the case that we've observed (as Vaughn mentioned above), here is an example of such a tree structure, which failed during delete_route_tree() (printed to the log file with print_node_tree()):

print_tree_segfault

During deletion, a depth-first approach is used to delete each node from the leaves upward. Once node 33371 is deleted via the first edge that reaches it, subsequent deletions of node 33371 via the other 10 edges result in a segfault. The same thing happens if the prune_route_tree() routine is called.

I fixed delete_route_tree() so that it only frees the first instance of a child node when the same child appears multiple times under one parent node, and it worked:

print_tree_segfault_fixed

However, as the test went further, another type of tree structure showed up, where a child node is associated with multiple parent nodes. This is not considered a valid tree structure, so an assertion error was reported by check_valid_skeleton_tree():

print_tree_assertion error

@vaughnbetz
Contributor

Thanks Helen. I think that is another bug -- a SINK should be allowed to be part of more than one tree branch when it has capacity > 1. So I think relaxing the error check for that case is fine, and hopefully fixes this bug.

@helen-23
Contributor Author

helen-23 commented Aug 12, 2020

Thanks, Vaughn,

Fixed delete_route_tree() and prune_route_tree() to only free a node once in the tree.
Also modified verify_traceback_route_tree_equivalent() and collect_route_tree_connections() to use std::multiset instead of std::set, since there can now be multiple tuples containing the same values {prev_node, prev_switch, to_node}.
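A sketch of both changes, using the same toy types as the failure-mode example above (assumptions: a set of already-freed pointers, and plain ints standing in for VPR's node/switch ids):

    #include <set>
    #include <tuple>
    #include <unordered_set>
    #include <vector>

    struct Node {
        std::vector<Node*> children;
    };

    // Free each node at most once, even when several edges share a child.
    void free_subtree_once(Node* n, std::unordered_set<Node*>& freed) {
        if (!freed.insert(n).second)
            return; // already freed via another edge
        for (Node* c : n->children)
            free_subtree_once(c, freed);
        delete n;
    }

    int main() {
        Node* sink = new Node{};
        Node* ipin = new Node{{sink, sink}};
        std::unordered_set<Node*> freed;
        free_subtree_once(ipin, freed); // sink is now freed exactly once

        // For the traceback/route-tree comparison, a multiset keeps duplicate
        // {prev_node, prev_switch, to_node} tuples instead of collapsing them.
        std::multiset<std::tuple<int, int, int>> connections;
        connections.insert(std::make_tuple(7, 2, 42));
        connections.insert(std::make_tuple(7, 2, 42)); // both copies retained
    }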

One more concern though: an rt_node struct has parent_node and parent_switch members. If a node is allowed to be part of more than one tree branch, its parent_node member now only points to one of its parents. I had to reconstruct the collect_route_tree_connections() logic because it was using the node's parent_node member, which was pointing at the wrong node in some cases. That was an easy fix, but I wasn't sure how to deal with parent_switch. As of now, I am assuming that a node's parent_switch is the same across all of its parent nodes.

My small dot product circuit test finally passed most of routing with input pin equivalence applied to the DSP blocks, but failed with the following error:

Message: in check_sink: node 33371 does not connect to any terminal of net dot_product:generate_dsps[0].DP|dsp_block:generate_dsps[0].Dlast|datab[31] #7.
This error is usually caused by incorrectly specified logical equivalence in your architecture file.
You should try to respecify what pins are equivalent or turn logical equivalence off.

I'll run some other Titan circuits that use M9Ks, apply pin equivalence only to the M9K blocks, and see how that goes with my code modifications. I'm still worried that the parent node issue might cause problems in other parts of the logic. What would you recommend as the best way to deal with them?

@helen-23
Contributor Author

@vaughnbetz
Contributor

Hi Helen,

I can't see that branch. Maybe it isn't pushed to github yet (i.e. is only local on your machine)?

For the non-tree errors:

I think there are two basic ways we could attack it.

  1. We could try to special-case sinks, putting the same rt_node into the tree multiple times when we reach a SINK multiple times in a single route. I think this way is likely to become problematic, as you can't have multiple parent_* pointers etc. from a single rt_node, so we can't really make the rt_tree consistent.

  2. Right now, we will have two separate rt_nodes in a route tree if we go to the same sink twice. The rt_nodes will (in this logical equivalence case) have the same data (rr_node, parent_node, parent_switch), but there will be two of them. So that is fine. There is one problem I can see in that case: we have a mapping from an rr_node (index) to a (single) rt_node:

    rr_node_to_rt_node[inode] = sink_rt_node;

This mapping is only used in route_tree_timing.cpp to rapidly figure out where rt_nodes that we are branching off of (for nets with fanout > 1) are, so we can quickly attach the routing of a new connection to the route_tree. We'll never branch off a SINK (it is the end of a connection) so storing only one of the rr_node -> rt_node mappings for a SINK (if the SINK is inserted multiple times) should be OK. But that should be commented in this file.

If that wasn't OK we could change the mapping to store a vector or list for each entry, but I think we don't need to (can just update the comment).

I think with this approach (#2) we can still use the parent pointers, parent_switch info etc. in the rt_tree so that should be OK.
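To make approach #2 concrete, here is a toy sketch (hypothetical names; VPR's real structures differ): two distinct rt_nodes share one rr_node index, and the rr_node-to-rt_node lookup keeps only one of them, which is safe because routing never branches off a SINK.

    #include <unordered_map>

    struct RtNode {
        int inode;           // rr_node index this route-tree node represents
        RtNode* parent_node; // unambiguous: each rt_node has exactly one parent
        int parent_switch;
    };

    int main() {
        RtNode branch{10, nullptr, 0};

        // The same physical SINK (rr_node 42) reached twice: two separate
        // rt_nodes, each with its own parent pointer, so the tree stays a tree.
        RtNode sink_a{42, &branch, 3};
        RtNode sink_b{42, &branch, 3};

        std::unordered_map<int, RtNode*> rr_node_to_rt_node;
        rr_node_to_rt_node[42] = &sink_a;
        rr_node_to_rt_node[42] = &sink_b; // overwrite is OK: the map is only
                                          // used to branch new routing off the
                                          // tree, and we never branch off a SINK
    }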

@vaughnbetz
Contributor

vaughnbetz commented Aug 12, 2020

I think there is a bug in check_sink. It should be marking netlist pins as done when sinks that could match them are found, but the code isn't correct if you can connect to the same sink more than once.

I've fixed the code below (I haven't compiled or tested it though, so hopefully this works). The code is also simpler :).

// Check all the pins on this net, and find a pin that has not yet been reached that could match this SINK.
// Mark that found pin as done in the pin_done array.
for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
    int pin_index = tile_pin_index(pin_id);
    int iclass = type->pin_class[pin_index];
    if (iclass == ptc_num) {
        /* Could connect to same pin class on the same cluster-level block more than once.
         * Only update pin_done for a pin that hasn't been reached yet. */
        if (pin_done[pin_id] == false) {
            ifound++;
            pin_done[pin_id] = true;
            break;
        }
    }
}

While you are at it, the code below looks old and out of date, and should be replaced with a
VTR_ASSERT(ifound <= 1);

// Delete this code and replace with VTR_ASSERT (ifound <= 1);
    if (ifound > 1 && is_io_type(type)) {
        VPR_FATAL_ERROR(VPR_ERROR_ROUTE,
                        "in check_sink: found %d terminals of net %d of pad %d at location (%d, %d).\n", ifound, size_t(net_id), ptc_num, i, j);
    }

@helen-23
Contributor Author

helen-23 commented Aug 12, 2020

Hi Vaughn,

Thank you for the detailed explanation! Yes, I can see how #2 would work best. I'm currently working on it and will update on whether it works.

I think I got the link to my branch wrong, it should be the following:
https://github.com/verilog-to-routing/vtr-verilog-to-routing/tree/allow_ram_dsp_input_equivalence

Now, I'm noticing a bunch of regtest failures in this branch due to items like:
warning: no previous declaration for ‘void pathfinder_update_cost_from_route_tree(const t_rt_node*, int, float)’

I didn't touch that function so I'm not sure what's going on. I'll have to look into that later as well.

@vaughnbetz
Contributor

We have regtests that treat compiler warnings as errors so that somebody gets rid of the warnings. I am not sure if the declaration for that function was accidentally deleted (which would trigger the warning)?

@helen-23
Contributor Author

helen-23 commented Aug 13, 2020

Hi Vaughn,
I tried what you suggested, but although VPR passed for the dot product circuit, many of the regtests still failed for the same reason.
An example from one of the regtests is the following:
Message: in check_sink: node 3840 does not connect to any terminal of net diffeq_paj_convert^x_var~0_FF #36
The net has 2 fanouts.

Here's part of my code:

// Check all the pins on this net, and find a pin that has not yet been reached that could match this SINK.
// Mark that found pin as done in the pin_done set.
for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
    int pin_index = tile_pin_index(pin_id);
    int iclass = type->pin_class[pin_index];
    if (iclass == ptc_num) {
        /* Could connect to same pin class on the same cluster-level block more than once.
         * Only update pin_done for a pin that hasn't been reached yet. */
        if (pin_done.find(pin_id) == pin_done.end()) {
            pin_done.insert(pin_id);
            ifound++;
            break;
        }
    }
}

In the caller function, I made pin_done a set and inserted the first pin of the net (the driver) into the set, which follows the logic of the original code
(pin_done[0] = true;):
std::set<vtr::StrongId<cluster_pin_id_tag>> pin_done;

auto first_net_id = cluster_ctx.clb_nlist.net_pins(net_id).begin();
pin_done.insert(*first_net_id);

And for checking whether the net is connected to its pins:

for (auto pin_id : cluster_ctx.clb_nlist.net_pins(net_id)) {
    if (pin_done.find(pin_id) == pin_done.end()) {
        int pin_index = tile_pin_index(pin_id);
        VPR_FATAL_ERROR(VPR_ERROR_ROUTE,
                        "in check_route: net %zu does not connect to pin %d.\n", size_t(net_id), pin_index);
    }
}

@tangxifan
Contributor

Thanks @vaughnbetz for the explanation. I now understand better where the bug comes from.
Currently, we are forcing a greedy, false pin equivalence onto the M9K or DSP inputs.
In a fully equivalent port, each pin should be mapped to a unique net.
But right now, the local routing of the M9K is only partially fully-connected, so the logic block router cannot uniquify nets and may map the same net to 2+ pins in a port.
If this is the case, I can try to help Helen improve the Titan architecture by creating truly equivalent ports.
This may bypass the problem we see here.

@vaughnbetz
Contributor

Thanks Xifan. That makes sense. In parallel, I'd like to get this bug fixed too, since I think it is still a bug (but I agree, modifying the architecture file would change the netlist so that this case doesn't happen for the architecture we're creating).

Helen, it is odd that the code doesn't work -- it looked so good :). One alternative is to not check the driver at all, by changing the check loop to only go over sinks:
for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id))

But I am not sure, as I thought your code above would work too!

@helen-23
Contributor Author

helen-23 commented Aug 14, 2020

Hi Vaughn,
I think I might have figured out why our code didn't work. In the original code, it loops through all the blocks associated with the SINK node's xlow() and ylow(), and it loops through all the sink pins on the net. Only sink pins that are associated with those blocks are checked to see if they could match the SINK node.

The total number of sink pins on some nets is actually greater than the number associated with the appropriate blocks, so after removing this loop
for (int iblk = 0; iblk < type->capacity; iblk++)
and this check
if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum)
some pins that were not supposed to be taken in an earlier iteration perhaps ended up taken (pin_done = true) because they were allowed to be checked.

I tested this hypothesis on the master branch (without any of the route_tree_timing.cpp changes), and printed out the total number of sink pins vs. the number of sink pins that were actually checked:
Total number of sink pins: 2 Number of sink pins checked: 1

When I removed the two lines, the same failures occurred in master.

Bringing back the loop and the check, the following code actually worked for all regtests and all of the tests we used for pin equivalence:

for (int iblk = 0; iblk < type->capacity; iblk++) {
    ClusterBlockId bnum = place_ctx.grid_blocks[i][j].blocks[iblk]; /* Hardcoded to one cluster_ctx block */
    unsigned int ipin = 1;
    for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
        if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum) {
            int pin_index = tile_pin_index(pin_id);
            int iclass = type->pin_class[pin_index];
            if (iclass == ptc_num) {
                /* Could connect to same pin class on the same clb more than once.  Only
                 * update pin_done for a pin that hasn't been reached yet. */
                if (pin_done[ipin] == false) {
                    ifound++;
                    pin_done[ipin] = true;
                    break;
                }
            }
        }
        ipin++;
    }
}

@tangxifan
Contributor

@helen-23 I just checked this code and I think you are right.

  • A net may have multiple sinks, and not all the sinks will end at the same block (it may reach different blocks/grid locations). So I believe that
  if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum) {

is necessary.

  • Regarding the block capacity, you have to be careful. The capacity indicates how many blocks can be mapped to a grid location, but in practice a grid location may not be fully occupied. So if you walk through grid_block using the capacity, you may reach some empty blocks. You can check out the grid_block code here. I would suggest walking through the vector using iterators or range-based loops rather than the capacity.

@helen-23
Contributor Author

Thank you @tangxifan. Ah I see. I've replaced

for (int iblk = 0; iblk < type->capacity; iblk++) {
        ClusterBlockId bnum = place_ctx.grid_blocks[i][j].blocks[iblk];

with the following:
for (auto bnum : place_ctx.grid_blocks[i][j].blocks) {

@vaughnbetz
Contributor

vaughnbetz commented Aug 14, 2020

I think another factor in the code above working is that it isn't really checking that we find a sink matching each pin in the net; it just checks that we find N sinks for a fanout-N net, and that each sink appears to connect to a block that is reasonable (matches some pin on this net). That's a weaker check than the one I was trying to code: match a sink to exactly one pin on the net, and then don't match a future sink to that same pin.

The ipin code in both the above and the original degenerates to a simple count of sinks, I believe, as it just fills in the next entry in pin_done each time. It would be better to replace pin_done and ipin with a set over pin_id, as that would give us a test that each sink implements a different net connection. Unfortunately, I think that's the code that's failing, for reasons I don't fully understand (maybe the iclass check is failing sometimes? That check is a bit less fundamental).

So I suggest before leaving this one:

  1. Try the set with this code -- if it works, great! If not, perhaps commenting out the iclass == ptc_num check works?
  2. If we can't make the set work, we can simplify the code above by getting rid of pin_done and just counting how many sinks are connected and ensuring it's the right number. The ipin/pin_done code is a slower and more complicated bit of code that degenerates to a count (it didn't originally, but it has clearly been changed to adapt to the more complex device grid and new netlist, and the check degenerated to a count).

@helen-23
Contributor Author

Hi @vaughnbetz
Ohh okay, right, I see what you mean about ipin being an issue. Yes, I agree that changing pin_done to a set makes more sense. I guess when I was trying different things I accidentally reverted the set I had before.

I've changed pin_done back to a set and it worked for the regtests and the two small test designs (dot product & Carpat).

for (auto bnum : place_ctx.grid_blocks[i][j].blocks) {
    for (auto pin_id : cluster_ctx.clb_nlist.net_sinks(net_id)) {
        if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum) {
            int pin_index = tile_pin_index(pin_id);
            int iclass = type->pin_class[pin_index];
            if (iclass == ptc_num) {
                /* Could connect to same pin class on the same clb more than once.  Only
                 * update pin_done for a pin that hasn't been reached yet. */
                if (pin_done.find(pin_id) == pin_done.end()) {
                    ifound++;
                    pin_done.insert(pin_id);
                    break;
                }
            }
        }
    }
}

@vaughnbetz
Contributor

OK, I like that code better, thanks :).
Just to double check though: taking out the for loop over bnum and the if statement testing bnum makes the test fail? I.e. deleting these lines:
for (auto bnum : place_ctx.grid_blocks[i][j].blocks) {
and
if (cluster_ctx.clb_nlist.pin_block(pin_id) == bnum) {

makes the test fail? If so, baffling, but I'll take it :). Without those lines I think I understand everything; I still can't figure out why they are needed.

@tangxifan
Contributor

@vaughnbetz No worries. I agree that this is a bug that should be fixed as well. FYI, I have adapted the Titan architecture XML to grant DSP input and M144K output pin equivalence. You can find detailed results in pull request #1448.

@helen-23
Contributor Author

Hi @vaughnbetz

Yes, confirming (double checked to make sure) that with the modified code, removing those two lines still makes the test fail.

Regarding the issue we discussed on Friday, where VPR failed to route the DLA circuits due to some OPIN-type nodes having a net delay of 0: the problem is actually not in the route tree net delay updates. In fact, when checking for SINK nodes that have a net delay of 0, OPIN nodes shouldn't even show up as part of the check.

The vector remaining_targets is used to store the IDs of the remaining SINK nodes waiting to be reached during pruning. All node IDs in the vector are then converted into the corresponding SINK pin IDs through
connections_inf.convert_sink_nodes_to_net_pins(remaining_targets) in route_timing.cpp.

It then loops through all remaining targets (for each target_pin in remaining_targets) and updates the rt_node_of_sink array through
rt_node_of_sink[target_pin] = update_route_tree(&cheapest, ((high_fanout) ? &spatial_rt_lookup : nullptr));

The problem is that all repeated nodes have the same node ID, so when node IDs are converted into pin IDs, the same pin ID can appear multiple times in remaining_targets (e.g. remaining_targets = [1,1,3,3,5,5]). Consequently, when rt_node_of_sink[target_pin] is updated, the same element could be updated multiple times while others are not updated at all. As a result, the elements that are not updated could be storing some random pin IDs. This is why we reached some random OPIN nodes when checking through all SINK pins:

for (unsigned ipin = 1; ipin < cluster_ctx.clb_nlist.net_pins(net_id).size(); ++ipin) {
    if (net_delay[ipin] == 0) { // should be SOURCE->OPIN->IPIN->SINK
        VTR_ASSERT(device_ctx.rr_nodes[rt_node_of_sink[ipin]->parent_node->parent_node->inode].type() == OPIN);
    }
}

This means that although the smaller circuits passed, it's likely that they passed by chance (i.e. only random SINK nodes were stored in the elements that were not updated, so even though it was wrong, it did not trigger the assertion). Currently trying out a fix for this.
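A toy illustration of the indexing problem (plain ints instead of VPR's ids): duplicated pin IDs in remaining_targets mean some rt_node_of_sink slots are written twice while others are never written at all.

    #include <iostream>
    #include <vector>

    int main() {
        // Pin ids after converting duplicated sink node ids, as above:
        std::vector<int> remaining_targets = {1, 1, 3, 3, 5, 5};

        // One slot per sink pin (1..6 here); -1 marks a stale entry.
        std::vector<int> rt_node_of_sink(7, -1);
        int fresh = 100; // stand-in for the rt_node returned by update_route_tree()
        for (int target_pin : remaining_targets)
            rt_node_of_sink[target_pin] = fresh++; // pins 1, 3, 5 written twice

        for (int ipin = 1; ipin < 7; ++ipin)
            std::cout << ipin << ": " << rt_node_of_sink[ipin] << "\n";
        // pins 2, 4 and 6 still hold the stale value -1
    }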

@tangxifan Thank you for the fix in the architecture! This is super helpful as it would bypass any issues that could arise with multiple copies of the same nodes.

@helen-23
Contributor Author

helen-23 commented Aug 18, 2020

Actually, the smaller circuits passed because they have very low fanout (less than min_incremental_reroute_fanout) and no pruning is required. In that case, remaining_targets is populated simply through:

for (unsigned int sink_pin = 1; sink_pin <= num_sinks; ++sink_pin)
    connections_inf.toreach_rr_sink(sink_pin);

No node-to-sink conversion is needed in this case, so there's no problem. For larger circuits like DLA, however, we would want to prune the tree. During pruning, remaining_targets is first populated by node IDs and then converted into pin IDs. This is where the problem occurs.

@vaughnbetz
Contributor

Update: Helen is working on a fix where she stores the net_pin (connection id) of each connection on the route tree and traceback, so there is no ambiguity.
Xifan is working on a more complete partial crossbar architecture that will hopefully also avoid this issue (won't connect to the same sink twice), but both fixes are useful and proceeding.
Also, Aman at UT Austin appears to be hitting a similar issue.

@aman26kbm
Contributor

I see a segmentation fault when routing starts after I add a sparse (50% populated) input crossbar to a hard block in the architecture. This is the same observation that Helen and Xifan have, as discussed in this issue.

There are 4 main blocks in the arch I have: CLB, DSP, BRAM, and matmul (a custom hard block I've added). The CLB already had the crossbar. I added a crossbar to the other 3. The crossbar on the DSP works; I don't see the segmentation fault with that. But when I add a crossbar to either the BRAM or the matmul, I get the segmentation fault.

I don't understand a lot of the VPR internals that are discussed here. But having different cases that fail and pass may be helpful in the debug process. So, I am attaching my arch files and the design.

issue_1475_aman_utaustin.zip

The key for the arch file names is: "_yes" means that the block has a local crossbar specified, and "_no" means that the block doesn't. The arch "stratix10.matmul_no.dsp_yes.bram_no.xml" is the one that doesn't run into the segmentation fault. The other two do.

@aman26kbm
Contributor

A couple more things...

Since there's some discussion on different ways of expressing the crossbar: if you want to see the crossbars in my arch files, the best thing to do is to diff the stratix10.matmul_no.dsp_yes.bram_no.xml file with the other two.

And if you have a temp fix, please let me know. I can pull it into my fork, build and run, and see how it goes.

@helen-23 helen-23 mentioned this issue Aug 27, 2020
@helen-23 helen-23 linked a pull request Aug 27, 2020 that will close this issue
@aman26kbm
Contributor

I have a question about working around this issue until it is fixed.

For now, I am specifying the connections using "direct" instead of "complete" and specifying the local crossbar delay (obtained from COFFE) in those "direct" interconnect statements. That feels like paying a double penalty (the router will have difficulty routing because there is no local crossbar, AND the specified delay is reasonably high).

@vaughnbetz, is there a better way to model this until this issue is fixed, or is what I am doing okay?

@vaughnbetz
Contributor

Hi Aman,

I think that's overly conservative; I would use either the lower delay and no pin equivalence, or the higher delay and pin equivalence.
I have just merged in the fix for pin equivalence on DSP and RAM blocks. This makes the quick-and-dirty pin equivalence work, where you just mark pins as equivalent without putting the details of the crossbar in the arch file. Putting the details in the arch file should already have worked.

For large blocks we have found another issue though -- the router lookahead always heads to the lower-left corner of the block, making routing to other pins suboptimal. Marking pins at distinct (x, y) locations on a large block as equivalent seems to exacerbate this by giving the router more flexibility. Xifan is working on a fix for this, but it will take a while. In the interim, if you see poor routing to large blocks you can use a lower astar_fac (e.g. 0.75 or perhaps even 0.5), but it will slow the router down. See #1508 for some discussion on this (near the end).
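For example, appended to your usual VPR command (arch and circuit names are placeholders):

<vtr_root_dir>/vpr/vpr <arch_file>.xml <circuit>.blif --astar_fac 0.75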

@aman26kbm
Contributor

Thanks, @vaughnbetz !

@aman26kbm
Contributor

Coming back to this: I am still seeing a segmentation fault. I tried on the latest master today. Note that I am providing the full details of the crossbar.

I am pasting here snapshots of the differences between two arch files. The left one has no local crossbar on the memory block; the right one has a local crossbar on the memory block.

When I use the left one, things work. When I use the right one, I see a seg fault. Am I doing something incorrectly?

In these arch files, I have a local crossbar on the DSP block specified in the same way, and that works fine.

first
second
third
fourth

@aman26kbm
Contributor

I posted this on the VTR user group as well (https://groups.google.com/g/vtr-users/c/3Fcuo4r4wz4), in case someone else sees it. Please respond there if you can. I appreciate any help.

@vaughnbetz
Contributor

vaughnbetz commented Sep 7, 2020 via email

@aman26kbm
Contributor

Thanks, @vaughnbetz

@tangxifan, it'll be very helpful if you can share your XML arch with the crossbar description.

@tangxifan
Contributor

@aman26kbm
See line 12165 in the XML for how I create crossbars for part of the DSP inputs.
Hope it helps.

stratixiv_arch_DSP_pin_eq.timing.xml.zip

@aman26kbm
Contributor

Thanks, Xifan. I looked at it and it is very similar to how I'm doing it. The weird thing is that it works for the DSP slice, but it doesn't work for memory (or for another hard block I am adding to the architecture). So somehow the crossbar on memory is triggering some specific code path in VPR that has a bug.

I am attaching a zip file that has an arch file and a design file. If you run them, you'll see the segmentation fault.

Desktop.zip

@aman26kbm
Contributor

@vaughnbetz , regarding this point that you had mentioned:

I think that's overly conservative; I would use either the lower delay and no pin equivalence or the higher delay and pin equivalence.

I have some confusion here, so I wanted to enumerate the options we have to get some clarity.

Option #1:
Express the crossbar (using the "complete" keyword) + specify the delay of the local mux from COFFE (using the "delay_constant" tag) + specify pin equivalence ("equivalent = full") -> This is the right thing, but it is crashing.

Option #2:
No crossbar specification (use the "direct" keyword) + specify the delay of the local mux from COFFE for these direct wires (using the "delay_constant" tag) + specify pin equivalence ("equivalent = full") -> I think this is the quick-and-dirty thing mentioned above. Haven't tried this; it may crash or may work.

Option #3:
No crossbar specification (use the "direct" keyword) + specify the delay of the local mux from COFFE + no pin equivalence -> This is the overly conservative thing mentioned above.

Option #4:
No crossbar specification (use the "direct" keyword) + 0 delay + no pin equivalence.
A variation of option #4 (option #4.5):
No crossbar specification (use the "direct" keyword) + some fraction of the local mux delay obtained from COFFE + no pin equivalence.

Do you have a recommendation of what the reasonable thing to do is, for now, until the crash issue is fixed?

@vaughnbetz
Contributor

vaughnbetz commented Sep 9, 2020 via email

@aman26kbm
Contributor

Thanks. I'll try #2 tomorrow to see what happens.
