Placer reporting cost errors during consistency checks #1402


Closed
litghost opened this issue Jul 6, 2020 · 45 comments

@litghost
Collaborator

litghost commented Jul 6, 2020

Expected Behaviour

Placer places circuit and writes placement file.

Current Behaviour

Placer aborts with the following error:

BB estimate of min-dist (placement) wire length: 310
Error 1: timing_cost_check: 1.75857e-08 and timing_cost: 1.17164e-09 differ in check_place.
Incr Slack updates 1 in 7.5417e-05 sec
Full Max Req/Worst Slack updates 1 in 1.8772e-05 sec
Incr Max Req/Worst Slack updates 0 in 0 sec
Incr Criticality updates 0 in 0 sec
Full Criticality updates 1 in 1.5237e-05 sec
# Placement took 0.23 seconds (max_rss 673.7 MiB, delta_rss +0.0 MiB)
Error 2: 
Type: Placement
File: vtr-verilog-to-routing/vpr/src/place/place.cpp
Line: 2689
Message: 
Completed placement consistency check, 1 errors found.
Aborting program.

Possible Solution

Working on a reduced test case. The circuit in this case was very simple: multiple inpad -> outpad connections.

Steps to Reproduce

1. Get input test data:

  • wget https://storage.googleapis.com/debug_data/vpr_failing_test.tar.bz2
  • tar -xf vpr_failing_test.tar.bz2

2. Build Symbiflow VPR at c62dcaf:

  • git clone https://github.com/litghost/vtr-verilog-to-routing.git
  • cd vtr-verilog-to-routing
  • git checkout -b test
  • git reset --hard 87c4d426a5096dc6b8ab7e9adb0c6a39b2154bd2
  • mkdir build
  • cd build
  • cmake -DVTR_ASSERT_LEVEL=3 ..
  • make -j

3. Run failing test case:

  • cd ../../vpr_failing_test
  • cp ../vtr-verilog-to-routing/vpr/vpr .
  • ./run.sh

Context

Your Environment

  • VTR revision used:
  • Operating System and version:
  • Compiler version:
@litghost
Collaborator Author

litghost commented Jul 6, 2020

I'm working on identifying when this abort started happening. @acomodi believes it was around when the incremental timing analysis was added.

@litghost
Collaborator Author

litghost commented Jul 6, 2020

I tested a circuit that was a little more complicated (a counter), rather than simple straight inport -> outport, same issue. I'll see how much I can reduce the circuit and still produce the error.

@HackerFoo
Contributor

I also ran into this issue when working on #1205. Disabling incremental timing avoided the error.

@litghost
Collaborator Author

litghost commented Jul 7, 2020

I also ran into this issue when working on #1205. Disabling incremental timing avoided the error.

Which graph were you running on? So far I've noticed the placer error showing up on 7-series ROI based graphs, but I would've assumed you would've been testing on either full 7-series graphs or a titan arch.

@litghost
Collaborator Author

litghost commented Jul 7, 2020

@acomodi mentioned this, but if the ASSERT level is increased from 2 to 3 the error changes to:

vtr-verilog-to-routing/vpr/src/place/place.cpp:1963 comp_td_connection_cost: Assertion 'ipin > 0' failed (Shouldn't be calculating connection timing cost for driver pins).

@acomodi
Collaborator

acomodi commented Jul 7, 2020

An additional note: the function that provides the driver pin causing this assertion error is the following:

static void update_td_costs(const PlaceDelayModel* delay_model, const PlacerCriticalities& place_crit, double* timing_cost) {
    /* NB: We must be careful calculating the total timing cost incrementally,
     * due to limited floating point precision, so that we get a
     * bit-identical result matching that calculated by comp_td_costs().
     *
     * In particular, we can not simply calculate the incremental
     * deltas caused by changed connection timing costs and adjust
     * the timing cost. Due to limited precision, the results of
     * floating point math operations are order dependent and we
     * would get a different result.
     *
     * To get around this, we calculate the timing costs hierarchically
     * to ensure we calculate the sum with the same order of operations
     * as comp_td_costs().
     *
     * See the PlacerTimingCosts object used to represent connection_timing_cost
     * for details.
     */
    vtr::Timer t;
    auto& cluster_ctx = g_vpr_ctx.clustering();
    auto& clb_nlist = cluster_ctx.clb_nlist;

    //Update the modified pin timing costs
    {
        vtr::Timer timer;
        auto clb_pins_modified = place_crit.pins_with_modified_criticality();
        for (ClusterPinId clb_pin : clb_pins_modified) {
            if (clb_nlist.pin_type(clb_pin) == PinType::DRIVER) continue;

            ClusterNetId clb_net = clb_nlist.pin_net(clb_pin);
            VTR_ASSERT_SAFE(clb_net);

            if (cluster_ctx.clb_nlist.net_is_ignored(clb_net)) continue;

            int ipin = clb_nlist.pin_net_index(clb_pin);
            VTR_ASSERT_SAFE(ipin >= 0 && ipin < int(clb_nlist.net_pins(clb_net).size()));

            double new_timing_cost = comp_td_connection_cost(delay_model, place_crit, clb_net, ipin);

            //Record new value
            connection_timing_cost[clb_net][ipin] = new_timing_cost;
        }
        f_update_td_costs_connections_elapsed_sec += timer.elapsed_sec();
    }

    //Re-total timing costs of all nets
    {
        vtr::Timer timer;
        *timing_cost = connection_timing_cost.total_cost();
        f_update_td_costs_sum_nets_elapsed_sec += timer.elapsed_sec();
    }

#ifdef VTR_ASSERT_DEBUG_ENABLED
    double check_timing_cost = 0.;
    comp_td_costs(delay_model, place_crit, &check_timing_cost);
    VTR_ASSERT_DEBUG_MSG(check_timing_cost == *timing_cost,
                         "Total timing cost calculated incrementally in update_td_costs() is "
                         "not consistent with value calculated from scratch in comp_td_costs()");
#endif
    f_update_td_costs_total_elapsed_sec += t.elapsed_sec();
}

@HackerFoo
Contributor

@litghost I was using the full 50t graph. I assumed that the problem was triggered by my new placement code.

@litghost
Collaborator Author

litghost commented Jul 7, 2020

@litghost I was using the full 50t graph. I assumed that the problem was triggered by my new placement code.

No, I don't believe your new placement code alone triggers this issue. However, in my limited testing the full 50T graph did not produce this error, so it's possible that ROI, or full 50T + new placement, triggers the issue. There is some kind of latent issue being uncovered here.

@litghost
Collaborator Author

litghost commented Jul 8, 2020

I did a quick test with HackerFoo@7e35d15 and the counter test on the xc7 50T ROI arch was able to complete if assert level was 2. On assert level 3, I got:

vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' 
failed (Connection delays should already be updated).  

Will continue investigating this week.

@litghost
Collaborator Author

litghost commented Jul 9, 2020

@kmurray Why is the assertion at

VTR_ASSERT_SAFE_MSG(ipin > 0, "Shouldn't be calculating connection timing cost for driver pins");
valid? In the parent context DRIVER pin types are ignored, see here: https://github.com/verilog-to-routing/vtr-verilog-to-routing/blame/master/vpr/src/place/place.cpp#L1889

I'm not clear on why ipin == 0 is an assertion violation.

@litghost
Collaborator Author

litghost commented Jul 9, 2020

I did a quick test with HackerFoo@7e35d15 and the counter test on the xc7 50T ROI arch was able to complete if assert level was 2. On assert level 3, I got:

vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' 
failed (Connection delays should already be updated).  

Will continue investigating this week.

The assertion from here should probably be behind the incremental #ifdef.

@vaughnbetz
Contributor

The assumption that pin #0 on a net is the driver is deeply embedded in the netlist data structures, so the assert looks valid.
I'd run the sanitizers to check for memory faults, since this is a strange failure. I agree that this assertion doesn't look like it should fire based on the code, so something strange is going on (memory corruption? bad data in the architecture file?).

@vaughnbetz
Contributor

@Bill-hbrhbr this may be of interest; hard to debug without being able to run Symbiflow though.

@litghost
Collaborator Author

litghost commented Jul 9, 2020

@vaughnbetz / @Bill-hbrhbr I've added instructions for replicating our failure. I can provide more blif's if that is useful, however the current one is just a single input / output.

@vaughnbetz
Contributor

@Bill-hbrhbr : Keith has provided instructions on how to reproduce at the top of this issue -- can you take a look?

@Bill-hbrhbr
Contributor

The inherent rule is that criticalities, delays, and timing costs are only calculated for sink pins, not driver pins. All the data structures (or at least Kevin's incremental STA code) are based on this fact, including the PlacerTimingCosts heap in timing_place.h.

An additional note: the function that provides the driver pin causing this assertion error is the following:

static void update_td_costs(const PlaceDelayModel* delay_model, const PlacerCriticalities& place_crit, double* timing_cost) {
    /* NB: We must be careful calculating the total timing cost incrementally,
     * due to limited floating point precision, so that we get a
     * bit-identical result matching that calculated by comp_td_costs().
     *
     * In particular, we can not simply calculate the incremental
     * deltas caused by changed connection timing costs and adjust
     * the timing cost. Due to limited precision, the results of
     * floating point math operations are order dependent and we
     * would get a different result.
     *
     * To get around this, we calculate the timing costs hierarchically
     * to ensure we calculate the sum with the same order of operations
     * as comp_td_costs().
     *
     * See the PlacerTimingCosts object used to represent connection_timing_cost
     * for details.
     */
    vtr::Timer t;
    auto& cluster_ctx = g_vpr_ctx.clustering();
    auto& clb_nlist = cluster_ctx.clb_nlist;

    //Update the modified pin timing costs
    {
        vtr::Timer timer;
        auto clb_pins_modified = place_crit.pins_with_modified_criticality();
        for (ClusterPinId clb_pin : clb_pins_modified) {
            if (clb_nlist.pin_type(clb_pin) == PinType::DRIVER) continue;

            ClusterNetId clb_net = clb_nlist.pin_net(clb_pin);
            VTR_ASSERT_SAFE(clb_net);

            if (cluster_ctx.clb_nlist.net_is_ignored(clb_net)) continue;

            int ipin = clb_nlist.pin_net_index(clb_pin);
            VTR_ASSERT_SAFE(ipin >= 0 && ipin < int(clb_nlist.net_pins(clb_net).size()));

            double new_timing_cost = comp_td_connection_cost(delay_model, place_crit, clb_net, ipin);

            //Record new value
            connection_timing_cost[clb_net][ipin] = new_timing_cost;
        }
        f_update_td_costs_connections_elapsed_sec += timer.elapsed_sec();
    }

    //Re-total timing costs of all nets
    {
        vtr::Timer timer;
        *timing_cost = connection_timing_cost.total_cost();
        f_update_td_costs_sum_nets_elapsed_sec += timer.elapsed_sec();
    }

#ifdef VTR_ASSERT_DEBUG_ENABLED
    double check_timing_cost = 0.;
    comp_td_costs(delay_model, place_crit, &check_timing_cost);
    VTR_ASSERT_DEBUG_MSG(check_timing_cost == *timing_cost,
                         "Total timing cost calculated incrementally in update_td_costs() is "
                         "not consistent with value calculated from scratch in comp_td_costs()");
#endif
    f_update_td_costs_total_elapsed_sec += t.elapsed_sec();
}

Indeed, the bug appeared because place_crit.pins_with_modified_criticality() provided a driver pin in the failing test case and had its timing cost computed, which corrupted the heap data structure used for calculating the total timing cost (the sum of all sink pins' costs).

This effect can be triggered by many different test cases, and it also depends on how the make_range function orders the elements in the vtr_id_set structure (which in turn affects how the heap is updated).

This should not happen. I suspect that the cause is within the following function:

void PlacerCriticalities::update_criticalities(const SetupTimingInfo* timing_info, float crit_exponent) {
    /* Performs a 1-to-1 mapping from criticality to timing_place_crit_.
     * For every pin on every net (or, equivalently, for every tedge ending
     * in that pin), timing_place_crit_ = criticality^(criticality exponent) */

    //Determine what pins need updating
    if (INCR_UPDATE_CRITICALITIES) {
        cluster_pins_with_modified_criticality_.clear();
        if (crit_exponent != last_crit_exponent_) {
            //Criticality exponent changed, must re-calculate *all* criticalities
            auto pins = clb_nlist_.pins();
            cluster_pins_with_modified_criticality_.insert(pins.begin(), pins.end());

            //Record new criticality exponent
            last_crit_exponent_ = crit_exponent;
        } else {
            //Criticality exponent unchanged
            //
            //Collect the cluster pins which need to be updated based on the latest timing
            //analysis
            //
            //Note we use the set of pins reported by the *timing_info* as having modified
            //criticality, rather than those marked as modified by the timing analyzer.
            //Since timing_info uses shifted/relaxed criticality (which depends on max
            //required time and worst case slacks), additional nodes may be modified
            //when updating the atom pin criticalities.
            for (AtomPinId atom_pin : timing_info->pins_with_modified_setup_criticality()) {
                ClusterPinId clb_pin = pin_lookup_.connected_clb_pin(atom_pin);

                //Some atom pins correspond to connections which are completely
                //contained within a cluster, and hence have no corresponding
                //clustered pin.
                if (!clb_pin) continue;

                cluster_pins_with_modified_criticality_.insert(clb_pin);
            }
        }
    } else {
        //Non-incremental: all pins and nets need updating
        auto pins = clb_nlist_.pins();
        cluster_pins_with_modified_criticality_.insert(pins.begin(), pins.end());
    }

    //Update the affected pins
    for (ClusterPinId clb_pin : cluster_pins_with_modified_criticality_) {
        ClusterNetId clb_net = clb_nlist_.pin_net(clb_pin);
        int pin_index_in_net = clb_nlist_.pin_net_index(clb_pin);

        float clb_pin_crit = calculate_clb_net_pin_criticality(*timing_info, pin_lookup_, clb_pin);

        /* The placer likes a great deal of contrast between criticalities.
         * Since path criticality varies much more than timing, we "sharpen" timing
         * criticality by taking it to some power, crit_exponent (between 1 and 8 by default). */
        timing_place_crit_[clb_net][pin_index_in_net] = pow(clb_pin_crit, crit_exponent);
    }
}

Specifically, this line

cluster_pins_with_modified_criticality_.insert(pins.begin(), pins.end()); 

created the problem. Only sink pins should be added, according to the logic of the timing cost computation. I suppose the logic is the same for non-incremental timing cost computation, so both line #42 and line #72 need to be modified. Regarding how disabling incremental timing bypasses this bug, I'm not sure about the cause, so correct my reasoning above if it is wrong.

I implemented a fix in my local repo that only adds sink pins for criticality computation. It made the bug go away and still passes the vtr_reg_strong tests. I'll open a PR upon the request of anyone in this thread.

@vaughnbetz
Contributor

Nice work finding this so quickly Bill!
I agree that code looks wrong. Perhaps this is also the underlying reason Kevin had to implement a more complex summation (tree summation) when he added incremental timing analysis, in order to get the timing costs close enough to pass the incremental vs. brute-force timing cost consistency check (i.e. perhaps the new summation order masked this bug for the test cases run).
Should definitely be fixed. Please do a QoR run with your fixed code and add the results (spreadsheets of QoR vs. master for vtr and titan designs). Assuming they look good, you should open a PR and fix this.

@litghost
Collaborator Author

I'll open a PR upon the request of anyone in this thread.

If you have a fix, please create a PR as soon as possible. This bug is currently blocking downstream VTR + symbiflow integration, and it would be good to confirm if your fix works or not.

@vaughnbetz
Contributor

Whoops, fat finger. Bill has a fix.

@Bill-hbrhbr
Contributor

Yep, working on it right now.

@litghost
Collaborator Author

litghost commented Jul 10, 2020

@Bill-hbrhbr : We are still seeing:

/tmpfs/src/github/vtr-verilog-to-routing/vpr/src/place/place.cpp:1966 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == comp_td_connection_delay(delay_model, net, ipin)' failed (Connection delays should already be updated).

which I believe you will replicate if you increase the ASSERT level from 2 to 3.

@litghost
Collaborator Author

Maybe the assert needs an additional check, something like "crit_exponent == last_crit_exponent_ && ..."?

@Bill-hbrhbr
Contributor

Looking into it right now.

@litghost
Collaborator Author

litghost commented Jul 10, 2020

The way I see it, the logic you modified does a brute-force retiming if crit_exponent changes. However, if the underlying result is not a function of crit_exponent (e.g. 0 ^ crit_exponent => 0), then the assertion might fail?

@Bill-hbrhbr
Contributor

@litghost I haven't been able to replicate this bug. Is this produced on a different set of benchmarks?

@litghost
Collaborator Author

litghost commented Jul 16, 2020

@Bill-hbrhbr I've been able to replicate the failing assertion issue, here are the latest replication instructions:

1. Get input test data:

  • wget https://storage.googleapis.com/debug_data/replicate2.tar.bz2
  • mkdir vpr_failing_test
  • pushd vpr_failing_test
  • tar -xf replicate2.tar.bz2
  • popd

2. Build Symbiflow VPR at 2d2c0b6:

  • git clone https://github.com/litghost/vtr-verilog-to-routing.git
  • cd vtr-verilog-to-routing
  • git checkout -b test
  • git reset --hard 2d2c0b6490c7914383bf326bee973feac455269b
  • mkdir build
  • cd build
  • cmake -DVTR_ASSERT_LEVEL=3 ..
  • make -j

3. Run failing test case:

  • cd ../../vpr_failing_test
  • cp ../vtr-verilog-to-routing/vpr/vpr .
  • ./run.sh

2d2c0b6 has your earlier fix, so if you change cmake -DVTR_ASSERT_LEVEL=3 .. to cmake -DVTR_ASSERT_LEVEL=2 .. it doesn't assert, and completes without error.

@litghost
Collaborator Author

@vaughnbetz / @Bill-hbrhbr I believe this assertion is not in error, but I'm not totally sure. I'm debugging on my side, let me know if you find anything.

@vaughnbetz
Contributor

Responding to a prior comment: I don't think this assert has anything to do with crit_exponent. The assert checks that the delay of a connection stored in a certain matrix entry matches what we get if we ask for the delay of that connection right now. We're failing that test. It looks like the routine is called in two places: once by the regular (incremental) timing cost update routine, and once by the brute-force routine that walks through all the connections. Bill, if you set a breakpoint on the assert and trace back to which caller it was, which connection it was (inet, ipin) and how different the stored delay is versus the recomputed one that may shed some light on this. Is there anything strange about this connection?

@litghost
Collaborator Author

Is there anything strange about this connection?

Nope! In fact this design is super simple: one inpad and one outpad, connected. I'm working on adding diagnostic prints to see if I can find the issue first.

@vaughnbetz
Contributor

Does the assert only occur on that design, or also on larger ones? Are both I/Os locked in place (nothing to move)? Maybe there's a corner case if we don't move anything.

Or, if it happens on larger designs too, it's not that ...

@litghost
Collaborator Author

Does the assert only occur on that design, or also on larger ones? Are both I/Os locked in place (nothing to move)? Maybe there's a corner case if we don't move anything.

They are not locked. I've already added a print checking a move took place.

@litghost
Collaborator Author

Looks like the bug is in driven_by_moved_block. I'll provide a print trace in a minute that points in that direction.

@litghost
Collaborator Author

litghost commented Jul 17, 2020

This design has 1 net with two pins: 1 (driver) and 0 (sink). You can see in the trace that the placer only moves one block; the incremental logic sees that the sink has been affected, but still skips the update because it believes the driver has also moved. Not sure why just yet:

Moves per temperature: 2
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost

---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
Tnum   Time       T Av Cost Av BB Cost Av TD Cost     CPD       sTNS     sWNS Ac Rate Std Dev  R lim Crit Exp Tot Moves  Alpha
      (sec)                                          (ns)       (ns)     (ns)                                                 
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 0
Recomputing 0[1]
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
   1    0.0 8.8e+00   1.370       0.11 1.2145e-09   1.180      -1.18   -1.180   1.000  0.1962  161.0     1.00         2  0.500
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
Commiting td cost
   2    0.0 4.4e+00   0.647       0.04 1.2145e-09   1.235      -1.23   -1.235   1.000  0.1128  161.0     1.00         4  0.500
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterPinId 1
Driver also moved for 0?
   3    0.0 2.2e+00   1.000       0.02 1.2145e-09   1.235      -1.23   -1.235   0.000  0.0000  161.0     1.00         6  0.950
/usr/local/google/home/keithrothman/cat_x/vtr-verilog-to-routing/vpr/src/place/place.cpp:1987 comp_td_connection_cost: Assertion 'connection_delay[net][ipin] == delay' failed (Connection delays should already be updated, 0[1] 1.21452e-09 != 8.1553e-10).

@litghost
Collaborator Author

It looks like on iteration 1 both the driver and the sink move, which makes sense. However, in further iterations only the sink moves, yet driven_by_moved_block still returns true, causing the timing update to be skipped.

@litghost
Collaborator Author

The bug is in the "PinType" annotation in the clustered netlist. It looks like the driver pin is being marked as a sink.

@litghost
Collaborator Author

I've added an assertion to confirm what I believe is the PinType bug:

      VTR_ASSERT_SAFE_MSG((pin_type == PinType::DRIVER && cluster_ctx.clb_nlist.net_driver(net) == pin) || (pin_type == PinType::SINK && cluster_ctx.clb_nlist.net_driver(net) != pin),
              "Net driver pin should be a DRIVER pin type, or not the driver!"); 

which does throw an assertion violation:

Moves per temperature: 2
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved

---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
Tnum   Time       T Av Cost Av BB Cost Av TD Cost     CPD       sTNS     sWNS Ac Rate Std Dev  R lim Crit Exp Tot Moves  Alpha
      (sec)                                          (ns)       (ns)     (ns)                                                 
---- ------ ------- ------- ---------- ---------- ------- ---------- -------- ------- ------- ------ -------- --------- ------
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId out:do[0] ClusterPinId 0 PinType 1
Driver block 1
Block 0 moved
2. Recomputing net 0[1] block 0[0]
Computing delay 1[1] -> 0[0]
Commiting td cost
Driver block 1
Block 0 moved
blocks_affected.num_moved_blocks 1
Examining net 0
Recomputing ClusterNetId 0 ClusterBlockId di[1] ClusterPinId 1 PinType 1
vtr-verilog-to-routing/vpr/src/place/place.cpp:1625 update_td_delta_costs: Assertion '(pin_type == PinType::DRIVER && cluster_ctx.clb_nlist.net_driver(net) == pin) || (pin_type == PinType::SINK && cluster_ctx.clb_nlist.net_driver(net) != pin)' failed (Net driver pin should be a DRIVER pin type, or not the driver!).

@litghost
Collaborator Author

I added the following sanity check to check_netlist, and it is failing:

diff --git a/vpr/src/base/check_netlist.cpp b/vpr/src/base/check_netlist.cpp
index e0344415e..f60af17a2 100644
--- a/vpr/src/base/check_netlist.cpp
+++ b/vpr/src/base/check_netlist.cpp
@@ -73,6 +73,45 @@ void check_netlist(int verbosity) {
         }
     }
 
+    // Verify clustered netlist relationships and assumptions.
+    for (ClusterNetId net_id : cluster_ctx.clb_nlist.nets()) {
+        ClusterPinId driver_pin = cluster_ctx.clb_nlist.net_driver(net_id);
+        for (ClusterPinId pin : cluster_ctx.clb_nlist.net_pins(net_id)) {
+            if (cluster_ctx.clb_nlist.pin_net(pin) != net_id) {
+                VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                        "Net %zu pin %zu are inconsistently mapped\n", size_t(net_id), size_t(pin));
+            }
+
+            PinType pin_type = cluster_ctx.clb_nlist.pin_type(pin);
+            int pin_net_index = cluster_ctx.clb_nlist.pin_net_index(pin);
+            if(pin == driver_pin) {
+                if(pin_type != PinType::DRIVER) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is the net driver, but pin_type(%d) != PinType::DRIVER",
+                            size_t(pin), size_t(net_id), pin_type);
+                }
+
+                if(pin_net_index != 0) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is the net driver, but pin_net_index(%d) != 0",
+                            size_t(pin), size_t(net_id), pin_net_index);
+                }
+            } else {
+                if(pin_type != PinType::SINK) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is a net sink, but pin_type(%d) != PinType::SINK",
+                            size_t(pin), size_t(net_id), pin_type);
+                }
+
+                if(pin_net_index == 0) {
+                    VPR_FATAL_ERROR(VPR_ERROR_OTHER,
+                            "Pin %zu for net %zu is the net sink, but pin_net_index(%d) == 0",
+                            size_t(pin), size_t(net_id), pin_net_index);
+                }
+            }
+        }
+    }
+
     error += check_for_duplicated_names();
 
     if (error != 0) {

If that check looks good, we should probably merge it, and I'll create a PR.

@litghost
Collaborator Author

litghost commented Jul 17, 2020

I added a call to Netlist::verify and got a similar assertion violation:

diff --git a/vpr/src/base/check_netlist.cpp b/vpr/src/base/check_netlist.cpp
index e0344415e..ac2b7a963 100644
--- a/vpr/src/base/check_netlist.cpp
+++ b/vpr/src/base/check_netlist.cpp
@@ -39,6 +39,7 @@ void check_netlist(int verbosity) {
 
     /* This routine checks that the netlist makes sense. */
     auto& cluster_ctx = g_vpr_ctx.mutable_clustering();
+    cluster_ctx.clb_nlist.verify();
Error 1: 
Type: Unrecognized Error
File: vpr/src/base/netlist.tpp
Line: 1699
Message: Driver pin not found at expected index in net
The entire flow of VPR took 0.41 seconds (max_rss 67.2 MiB)

I think this confirms my earlier investigation. Now to determine where the problem is coming from.

@litghost
Collaborator Author

Something weird is going on:

Detected 0 constant generators (to see names run with higher pack verbosity)
Adding pin 0 to net 0 that is pin_type 0
Adding pin 1 to net 0 that is pin_type 1
Finished loading packed FPGA netlist file (took 0.017435 seconds).
# Load Packing took 0.02 seconds (max_rss 67.6 MiB, delta_rss +36.3 MiB)
Warning 14: Netlist contains 0 global net to non-global architecture pin connections
Warning 15: Logic block #1 (di) has only 1 input pin 'di.inpad[0]' -- the whole block is hanging logic that should be swept.
Error 1: 
Type: Other
File: /usr/local/google/home/keithrothman/cat_x/vtr-verilog-to-routing/vpr/src/base/check_netlist.cpp
Line: 92
Message: Pin 0 for net 0 is the net driver, but pin_type(1) != PinType::DRIVER
The entire flow of VPR took 0.23 seconds (max_rss 67.8 MiB)

So pin 0 is added to net 0, and is pin type = 0 (DRIVER). Then a little while later, pin 0 is pin type 1? Still investigating.

@litghost
Collaborator Author

I've located the bug, making a fix now.

@litghost
Collaborator Author

PR opened for fix. I still need to make a replicating regression test, which I'll start on now.

@vaughnbetz
Contributor

Nice bug hunting on this Keith! You probably already thought of this, but just in case: you can turn on sanitizers in the makefile if you suspect memory corruption.


github-actions bot commented May 9, 2025

This issue has been inactive for a year and has been marked as stale. It will be closed in 15 days if it continues to be stale. If you believe this is still an issue, please add a comment.

@github-actions github-actions bot added the Stale label May 9, 2025

This issue has been marked stale for 15 days and has been automatically closed.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale May 25, 2025
@vaughnbetz
Contributor

Was fixed.
