-
Notifications
You must be signed in to change notification settings - Fork 415
Placer reporting cost errors during consistency checks #1402
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm working on identifying when this abort started happening. @acomodi believes it was around when the incremental timing analysis was added. |
I tested a circuit that was a little more complicated (a counter), rather than simple straight inport -> outport, same issue. I'll see how much I can reduce the circuit and still produce the error. |
I also ran into this issue when working on #1205. Disabling incremental timing avoided the error. |
Which graph were you running on? So far I've noticed the placer error showing up on 7-series ROI based graphs, but I would've assumed you would've been testing on either full 7-series graphs or a titan arch. |
@acomodi mentioned this, but if the ASSERT level is increased from 2 to 3 the error changes to:
|
An additional note, the function that provides the driver pin, which causes this assertion error is the following: vtr-verilog-to-routing/vpr/src/place/place.cpp Lines 1862 to 1923 in 6af4a38
|
@litghost I was using the full 50t graph. I assumed that the problem was triggered by my new placement code. |
No I don't believe your new placement code along trigger this issue. However in my limited testing, the full 50T graph did not produce this error, so it possible that ROI or full 50T + new placement triggers the issues. There is some kind of latent issue being uncovered here. |
I did a quick test with HackerFoo@7e35d15 and the counter test on the xc7 50T ROI arch was able to complete if assert level was 2. On assert level 3, I got:
WIll continue investigating this week. |
@kmurray Why is the assertion at vtr-verilog-to-routing/vpr/src/place/place.cpp Line 1963 in 6af4a38
I'm not clear why |
The assertion from here should probably be behind the incremental |
The assumption that pin #0 on a net is the driver is deeply embedded in the netlist data structures. So the assert looks valid. |
@Bill-hbrhbr this may be of interest; hard to debug without being able to run Symbiflow though. |
@vaughnbetz / @Bill-hbrhbr I've added instructions for replicating our failure. I can provide more blif's if that is useful, however the current one is just a single input / output. |
@Bill-hbrhbr : Keith has provided instructions on how to reproduce at the top of this issue -- can you take a look? |
The inherent rule is that the criticalities, delays, and timing costs are only calculated for sink pins, not the driver pins. All the data structures (or at least for Kevin's incremental sta code) are based on this fact, including the PlacerTimingCosts's heap in timing_place.h
Indeed, the bug appeared since place_crit.pins_with_modified_criticality() provided a driver pin in the failed test case and had its timing cost computed, which messed up the heap data structure for calculating the total timing cost (sum of all sink pins' costs). This effect can be generated by a lot of different test cases, and it also depends on how the make_range function orders the elements in the vtr_id_set structure (which in turn affects how the heap is updated). This should not happen. I suspect that the cause is within the following function: vtr-verilog-to-routing/vpr/src/place/timing_place.cpp Lines 31 to 87 in 6af4a38
Specifically, this line
created the problem. Only sink pins should be added according to the logic of timing cost computation. I suppose the logic is the same for non-incremental timing cost computation, so both line #42 and line #72 need to be modified. Regarding how disabling the incremental timing will bypass this bug, I'm not so sure about the cause, so correct my reasonings above if they are wrong. I implemented a fix in my local repo, which only adds sink pins to be computed for criticalities. It made the bug go away and still passed vtr_reg_strong tests. I'll open a PR upon the request of anyone in this thread. |
Nice work finding this so quickly Bill! |
If you have a fix, please create a PR as soon as possible. This bug is currently blocking downstream VTR + symbiflow integration, and it would be good to confirm if your fix works or not. |
Whoops, fat finger. Bill has a fix. |
Yep working on it right now |
@Bill-hbrhbr : We are still seeing:
which I believe you will replicate if you increase the ASSERT level from 2 to 3. |
Maybe there needs to be an additional check to the assert that "crit_exponent == last_crit_exponent_ && ..."? |
looking into it right now |
The way I see it, the logic that you modified does a brute force retiming if |
@litghost I haven't been able to replicate this bug. Is this produced on a different set of benchmarks? |
@Bill-hbrhbr I've been able to replicate the failing assertion issue, here are the latest replication instructions: 1.Get input test data:
2d2c0b6 has your earlier fix, so if you change |
@vaughnbetz / @Bill-hbrhbr I believe this assertion is not in error, but I'm not totally sure. I'm debugging on my side, let me know if you find anything. |
Responding to a prior comment: I don't think this assert has anything to do with crit_exponent. The assert checks that the delay of a connection stored in a certain matrix entry matches what we get if we ask for the delay of that connection right now. We're failing that test. It looks like the routine is called in two places: once by the regular (incremental) timing cost update routine, and once by the brute-force routine that walks through all the connections. Bill, if you set a breakpoint on the assert and trace back to which caller it was, which connection it was (inet, ipin) and how different the stored delay is versus the recomputed one that may shed some light on this. Is there anything strange about this connection? |
Nope! In fact this design is super simple, one |
Does the assert only occur on that design, or also on larger ones? Are both I/Os locked in place (nothing to move)? Maybe there's a corner case if we don't move anything. Or, if it happens on larger designs too, it's not that ... |
They are not locked. I've already added a print checking a move took place. |
Looks like the bug is in |
This design has 1 net, with two pins 1 (driver) and 0 (sink). You can see in the trace that the placer only moves one block, but the incremental logic sees that the sink has been affected, but still skips the update because it believes the driver has also moved. Not sure why just yet:
|
It looks like on iteration 1, both the driver and the sink move, which makes sense. However in further iterations, only the sink moves, but the |
The bug is in the "PinType" annotation in the clustered netlist. It looks like the driver pin is being marked as a sink. |
I've added an assertion to confirm what I believe is the PinType bug:
which does throw an assertion violation:
|
I added the following sanity check to
If that check looks good, we should probably merged that, and I'll create a PR. |
I add a call to
I think this confirms my earlier investigation. Now to determine where the problem is coming from. |
Something weird is going on:
So pin 0 is added to net 0, and is pin type = 0 (DRIVER). Then a little while later, pin 0 is pin type 1? Still investigating. |
I've located the bug, making a fix now. |
PR opened for fix. I still need to make a replicating regression test, which I'll start on now. |
Nice bug hunting on this Keith! You probably already thought of this, but just in case: you can turn on sanitizers in the makefile if you suspect memory corruption. |
This issue has been inactive for a year and has been marked as stale. It will be closed in 15 days if it continues to be stale. If you believe this is still an issue, please add a comment. |
This issue has been marked stale for 15 days and has been automatically closed. |
Was fixed. |
Uh oh!
There was an error while loading. Please reload this page.
Expected Behaviour
Placer places circuit and writes placement file.
Current Behaviour
Placer aborts with the following error:
Possible Solution
Working on reduced test case. Circuit in this case was very simple, multiple inpad -> outpad connections.
Steps to Reproduce
1.Get input test data:
wget https://storage.googleapis.com/debug_data/vpr_failing_test.tar.bz2
tar -xf vpr_failing_test.tar.bz2
git clone https://github.com/litghost/vtr-verilog-to-routing.git
cd vtr-verilog-to-routing
git checkout -b test
git reset --hard 87c4d426a5096dc6b8ab7e9adb0c6a39b2154bd2
mkdir build
cd build
cmake -DVTR_ASSERT_LEVEL=3 ..
make -j
cd ../../vpr_failing_test
cp ../vtr-verilog-to-routing/vpr/vpr .
./run.sh
Context
Your Environment
The text was updated successfully, but these errors were encountered: