-
Notifications
You must be signed in to change notification settings - Fork 415
Update BinaryHeap to FourAryHeap #2627
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update BinaryHeap to FourAryHeap #2627
Conversation
…ich contains node cost to prevent unnecessary dereferecing, and turn on/off alignment for that struct
…to-routing into improve_binary_heap
…herited class of KAryHeap
…herited class of KAryHeap
…to-routing into improve_binary_heap
…to-routing into improve_binary_heap
…heap implementation
This reverts commit c58ca70.
@vaughnbetz I'm wondering about two things:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I suggest some more comments in various places, as listed in the detailed feedback.
You'll have to check out the CI results. I look at vtr_nightly_test1_odin, and it is OK. A few small circuits have QoR changes that are beyond the tolerances, but for these small circuits they are not concerning. One has too good a minimum channel width, while the others have ones that are a bit worse but not concerning for small designs. You update the golden results and check them in (see the developer guide) to make those pass. Below are the tests from vtr_nightly_test1_odin as an example. Amin was also going to widen the error tolerance for some of these tests; I'll check on that PR as rebasing after it is merged may save you some work. You should also attach your QoR spreadsheets so we can see the results (which you have summarized) on big circuits. We'd want to see a speedup and no overall critical path increase or wirelength increase (or sub-1% so it is within noise). I'm fine with the current code and the small binary heap file -- more smaller files is fine. 2024-06-21T20:47:50.7138774Z �[32;1m20:47:50�[0m | Calculating QoR results... |
These issues should now be fixed |
9de4f45
to
d580ab1
Compare
8ef8272
to
d56143b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the code and QoR look good.
Forgot to answer this before, but it would be good to add a regtest for the binary heap to make sure it still works. It can be a single circuit and small.
We should add the nice Doxygen documentation to the VPR Internals as a new tab (Router heap). Can you also make the Doxygen config changes to make that happen?
|
||
size_t num_children = std::min((size_t)4, heap_tail_ - child_1); | ||
|
||
switch (num_children) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is faster than a simple loop or slower than a loop? We can merge this code, but I suggest running an experiment with this replaced by a loop and see if it is as fast or faster; it would be easier to read if there is no speed gain from this if nest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a loop is 3% slower; is this small enough to be worth changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let's leave as is then. If you can add a comment giving this result that would be good though.
I added the heap section to the API documentation. However, I think this instruction may be outdated or incorrect; I could not run it successfully ( |
7c98cc9
to
d906ea8
Compare
@vaughnbetz All tests now passing |
A new FourAryHeap has been added, and this has been made the default VPR heap implementation.
Description
A new abstract class
KAryHeap
has been created. This class is like the previousBinaryHeap
, except it doesn't implementis_valid()
,get_heap_head()
, orsift_down()
. Also, a new structheap_elem
has been added, which contains at_heap
pointer and a floatcost
for a certain node. This is now what theheap_
vector contains, instead of justt_heap
pointers. InKAryHeap
's methods, instead of dereferencing at_heap
pointer to get its cost for node comparisons, it is stored inheap_elem
.BinaryHeap
andFourAryHeap
now inherit fromKAryHeap
, and implement its pure virtual methods (all other methods are not specific to either of these two heap types).FOUR_ARY_HEAP
has been added to thee_heap_type
enum, and has been made the default heap implementation.Motivation and Context
heap_
vector containst_heap
pointers for each node in the heap. To get a node's cost, the respective pointer would be dereferenced, which increase runtime since this would occur whenever there was a comparison between two nodes. Now, instead ofheap_
being made of these pointers, it is of a structheap_elem
which contains both the pointer and the node's cost. This cost is attained by dereferencing the pointer once inadd_to_heap()
, eliminating the need for dereferencing in future comparisons.heap_elem
is 12 bytes (pointer and float), which means 5 can fit on a 64-byte cache line; so, using a 4-ary heap instead of a binary heap is more cache-friendly, and improves runtime even with more comparisons (between 4 nodes vs 2).For the future: if, instead of using
t_heap
pointers, a 4-byte ID can be used,heap_elem
can be reduced to 8 bytes. Then, an 8-ary heap might be even better, even with many more comparisons. An 8-ary heap is fastest when usingt_heap
pointers inheap_
(8 can fit on a cache line) and this heap implementation as VPR's heap.How Has This Been Tested?
Three changes were considered:
heap_elem
instead oft_heap
pointers in theheap_
vector to eliminate unnecessary dereferencingThe 8 combinations of these changes were tested on the
koios_medium
benchmarks, and some combinations ontitan
benchmarks, on desktop. For smaller circuits, the results are inconsistent, but ontitan
, the both changes (1.) and (2.) result in runtime improvements. Change (3.) seems to have a negligible effect on runtime. Using changes (1.) and (2.) together results in a 4.2% reduction in runtime.On titan benchmarks (all values are geomeans, averaged across 8 runs):
max_rss
)Full Results
Types of changes
Checklist: