Added Customized Heap and Occupancy Profiling for MQ #6
Conversation
Hi Hang, these changes look good. I have left some comments that should be resolved. Specifically, I would like you to IFDEF the tryPopWithPrio function so we can turn it ON/OFF. For convenience, can you please make it OFF by default.
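For illustration, a minimal sketch of the kind of compile-time switch being requested; the macro name is hypothetical and the function signature is elided, not taken from the codebase:

```cpp
// Hypothetical macro name; leaving it undefined keeps the feature OFF by default.
// #define MQ_ENABLE_TRY_POP_WITH_PRIO

#ifdef MQ_ENABLE_TRY_POP_WITH_PRIO
// tryPopWithPrio(...) is only compiled in when the switch is explicitly enabled.
bool tryPopWithPrio(/* parameters as in the actual implementation */);
#endif // MQ_ENABLE_TRY_POP_WITH_PRIO
```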
The deterministic version with the queue draining (min-prio) optimization has been tested and works perfectly with Dijkstra settings (i.e., …).
Hi Hang, I reviewed again. Was this code tested with the options turned both on and off? I found an undeclared variable, which should not have compiled.
@ueqri These results make sense. For the draining to be deterministic, the ordering heuristic needs to be an under-estimate (exactly like the pruning heuristic). When astar_fac is 0.0, the ordering heuristic is guaranteed to be an under-estimate; however, to make the A* version an under-estimate, you need --astar_offset set to some value that turns the heuristic into an under-estimate. Since the option is off by default we can do this in a different PR, but I recommend cherry-picking this commit from VTR upstream: verilog-to-routing#2601. We will need to modify some code in the parallel connection router to allow this. Once we have astar_offset, we need to ensure that the ordering and pruning heuristics are both under-estimates in order to make this change deterministic. This in itself is a trade-off and will require testing.
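For reference, a minimal sketch of how an offset could pull the ordering heuristic back to an under-estimate when astar_fac > 0. The names and the exact formula are assumptions, not the implementation from verilog-to-routing#2601:

```cpp
#include <algorithm>

// Illustrative only: subtracting a large-enough astar_offset from the lookahead
// estimate biases it downward, restoring the under-estimate property that the
// deterministic draining relies on; clamping at zero keeps the heuristic
// non-negative.
float ordering_cost(float backward_cost, float expected_cost,
                    float astar_fac, float astar_offset) {
    float heuristic = std::max(0.0f, expected_cost - astar_offset);
    return backward_cost + astar_fac * heuristic;
}
```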
Added a customized heap with an indexing-from-one optimization and the ability to drain/clear the heap directly. Added a heap occupancy profiling method to gain insight into the MQ heap occupancy and workload.
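For reference, a minimal sketch of the one-indexed binary-heap layout mentioned above (illustrative, not the customized heap code itself):

```cpp
#include <cstddef>

// With the root stored at index 1 instead of 0, the parent/child index math
// reduces to simple shifts, which is the usual motivation for indexing from one.
inline std::size_t heap_parent(std::size_t i) { return i >> 1; }       // i / 2
inline std::size_t heap_left(std::size_t i)   { return i << 1; }       // 2 * i
inline std::size_t heap_right(std::size_t i)  { return (i << 1) + 1; } // 2 * i + 1
```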
Added detailed comments for configuring the core affinity list in the parallel router.
Fixed issues in PR#6 and upgraded the CPS submodule.
Force-pushed from c87723f to c713e0b.
Fixed a bug (basically, a typo) in the Multi-Queue wrapper class.
Fixed a misspelled function name in the Multi-Queue wrapper class.
The deterministic Dijkstra mode of the parallel router now works with all the newly introduced optimizations, including (1) core affinity, (2) the VPR binary heap (indexing from one), and (3) queue draining (with min-prio) for MQ pop.
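As an aside, a minimal sketch of the idea behind the min-prio queue draining; the names are hypothetical and this is not the MultiQueue code:

```cpp
// Illustrative sketch: once the smallest priority remaining in the queues can no
// longer improve on the best cost already found for the current sink, every
// queued entry is stale, so the queues can be drained in one shot rather than
// popped and discarded one at a time. This is only safe (and deterministic) when
// the ordering heuristic under-estimates the true remaining cost, as discussed
// earlier in this thread.
template <typename Queues>
void maybe_drain(Queues& queues, float best_sink_cost) {
    if (queues.min_priority() >= best_sink_cost) {
        queues.clear();  // direct draining
    }
}
```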
Using an astar_offset helps tune the ordering heuristic for the shortest-path search over the routing graph. It is also needed to ensure that the heuristic is an under-estimate without setting astar_fac to 0.0.
Added `astar_offset` for the parallel router based on the cherry-picked commit (ada43a5) from VPR master. With this change and a suitable value for `astar_offset` (ensuring the ordering heuristic is an under-estimate), the deterministic A* mode of the parallel router works.
The final version is ready, and deterministic A* works perfectly with these changes. The commit (e7108) may be confusing: it addresses a weird issue which I spent a long time investigating and still don't understand. The parent commit (fa8ba) cannot guarantee determinism; only after applying this commit (e7108) does the deterministic version work, which doesn't quite make sense given the changes in that commit. However, that's the only workaround I found to make determinism work.
I ran into an interesting issue when using core affinity. When running multiple instances of the parallel router at the same time, they all get pinned to the same exact cores, which greatly reduces their performance. This is to be expected, but it's odd since it is the default behaviour. I had to disable core affinity for my quick tests.
There is a workaround for this case: we can wrap the launches of the parallel routers in different shell scripts and specify a different core affinity list in each script, which only affects that script's run, so that different instances are pinned to different sets of cores. Ultimately, though, the configuration should be moved into the CLI options instead of the environment variables.
That makes sense! We can update our testing script to be aware of this; that's a very good point. In the meantime I will keep core affinity off (just for my testing), but will turn it on once the testing script is updated.
Added four command-line options for the MQ-based parallel router:
(1) `--multi_queue_num_threads <# threads>`
(2) `--multi_queue_num_queues <# queues>`
(3) `--multi_queue_direct_draining <on/off>`
(4) `--thread_affinity <off (no affinity, leave scheduling to the OS) or a list of CPU core IDs (the first one is for the main thread), e.g., 0,1,2,3 or 0-3 or 0-1,2-3 or 0,1-2,3>`
Force-pushed from 199205d to 825b8be.
@AlexandreSinger Hey Alex, I just added the four command-line options for the MQ-based parallel router (as listed above).
All the options are tested and work as expected.
Hi @ueqri, these changes look good to me. I have modified the testing script so it works with these changes locally and tested them on VTR_chain. I have not verified determinism, since I assume it should still be deterministic, but I have verified that at least the queue draining is working.

I have started a full test suite on Koios_large which will run all night. It will test Directed (with draining), A* (without draining), A* (with draining), and Dijkstra's (with draining), all on 12 threads; I expect this run to complete within 24 hours. I will then be able to compare the results with the results I have collected previously. If everything turns out green we can merge these changes and lock the code. Thanks for helping with this change.
@ueqri The performance test finished for Koios Large, at 12 threads with 4 queues per thread. This is being compared to a run I did last week on SAVI4, while this one was run on SAVI3. Perhaps changing machines caused a slight difference, but I doubt it.

Since A* without direct draining slowed down by around 9%, it appears as though just the act of attempting to drain the queues may be causing some issues. I wonder if one particular line in the MQ may be the culprit: perhaps it should be memory_order_relaxed, but to be honest I am not sure. It's very likely that this extra compare in Pop (which is a very, very hot function) could be causing the small degradation.

In all honesty we should just merge this since it tells a good story and it's not worth busting our heads over this any further. We could hyper-tune the implementation and squeeze out performance by changing how the if/else blocks are taken, but it's not worth our time right now. I will merge this in; perhaps in the future we can better understand why this caused such a difference.
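For context, a minimal sketch of the relaxed-versus-default atomic load being discussed; the flag name and surrounding logic are illustrative, not the actual MultiQueue member:

```cpp
#include <atomic>

// Hypothetical flag checked on every pop to decide whether draining was
// requested; the real MQ code may differ.
std::atomic<bool> draining_requested{false};

bool should_attempt_drain() {
    // A plain load() defaults to memory_order_seq_cst, which can add fencing
    // cost on a very hot path. If the flag is only a hint and no ordering with
    // other data is required at this check, a relaxed load is cheaper:
    return draining_requested.load(std::memory_order_relaxed);
}
```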
LGTM, good job on this.
Merged commit 1bc0f64 into feature-fine-grained-parallel-router.
Note: To assign core affinity (i.e., pin threads to specific cores), please set the environment variable, e.g., `export VPR_CORE_AFFINITY=0,2-8`. Formats such as `0,1,2,3,4,5,6,7`, `0-7`, `0-3,4-7`, and `0,1-2,3-6,7` are all supported. For multi-socket systems (i.e., with more than one NUMA node), please run the `numactl` command for VPR execution, e.g., `numactl --cpunodebind=<NUMA node> --membind=<NUMA node> vpr_wrapper.sh` (or simply `numactl --cpunodebind=0 --membind=0 bash` and run VPR or the testing scripts inside the new bash).