Added Customized Heap and Occupancy Profiling for MQ #6

Conversation

ueqri
Collaborator

@ueqri ueqri commented Jul 2, 2024

  1. Added core affinity support for the parallel router.
  2. Added a customized heap with an indexing-from-one optimization and the ability to drain/clear the heap directly (see the sketch after this list).
  3. Added a heap occupancy profiling method to gain insight into the MQ heap occupancy and workload.
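
For reference, here is a minimal sketch (not the code in this PR) of a one-indexed binary min-heap with a direct clear, illustrating item 2 above; the element type and method names are placeholders:

```cpp
#include <utility>
#include <vector>

// Hedged sketch of a 1-indexed binary min-heap.  With 1-based indexing the
// parent of slot i is i / 2 and its children are 2*i and 2*i + 1, which
// avoids the +1/-1 adjustments of a 0-based layout.
template <typename T>
class OneIndexedMinHeap {
  public:
    OneIndexedMinHeap() : data_(1) {}  // slot 0 is unused padding

    bool empty() const { return data_.size() == 1; }

    void push(T val) {
        data_.push_back(std::move(val));
        // Sift up: parent of i is simply i / 2.
        for (size_t i = data_.size() - 1; i > 1 && data_[i] < data_[i / 2]; i /= 2)
            std::swap(data_[i], data_[i / 2]);
    }

    T pop() {  // precondition: !empty()
        T top = std::move(data_[1]);
        data_[1] = std::move(data_.back());
        data_.pop_back();
        // Sift down: children of i are 2*i and 2*i + 1.
        for (size_t i = 1;;) {
            size_t l = 2 * i, r = 2 * i + 1, best = i;
            if (l < data_.size() && data_[l] < data_[best]) best = l;
            if (r < data_.size() && data_[r] < data_[best]) best = r;
            if (best == i) break;
            std::swap(data_[i], data_[best]);
            i = best;
        }
        return top;
    }

    // "Drain/clear directly": discard every pending entry at once instead of
    // popping one element at a time.
    void clear() { data_.resize(1); }

  private:
    std::vector<T> data_;  // data_[0] is never used
};
```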

Note: To assign core affinity (i.e., pin threads to specific cores), please set the environment variable VPR_CORE_AFFINITY, e.g., export VPR_CORE_AFFINITY=0,2-8. Formats such as 0,1,2,3,4,5,6,7, 0-7, 0-3,4-7, and 0,1-2,3-6,7 are all supported.
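
For illustration, here is a minimal sketch of how such an affinity list could be expanded into individual core IDs (this is not the actual parser used by the router; names are assumptions):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hedged sketch: expand "0,2-8", "0-7", "0,1-2,3-6,7", etc. into core IDs.
std::vector<int> parse_affinity_list(const std::string& spec) {
    std::vector<int> cores;
    std::istringstream ss(spec);
    std::string token;
    while (std::getline(ss, token, ',')) {
        size_t dash = token.find('-');
        if (dash == std::string::npos) {
            cores.push_back(std::stoi(token));          // single core, e.g. "0"
        } else {
            int lo = std::stoi(token.substr(0, dash));  // range, e.g. "2-8"
            int hi = std::stoi(token.substr(dash + 1));
            for (int c = lo; c <= hi; ++c)
                cores.push_back(c);
        }
    }
    return cores;
}
```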

For multi-socket systems (i.e., with more than one NUMA node), please run VPR under the numactl command. For example, numactl --cpunodebind=<NUMA node> --membind=<NUMA node> vpr_wrapper.sh (or simply numactl --cpunodebind=0 --membind=0 bash and run VPR or the testing scripts inside the new bash).
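
Once the core list is known, each thread can be pinned on Linux with the POSIX affinity API; the sketch below is only illustrative (the router may pin threads through a different mechanism):

```cpp
#include <pthread.h>
#include <sched.h>

// Hedged sketch: pin the calling thread to a single core.  Returns true on
// success; failure typically means the core is not available to the process
// (e.g. it was excluded by numactl or a cgroup cpuset).
bool pin_current_thread_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) == 0;
}
```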

Owner

@AlexandreSinger AlexandreSinger left a comment


Hi Hang, these changes look good. I have left some comments that should be resolved. Specifically, I would like you to wrap the tryPopWithPrio function in an IFDEF so we can turn it ON/OFF. For convenience, can you please make it OFF by default.
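
For reference, a minimal sketch of such a compile-time guard; the macro name and surrounding code are placeholders rather than the real MQ API, and leaving the macro commented out keeps the feature OFF by default:

```cpp
#include <iostream>

// Uncomment to enable the min-prio draining pop path.
// #define PARALLEL_ROUTER_USE_TRY_POP_WITH_PRIO

void pop_one() {
#ifdef PARALLEL_ROUTER_USE_TRY_POP_WITH_PRIO
    std::cout << "using tryPopWithPrio (draining enabled)\n";
#else
    std::cout << "using plain tryPop (default)\n";
#endif
}

int main() {
    pop_one();
    return 0;
}
```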

@ueqri
Collaborator Author

ueqri commented Jul 4, 2024

The deterministic version with the queue-draining (min-prio) optimization has been tested and works perfectly with Dijkstra settings (i.e., --astar_fac 0.0 --post_target_prune_fac 0.0 --post_target_prune_offset 0.0) on the vtr_chain_largest benchmark. However, the deterministic version with A* settings does not work well, probably due to invalid factor and offset values.

Owner

@AlexandreSinger AlexandreSinger left a comment


Hi Hang, I reviewed again. Was this code tested with the options turned on and off? I found an undeclared variable, which means the code should not have compiled.

@AlexandreSinger
Owner

The deterministic version with the queue-draining (min-prio) optimization has been tested and works perfectly with Dijkstra settings (i.e., --astar_fac 0.0 --post_target_prune_fac 0.0 --post_target_prune_offset 0.0) on the vtr_chain_largest benchmark. However, the deterministic version with A* settings does not work well, probably due to invalid factor and offset values.

@ueqri These results make sense. For the draining to be deterministic, the ordering heuristic needs to be an under-estimate (exactly like the pruning heuristic). When astar_fac is 0.0, the ordering heuristic is guaranteed to be an under-estimate. However, to make the A* version an under-estimate, you need to set --astar_offset to a value such that the heuristic becomes an under-estimate. Since the option is off by default we can do this in a different PR, but I recommend cherry-picking this commit from VTR upstream: verilog-to-routing#2601

We will need to modify some code in the parallel connection router to allow this. Once we have astar_offset we need to ensure that the ordering and pruning heuristics are both under-estimates in order to make this change deterministic. This in itself is a tradeoff and will require testing.
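
For context, here is a hedged sketch of how astar_fac and astar_offset might enter the ordering priority; the formula and names below are assumptions for illustration, not the actual VPR cost code:

```cpp
// Hypothetical ordering priority for a heap entry.
float ordering_priority(float backward_cost,
                        float expected_cost_to_target,
                        float astar_fac,
                        float astar_offset) {
    // astar_fac == 0.0 degenerates to Dijkstra, which is trivially an
    // under-estimate.  For astar_fac > 0, subtracting a small astar_offset
    // is what can pull a slightly over-estimating lookahead back below the
    // true remaining cost, restoring the under-estimate property that the
    // deterministic draining relies on.
    return backward_cost + astar_fac * expected_cost_to_target - astar_offset;
}
```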

ueqri added 3 commits July 4, 2024 16:21
Added a customized heap with indexing from one optimization and the
ability to drain/clear the heap directly.

Added a heap occupancy profiling method to gain insight into the MQ heap
occupancy and workload.
Added detailed comments for configuring the core affinity list in the
parallel router.
Fixed issues in PR#6 and upgraded the CPS submodule.
@ueqri ueqri force-pushed the parallel-router-customized-heap-with-clear-method branch from c87723f to c713e0b on July 4, 2024 20:24
ueqri and others added 5 commits July 4, 2024 16:25
Fixed a bug (basically, a typo) in the Multi-Queue wrapper class.
Fixed a typo issue of a misspelled function name in the Multi-Queue
wrapper class.
The deterministic Dijkstra of parallel router is now working with all
the newly-introduced optimizations, including (1) core affinity, (2) VPR
binary heap (indexing from one), and (3) queue draining (with min-prio)
for MQ pop.
Using an astar_offset can help better tune the ordering heuristic for
the search used to find the shortest path in the routing graph. It is
also necessary to ensure that the heuristic is an underestimate (without
setting the astar_fac to 0.0).
Added `astar_offset` for parallel router based on the cherry-picked
commit (ada43a5) on VPR master. With this change and setting a suitable
value for `astar_offset` (ensuring the ordering heuristic being under-
estimated), the deterministic A-Star of parallel router works.
@ueqri
Collaborator Author

ueqri commented Jul 6, 2024

The final version is ready. Deterministic A* works perfectly with --astar_fac 1.0 --astar_offset 3.05e-10 --post_target_prune_fac 1.0 --post_target_prune_offset 3.05e-10 on vtr_chain (verilog-to-routing#2601 was cherry-picked). Some other settings (e.g., deterministic Dijkstra/A*, non-deterministic Directed) with/without min-prio queue draining have all been tested on MCNC and work as expected.

The commit (e7108) may be confusing: it addresses a strange issue that I spent a long time investigating and still cannot explain. The parent commit (fa8ba) cannot guarantee determinism; only after applying this commit (e7108) does the deterministic version work, which does not quite make sense given the changes in that commit. However, it is the only workaround I found that makes determinism hold.

@AlexandreSinger
Owner

I ran into an interesting issue when using core affinity. When running multiple instances of the parallel router at the same time, they all get locked to the same exact cores. This greatly reduces their performance.

This is to be expected, but it's odd since it is the default behaviour. I had to disable core affinity for my quick tests.

@ueqri
Collaborator Author

ueqri commented Jul 8, 2024

When running multiple instances of the parallel router at the same time, they all get locked to the same exact cores. This greatly reduces their performance.

There is a workaround for this case: we can wrap the launch of each parallel router in a separate shell script. In each script, we specify the core affinity list, which only affects that particular run, so that different instances can be pinned to different sets of cores.

However, ultimately this configuration should be exposed through CLI options instead of environment variables.

@AlexandreSinger
Owner

When running multiple instances of the parallel router at the same time, they all get locked to the same exact cores. This greatly reduces their performance.

There is a workaround for this case: we can wrap the launch of each parallel router in a separate shell script. In each script, we specify the core affinity list, which only affects that particular run, so that different instances can be pinned to different sets of cores.

However, ultimately this configuration should be exposed through CLI options instead of environment variables.

That makes sense! We can update our testing script to be aware of this! That's a very good point.

In the meantime I will keep core affinity off (just for my testing), but will turn it on once the testing script is updated.

Added four command-line options for MQ-based parallel router:
(1) `--multi_queue_num_threads <# threads>`
(2) `--multi_queue_num_queues <# queues>`
(3) `--multi_queue_direct_draining <on/off>`
(4) `--thread_affinity <off (meaning no affinity, leave OS schedule) or
    set a list of CPU core ID (the first one is for the main thread),
    e.g., 0,1,2,3 or 0-3 or 0-1,2-3 or 0,1-2,3>`
@ueqri ueqri force-pushed the parallel-router-customized-heap-with-clear-method branch from 199205d to 825b8be on July 12, 2024 02:28
@ueqri
Collaborator Author

ueqri commented Jul 12, 2024

@AlexandreSinger Hey Alex, just added four command-line options for the MQ-based parallel router:

  • --multi_queue_num_threads <# threads>
  • --multi_queue_num_queues <# queues>
  • --multi_queue_direct_draining <on/off>
  • --thread_affinity <off (meaning no affinity; leave scheduling to the OS) or a list of CPU core IDs (the first one is for the main thread), e.g., 0,1,2,3 or 0-3 or 0-1,2-3 or 0,1-2,3>

All the options are tested and work as expected.

@AlexandreSinger
Owner

AlexandreSinger commented Jul 12, 2024

@AlexandreSinger Hey Alex, just added four command-line options for the MQ-based parallel router:

  • --multi_queue_num_threads <# threads>
  • --multi_queue_num_queues <# queues>
  • --multi_queue_direct_draining <on/off>
  • --thread_affinity <off (meaning no affinity; leave scheduling to the OS) or a list of CPU core IDs (the first one is for the main thread), e.g., 0,1,2,3 or 0-3 or 0-1,2-3 or 0,1-2,3>

All the options are tested and work as expected.

Hi @ueqri , these changes look good to me. I have modified the testing script to work with these changes locally and tested them on VTR_chain. I have not verified determinism since I assume it should still be deterministic, but I have verified that at least the queue draining is working. I have started a full test suite on Koios_large, which will run all night. It will test Directed (with draining), A* (without draining), A* (with draining), and Dijkstra's (with draining), all on 12 threads; I expect this run to complete within 24 hours. I will then be able to compare the results with those I have collected previously. If everything turns out green we can merge these changes and lock the code.

Thanks for helping with this change.

@AlexandreSinger
Owner

@ueqri The performance test finished for Koios Large.

At 12 threads, 4 queues per thread:
Directed: Improved by 8.6%
A* (without direct draining): Slowed by 8.83%
A* (with direct draining): Slowed by 30%
Dijkstra's: Slowed by 7.5%

This is being compared to a run I did last week on SAVI4. This was run on SAVI3. Perhaps changing machines caused a slight difference, but I doubt it.

Since A* without direct draining slowed down by around 9%, it appears as though just the act of attempting to drain the queues may be causing some issues. I wonder if this line in the MQ may be the culprit:
[screenshot of the atomic priority compare in the MQ pop path]

I wonder if this should be memory_order_relaxed, but to be honest I am not sure. It's very likely that this extra compare in Pop (which is a very hot function) could be causing the small degradation.
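
For context, a hedged sketch (not the actual MultiQueue code) of the kind of atomic compare being discussed and where the memory-order choice enters:

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

// Hypothetical gate consulted on every pop attempt (a very hot path).
struct MinPrioGate {
    std::atomic<uint64_t> min_prio{std::numeric_limits<uint64_t>::max()};

    bool should_skip(uint64_t candidate_prio) const {
        // The default seq_cst load imposes the strongest ordering; a
        // memory_order_relaxed load would only guarantee atomicity of the
        // read.  Whether relaxed is safe here depends on what the
        // surrounding queue operations already synchronize.
        return candidate_prio > min_prio.load(std::memory_order_acquire);
    }
};
```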

In all honesty we should just merge this since it tells a good story and it's not worth busting our heads over this any further. We could hypertune the implementation and squeeze out performance by changing how the if and else blocks are taken, but it's not worth our time right now. I will merge this in; perhaps in the future we can better understand why this caused such a difference.

Owner

@AlexandreSinger AlexandreSinger left a comment


LGTM, good job on this.

@AlexandreSinger AlexandreSinger merged commit 1bc0f64 into feature-fine-grained-parallel-router Jul 13, 2024
1 check passed