Added Customized Heap and Occupancy Profiling for MQ #6

Conversation

ueqri
Collaborator

@ueqri ueqri commented Jul 2, 2024

  1. Added core affinity support for the parallel router.
  2. Added a customized heap with an indexing-from-one optimization and the ability to drain/clear the heap directly (see the sketch after this list).
  3. Added a heap occupancy profiling method to gain insight into the MQ heap occupancy and workload.
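
For reference, here is a minimal sketch (not the code in this PR) of a one-indexed binary min-heap with a direct clear, illustrating item 2 above; the element type and method names are placeholders:

```cpp
#include <utility>
#include <vector>

// Hedged sketch of a 1-indexed binary min-heap.  With 1-based indexing the
// parent of slot i is i / 2 and its children are 2*i and 2*i + 1, which
// avoids the +1/-1 adjustments of a 0-based layout.
template <typename T>
class OneIndexedMinHeap {
  public:
    OneIndexedMinHeap() : data_(1) {}  // slot 0 is unused padding

    bool empty() const { return data_.size() == 1; }

    void push(T val) {
        data_.push_back(std::move(val));
        // Sift up: parent of i is simply i / 2.
        for (size_t i = data_.size() - 1; i > 1 && data_[i] < data_[i / 2]; i /= 2)
            std::swap(data_[i], data_[i / 2]);
    }

    T pop() {  // precondition: !empty()
        T top = std::move(data_[1]);
        data_[1] = std::move(data_.back());
        data_.pop_back();
        // Sift down: children of i are 2*i and 2*i + 1.
        for (size_t i = 1;;) {
            size_t l = 2 * i, r = 2 * i + 1, best = i;
            if (l < data_.size() && data_[l] < data_[best]) best = l;
            if (r < data_.size() && data_[r] < data_[best]) best = r;
            if (best == i) break;
            std::swap(data_[i], data_[best]);
            i = best;
        }
        return top;
    }

    // "Drain/clear directly": discard every pending entry at once instead of
    // popping one element at a time.
    void clear() { data_.resize(1); }

  private:
    std::vector<T> data_;  // data_[0] is never used
};
```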

Note: To assign core affinity (i.e., pin threads to specific cores), please set the environment variable VPR_CORE_AFFINITY, e.g., export VPR_CORE_AFFINITY=0,2-8. Formats such as 0,1,2,3,4,5,6,7, 0-7, 0-3,4-7, and 0,1-2,3-6,7 are all supported.
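
For illustration, here is a minimal sketch of how such an affinity list could be expanded into individual core IDs (this is not the actual parser used by the router; names are assumptions):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hedged sketch: expand "0,2-8", "0-7", "0,1-2,3-6,7", etc. into core IDs.
std::vector<int> parse_affinity_list(const std::string& spec) {
    std::vector<int> cores;
    std::istringstream ss(spec);
    std::string token;
    while (std::getline(ss, token, ',')) {
        size_t dash = token.find('-');
        if (dash == std::string::npos) {
            cores.push_back(std::stoi(token));          // single core, e.g. "0"
        } else {
            int lo = std::stoi(token.substr(0, dash));  // range, e.g. "2-8"
            int hi = std::stoi(token.substr(dash + 1));
            for (int c = lo; c <= hi; ++c)
                cores.push_back(c);
        }
    }
    return cores;
}
```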

For multi-socket systems (i.e., with more than one NUMA node), please run VPR under the numactl command. For example, numactl --cpunodebind=<NUMA node> --membind=<NUMA node> vpr_wrapper.sh (or simply numactl --cpunodebind=0 --membind=0 bash and run VPR or the testing scripts inside the new bash).
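
Once the core list is known, each thread can be pinned on Linux with the POSIX affinity API; the sketch below is only illustrative (the router may pin threads through a different mechanism):

```cpp
#include <pthread.h>
#include <sched.h>

// Hedged sketch: pin the calling thread to a single core.  Returns true on
// success; failure typically means the core is not available to the process
// (e.g. it was excluded by numactl or a cgroup cpuset).
bool pin_current_thread_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) == 0;
}
```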

Owner

@AlexandreSinger AlexandreSinger left a comment


Hi Hang, these changes look good. I have left some comments that should be resolved. Specifically, I would like you to wrap the tryPopWithPrio function in an IFDEF so we can turn it ON/OFF. For convenience, can you please make it OFF by default.
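
For reference, a minimal sketch of such a compile-time guard; the macro name and surrounding code are placeholders rather than the real MQ API, and leaving the macro commented out keeps the feature OFF by default:

```cpp
#include <iostream>

// Uncomment to enable the min-prio draining pop path.
// #define PARALLEL_ROUTER_USE_TRY_POP_WITH_PRIO

void pop_one() {
#ifdef PARALLEL_ROUTER_USE_TRY_POP_WITH_PRIO
    std::cout << "using tryPopWithPrio (draining enabled)\n";
#else
    std::cout << "using plain tryPop (default)\n";
#endif
}

int main() {
    pop_one();
    return 0;
}
```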

@ueqri
Collaborator Author

ueqri commented Jul 4, 2024

The deterministic version with the queue-draining (min-prio) optimization has been tested and works perfectly with Dijkstra settings (i.e., --astar_fac 0.0 --post_target_prune_fac 0.0 --post_target_prune_offset 0.0) on the vtr_chain_largest benchmark. However, the deterministic version with A* settings does not work well, probably due to invalid factor and offset values.

Owner

@AlexandreSinger AlexandreSinger left a comment


Hi Hang, I reviewed again. Was this code tested with the options turned on and off? I found an undeclared variable, which means the code should not have compiled.

@AlexandreSinger
Owner

The deterministic version with the queue-draining (min-prio) optimization has been tested and works perfectly with Dijkstra settings (i.e., --astar_fac 0.0 --post_target_prune_fac 0.0 --post_target_prune_offset 0.0) on the vtr_chain_largest benchmark. However, the deterministic version with A* settings does not work well, probably due to invalid factor and offset values.

@ueqri These results make sense. For the draining to be deterministic, the ordering heuristic needs to be an under-estimate (exactly like the pruning heuristic). When astar_fac is 0.0, the ordering heuristic is guaranteed to be an under-estimate. However, to make the A* version an under-estimate, you need to set --astar_offset to a value such that the heuristic becomes an under-estimate. Since the option is off by default we can do this in a different PR, but I recommend cherry-picking this commit from VTR upstream: verilog-to-routing#2601

We will need to modify some code in the parallel connection router to allow this. Once we have astar_offset we need to ensure that the ordering and pruning heuristics are both under-estimates in order to make this change deterministic. This in itself is a tradeoff and will require testing.
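
For context, here is a hedged sketch of how astar_fac and astar_offset might enter the ordering priority; the formula and names below are assumptions for illustration, not the actual VPR cost code:

```cpp
// Hypothetical ordering priority for a heap entry.
float ordering_priority(float backward_cost,
                        float expected_cost_to_target,
                        float astar_fac,
                        float astar_offset) {
    // astar_fac == 0.0 degenerates to Dijkstra, which is trivially an
    // under-estimate.  For astar_fac > 0, subtracting a small astar_offset
    // is what can pull a slightly over-estimating lookahead back below the
    // true remaining cost, restoring the under-estimate property that the
    // deterministic draining relies on.
    return backward_cost + astar_fac * expected_cost_to_target - astar_offset;
}
```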

ueqri added 3 commits July 4, 2024 16:21
Added a customized heap with indexing from one optimization and the
ability to drain/clear the heap directly.

Added a heap occupancy profiling method to gain insight into the MQ heap
occupancy and workload.
Added detailed comments for configuring the core affinity list in the
parallel router.
Fixed issues in PR#6 and upgraded the CPS submodule.
@ueqri ueqri force-pushed the parallel-router-customized-heap-with-clear-method branch from c87723f to c713e0b on July 4, 2024 20:24
ueqri and others added 5 commits July 4, 2024 16:25
Fixed a bug (basically, a typo) in the Multi-Queue wrapper class.
Fixed a typo issue of a misspelled function name in the Multi-Queue
wrapper class.
The deterministic Dijkstra of parallel router is now working with all
the newly-introduced optimizations, including (1) core affinity, (2) VPR
binary heap (indexing from one), and (3) queue draining (with min-prio)
for MQ pop.
Using an astar_offset can help better tune the ordering heuristic for
the search used to find the shortest path in the routing graph. It is
also necessary to ensure that the heuristic is an underestimate (without
setting the astar_fac to 0.0).
Added `astar_offset` for parallel router based on the cherry-picked
commit (ada43a5) on VPR master. With this change and setting a suitable
value for `astar_offset` (ensuring the ordering heuristic being under-
estimated), the deterministic A-Star of parallel router works.
@ueqri
Collaborator Author

ueqri commented Jul 6, 2024

The final version is ready. Deterministic A* works perfectly with --astar_fac 1.0 --astar_offset 3.05e-10 --post_target_prune_fac 1.0 --post_target_prune_offset 3.05e-10 on vtr_chain (verilog-to-routing#2601 was cherry-picked). Some other settings (e.g., deterministic Dijkstra/A*, non-deterministic Directed) with/without min-prio queue draining have all been tested on MCNC and work as expected.

The commit (e7108) may be confusing: it addresses a strange issue that I spent a long time investigating and still cannot explain. The parent commit (fa8ba) cannot guarantee determinism; only after applying this commit (e7108) does the deterministic version work, which does not quite make sense given the changes in that commit. However, it is the only workaround I found that makes determinism hold.

@AlexandreSinger
Owner

I ran into an interesting issue when using core affinity. When running multiple instances of the parallel router at the same time, they all get locked to the same exact cores. This greatly reduces their performance.

This is to be expected, but it's odd since it is the default behaviour. I had to disable core affinity for my quick tests.

@ueqri
Collaborator Author

ueqri commented Jul 8, 2024

When running multiple instances of the parallel router at the same time, they all get locked to the same exact cores. This greatly reduces their performance.

There is a workaround for this case: we can wrap the launch of each parallel router in a separate shell script. In each script, we specify the core affinity list, which only affects that particular run, so that different instances can be pinned to different sets of cores.

However, ultimately this configuration should be exposed through CLI options instead of environment variables.

@AlexandreSinger
Owner

When running multiple instances of the parallel router at the same time, they all get locked to the same exact cores. This greatly reduces their performance.

There is a workaround for this case: we can wrap the launch of each parallel router in a separate shell script. In each script, we specify the core affinity list, which only affects that particular run, so that different instances can be pinned to different sets of cores.

However, ultimately this configuration should be exposed through CLI options instead of environment variables.

That makes sense! We can update our testing script to be aware of this! That's a very good point.

In the meantime I will keep core affinity off (just for my testing), but will turn it on once the testing script is updated.

Added four command-line options for MQ-based parallel router:
(1) `--multi_queue_num_threads <# threads>`
(2) `--multi_queue_num_queues <# queues>`
(3) `--multi_queue_direct_draining <on/off>`
(4) `--thread_affinity <off (meaning no affinity, leave OS schedule) or
    set a list of CPU core ID (the first one is for the main thread),
    e.g., 0,1,2,3 or 0-3 or 0-1,2-3 or 0,1-2,3>`
@ueqri ueqri force-pushed the parallel-router-customized-heap-with-clear-method branch from 199205d to 825b8be on July 12, 2024 02:28
@ueqri
Collaborator Author

ueqri commented Jul 12, 2024

@AlexandreSinger Hey Alex, just added four command-line options for the MQ-based parallel router:

  • --multi_queue_num_threads <# threads>
  • --multi_queue_num_queues <# queues>
  • --multi_queue_direct_draining <on/off>
  • --thread_affinity <off (meaning no affinity; leave scheduling to the OS) or a list of CPU core IDs (the first one is for the main thread), e.g., 0,1,2,3 or 0-3 or 0-1,2-3 or 0,1-2,3>

All the options are tested and work as expected.

@AlexandreSinger
Owner

AlexandreSinger commented Jul 12, 2024

@AlexandreSinger Hey Alex, just added four command-line options for the MQ-based parallel router:

  • --multi_queue_num_threads <# threads>
  • --multi_queue_num_queues <# queues>
  • --multi_queue_direct_draining <on/off>
  • --thread_affinity <off (meaning no affinity; leave scheduling to the OS) or a list of CPU core IDs (the first one is for the main thread), e.g., 0,1,2,3 or 0-3 or 0-1,2-3 or 0,1-2,3>

All the options are tested and work as expected.

Hi @ueqri , these changes look good to me. I have modified the testing script to work with these changes locally and tested them on VTR_chain. I have not verified determinism since I assume it should still be deterministic, but I have verified that at least the queue draining is working. I have started a full test suite on Koios_large, which will run all night. It will test Directed (with draining), A* (without draining), A* (with draining), and Dijkstra's (with draining), all on 12 threads; I expect this run to complete within 24 hours. I will then be able to compare the results with those I have collected previously. If everything turns out green we can merge these changes and lock the code.

Thanks for helping with this change.

@AlexandreSinger
Owner

@ueqri The performance test finished for Koios Large.

At 12 threads, 4 queues per thread:
Directed: Improved by 8.6%
A* (without direct draining): Slowed by 8.83%
A* (with direct draining): Slowed by 30%
Dijkstra's: Slowed by 7.5%

This is being compared to a run I did last week on SAVI4. This was run on SAVI3. Perhaps changing machines caused a slight difference, but I doubt it.

Since A* without direct draining slowed down by around 9%, it appears as though just the act of attempting to drain the queues may be causing some issues. I wonder if this line in the MQ may be the culprit:
[screenshot of the atomic priority compare in the MQ pop path]

I wonder if this should be memory_order_relaxed, but to be honest I am not sure. It's very likely that this extra compare in Pop (which is a very hot function) could be causing the small degradation.
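
For context, a hedged sketch (not the actual MultiQueue code) of the kind of atomic compare being discussed and where the memory-order choice enters:

```cpp
#include <atomic>
#include <cstdint>
#include <limits>

// Hypothetical gate consulted on every pop attempt (a very hot path).
struct MinPrioGate {
    std::atomic<uint64_t> min_prio{std::numeric_limits<uint64_t>::max()};

    bool should_skip(uint64_t candidate_prio) const {
        // The default seq_cst load imposes the strongest ordering; a
        // memory_order_relaxed load would only guarantee atomicity of the
        // read.  Whether relaxed is safe here depends on what the
        // surrounding queue operations already synchronize.
        return candidate_prio > min_prio.load(std::memory_order_acquire);
    }
};
```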

In all honesty we should just merge this since it tells a good story and it's not worth busting our heads over this any further. We could hypertune the implementation and squeeze out performance by changing how the if and else blocks are taken, but it's not worth our time right now. I will merge this in; perhaps in the future we can better understand why this caused such a difference.

Owner

@AlexandreSinger AlexandreSinger left a comment


LGTM, good job on this.

@AlexandreSinger AlexandreSinger merged commit 1bc0f64 into feature-fine-grained-parallel-router Jul 13, 2024
1 check passed