Skip to content

[Router] Fixed the Segfault Bug in Parallel Connection Router #3094

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

ueqri
Copy link
Contributor

@ueqri ueqri commented May 29, 2025

Fixed #3029.

Bug description

Segfault happens on the strong_multiclock regression test either in a routing attempt (never in the first attempt) in a min-W binary search or right after an attempt.

Why the bug occurs

Short version

The sub threads (or helper threads) used by the parallel connection router are not properly terminated during the router object's destruction. Instead, they remain alive and continue accessing invalid data (pointers).

Detailed version

The parallel connection router manages a set of sub threads, which it creates in detached mode during class construction. This means the threads can continue to exist even after the router (including the router-owned thread objects) have been destructed.

When the parallel connection router is destructed (e.g., after completing one routing attempt in a min-W search), we use an atomic flag variable ParallelConnectionRouter::is_router_destroying_ to signal the sub threads to exit, as shown in L11 below. Both the main thread and sub threads then synchronize at a thread barrier to ensure the sub threads are ready to be terminated. After that, the sub threads are expected to read the flag and terminate themselves.

However, in some cases, the previous code shown below (simplified from codebase) could cause problems.

  • Let's assume at time T0, both main (L12) and sub threads (L17) hit the barrier.
  • Then main thread (L13) finishes destruction very quickly at time T1, cleanup all member variables, including is_router_destroying_.
  • Meanwhile, when sub threads try to access is_router_destroying_ at time T2 (which is after T1), they get a zero value (L18) and assume the router has not been destroyed (L20, i.e., a new connection routing will just start) after syncing at the barrier. The sub threads then try to dereference invalid pointers (L21), which will cause segfault.

image

Other interesting and useful facts about this issue (which makes sense given the above explanation):

  • This issue frequently (and probably only) happens on very small circuits (e.g. strong_multiclock) in which the main thread finishes object destruction very quickly.
  • Also, this issue cannot be detected by ThreadSanitizer (TSan) since the "data race" between the main and helper threads is essentially caused by C++ implicit object destruction in the main thread rather than any explicit memory access.

Solution

Switched from detaching helper threads to joining threads in parallel connection router to ensure that helper threads terminate before main thread destroys the parallel connection router object.

Verification

The solution has been verified locally on wintermute and also on in the debug PR #3085.

Fixed verilog-to-routing#3029.

Switched from detaching helper threads to joining threads in parallel
connection router to ensure that helper threads terminate before main
thread destroys the parallel connection router object.
@github-actions github-actions bot added VPR VPR FPGA Placement & Routing Tool lang-cpp C/C++ code labels May 29, 2025
@ueqri ueqri self-assigned this May 29, 2025
Copy link
Contributor

@AlexandreSinger AlexandreSinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @ueqri !

We should keep this code in the back of our minds as we continue to work on the Multi-grain parallel router. If we end up constructing and destructing the parallel router often, it may hurt us to halt the main thread until all sub-threads are joined.

In the future, if we find this to be an issue, we could just have the sub-threads wait at another barrier after checking is_router_destroying. It may sound like this is useless, but I wonder if the time it takes to destroy the thread may start to cause some overhead.

Just something to think about.

@AlexandreSinger AlexandreSinger merged commit 7933a6c into verilog-to-routing:master May 29, 2025
32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang-cpp C/C++ code VPR VPR FPGA Placement & Routing Tool
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Parallel Router] Random strong test failures
2 participants