[Router] Fixed the Segfault Bug in Parallel Connection Router #3094

ueqri · 2025-05-29T20:53:35Z

Bug description

Segfault happens on the strong_multiclock regression test either in a routing attempt (never in the first attempt) in a min-W binary search or right after an attempt.

Why the bug occurs

Short version

The sub threads (or helper threads) used by the parallel connection router are not properly terminated during the router object's destruction. Instead, they remain alive and continue accessing invalid data (pointers).

Detailed version

The parallel connection router manages a set of sub threads, which it creates in detached mode during class construction. This means the threads can continue to exist even after the router (including the router-owned thread objects) have been destructed.

When the parallel connection router is destructed (e.g., after completing one routing attempt in a min-W search), we use an atomic flag variable ParallelConnectionRouter::is_router_destroying_ to signal the sub threads to exit, as shown in L11 below. Both the main thread and sub threads then synchronize at a thread barrier to ensure the sub threads are ready to be terminated. After that, the sub threads are expected to read the flag and terminate themselves.

However, in some cases, the previous code shown below (simplified from codebase) could cause problems.

Let's assume at time T0, both main (L12) and sub threads (L17) hit the barrier.
Then main thread (L13) finishes destruction very quickly at time T1, cleanup all member variables, including is_router_destroying_.
Meanwhile, when sub threads try to access is_router_destroying_ at time T2 (which is after T1), they get a zero value (L18) and assume the router has not been destroyed (L20, i.e., a new connection routing will just start) after syncing at the barrier. The sub threads then try to dereference invalid pointers (L21), which will cause segfault.

Other interesting and useful facts about this issue (which makes sense given the above explanation):

This issue frequently (and probably only) happens on very small circuits (e.g. strong_multiclock) in which the main thread finishes object destruction very quickly.
Also, this issue cannot be detected by ThreadSanitizer (TSan) since the "data race" between the main and helper threads is essentially caused by C++ implicit object destruction in the main thread rather than any explicit memory access.

Solution

Switched from detaching helper threads to joining threads in parallel connection router to ensure that helper threads terminate before main thread destroys the parallel connection router object.

Verification

The solution has been verified locally on wintermute and also on in the debug PR #3085.

Fixed verilog-to-routing#3029. Switched from detaching helper threads to joining threads in parallel connection router to ensure that helper threads terminate before main thread destroys the parallel connection router object.

AlexandreSinger

LGTM, thanks @ueqri !

We should keep this code in the back of our minds as we continue to work on the Multi-grain parallel router. If we end up constructing and destructing the parallel router often, it may hurt us to halt the main thread until all sub-threads are joined.

In the future, if we find this to be an issue, we could just have the sub-threads wait at another barrier after checking is_router_destroying. It may sound like this is useless, but I wonder if the time it takes to destroy the thread may start to cause some overhead.

Just something to think about.

ueqri requested review from vaughnbetz and AlexandreSinger May 29, 2025 20:54

github-actions bot added VPR VPR FPGA Placement & Routing Tool lang-cpp C/C++ code labels May 29, 2025

ueqri self-assigned this May 29, 2025

AlexandreSinger approved these changes May 29, 2025

View reviewed changes

AlexandreSinger merged commit 7933a6c into verilog-to-routing:master May 29, 2025
32 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Router] Fixed the Segfault Bug in Parallel Connection Router #3094

[Router] Fixed the Segfault Bug in Parallel Connection Router #3094

Uh oh!

ueqri commented May 29, 2025

Uh oh!

AlexandreSinger left a comment

Uh oh!

Uh oh!

Uh oh!

[Router] Fixed the Segfault Bug in Parallel Connection Router #3094

[Router] Fixed the Segfault Bug in Parallel Connection Router #3094

Uh oh!

Conversation

ueqri commented May 29, 2025

Bug description

Why the bug occurs

Short version

Detailed version

Solution

Verification

Uh oh!

AlexandreSinger left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!