[Router] Fixed the Segfault Bug in Parallel Connection Router #3094
Fixed #3029.
Bug description
The segfault happens on the `strong_multiclock` regression test, either during a routing attempt (never the first attempt) in a min-W binary search or right after an attempt.
Why the bug occurs
Short version
The sub threads (or helper threads) used by the parallel connection router are not properly terminated during the router object's destruction. Instead, they remain alive and continue accessing invalid data (pointers).
Detailed version
The parallel connection router manages a set of sub threads, which it creates in detached mode during class construction. This means the threads can continue to exist even after the router (including the router-owned thread objects) has been destructed.
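To make the "detached mode" point concrete, here is a minimal sketch of how `std::thread` behaves in standard C++. The function names are hypothetical and not taken from the VTR codebase. Once `detach()` is called, the handle no longer refers to the thread, so the owner has no way to wait for it.

```cpp
#include <thread>

// Hypothetical helper: a freshly constructed std::thread still owns its
// thread of execution, so the creator can wait for it with join().
bool joinable_while_owned() {
    std::thread helper([] {});
    bool joinable = helper.joinable();
    helper.join();  // must join (or detach) before the handle is destroyed
    return joinable;
}

// Hypothetical helper: after detach(), the handle is empty and the thread
// runs on its own; nothing can join it afterwards.
bool joinable_after_detach() {
    std::thread helper([] {});
    helper.detach();
    return helper.joinable();
}
```

The detached thread here does no work and touches no shared state, which is what keeps this sketch safe. A detached helper that keeps using members of its owner has no such guarantee.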
When the parallel connection router is destructed (e.g., after completing one routing attempt in a min-W search), we use an atomic flag variable, `ParallelConnectionRouter::is_router_destroying_`, to signal the sub threads to exit, as shown in L11 below. Both the main thread and the sub threads then synchronize at a thread barrier to ensure the sub threads are ready to be terminated. After that, the sub threads are expected to read the flag and terminate themselves.
However, in some cases, the previous code shown below (simplified from the codebase) could cause problems.
If the main thread has already destroyed the router object, including `is_router_destroying_`, at time T1, then when the sub threads read `is_router_destroying_` at time T2 (which is after T1), they get a zero value (L18) and assume the router has not been destroyed (L20, i.e., a new connection routing will just start) after syncing at the barrier. The sub threads then try to dereference invalid pointers (L21), which causes a segfault.
Other interesting and useful facts about this issue (which make sense given the above explanation):
The bug shows up on tests (e.g., `strong_multiclock`) in which the main thread finishes object destruction very quickly.
Solution
Switched from detaching the helper threads to joining them in the parallel connection router, so that the helper threads terminate before the main thread destroys the parallel connection router object.
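The join-on-destruction pattern can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual router code: `HelperPool`, `helper_loop`, `is_destroying_`, and `count_helpers_exited_after_destruction` are hypothetical names, and the sleep stands in for real routing work.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the router; it owns joinable helper threads.
class HelperPool {
  public:
    HelperPool(std::size_t num_helpers, std::atomic<int>& exited)
        : exited_(exited) {
        for (std::size_t i = 0; i < num_helpers; ++i) {
            // Keep the std::thread handle instead of calling detach().
            helpers_.emplace_back([this] { helper_loop(); });
        }
    }

    ~HelperPool() {
        // Signal the helpers to stop, then wait for each one to return.
        is_destroying_.store(true, std::memory_order_release);
        for (std::thread& helper : helpers_) {
            assert(helper.joinable());
            helper.join();
        }
        // Only now are the members destroyed; no helper can still read them.
    }

    HelperPool(const HelperPool&) = delete;
    HelperPool& operator=(const HelperPool&) = delete;

  private:
    void helper_loop() {
        while (!is_destroying_.load(std::memory_order_acquire)) {
            // Placeholder for real work.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
        exited_.fetch_add(1, std::memory_order_relaxed);
    }

    std::atomic<int>& exited_;                // counts helpers that returned
    std::atomic<bool> is_destroying_{false};  // stop flag read by the helpers
    std::vector<std::thread> helpers_;
};

// Returns how many helpers had exited by the time the destructor finished.
int count_helpers_exited_after_destruction(std::size_t num_helpers) {
    std::atomic<int> exited{0};
    {
        HelperPool pool(num_helpers, exited);
    }  // ~HelperPool() joins every helper here
    return exited.load();
}
```

Because `join()` returns only after the helper function has finished, every helper has stopped touching the object by the time its members are freed. This is the guarantee that detached threads cannot give.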
Verification
The solution has been verified locally on wintermute and also in the debug PR #3085.