Description
Consider the following litmus test:
C SB
{ int128_t *x = 0; int128_t *y = 0}
P0 (_Atomic __int128 *x, _Atomic __int128 *y) {
atomic_store_explicit(x,1,memory_order_seq_cst);
__int128 r0 = atomic_load_explicit(y,memory_order_seq_cst);
}
P1 (_Atomic __int128 *x, _Atomic __int128 *y) {
atomic_store_explicit(y,1,memory_order_seq_cst);
__int128 r0 = atomic_load_explicit(x,memory_order_seq_cst);
}
exists (P0:r0=0 /\ P1:r0 = 0)
where P0:r0 = 0
means thread P0
, local variable r0
has value 0
.
Building either P0/P1 for v8-a or v8.4 passes as expected. When simulating this test under the C/C++ model from its initial state, the outcome of execution in the exists clause is forbidden by the source model. The allowed outcomes are:
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }
However when compiling P0
, to target armv8.4-a (https://godbolt.org/z/dxTrbGxoG) using clang trunk (dmb ish; stp; dmb ish; ldp; dmb ish
), compiling the store
on P1
to target armv8.0-a using clang (ldaxp;stlxp;cbnz
loop), and the load
on P1
to target armv8.4-a (ldp;dmb ish
) using clang. When compiled the assembly is as follows:
P0:
MOV X6, #1
DMB ISH
STP X5, X6 [reg_containing_x]
DMB ISH
LDP X2, X1, [reg_containing_y]
DMB ISH
P1:
loop:
LDAXP X5, X6, [reg_containing_y]
STLXP W4, X5, X6, [reg_containing_y]
CBNZ W4, loop
LDP X2, X1, [reg_containing_x]
DMB ISH
The compiled program has the following outcomes when simulated under the AArch64 model (rename P0:X1 to P0:r0 and P1:X1 to P1:r0 to match with source outcomes):
{ P0:r0=0; P1:r0=0; } <--- Forbidden by source model, bug!
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }
which is due to the fact the effects of LDP
on P1
can be reordered before the effects of STLXP
on P1
since there is no leading DMB
barrier to prevent the reordering.
Since there is no barrier, we propose to fix the bug by adding said barrier before LDP
:
DMB ISH; LDP; DMB ISH
Which prevents the buggy outcome under the AArch64 memory model.
Besides using a DMB
, it is feasible to use LDAR
- making the SC LDP
stronger so it no longer reorders with STLXP.
Note it is also possible to make STLXP
stronger by adding a DMB
after the loop (but that's not what is done for other atomic sizes)
I have validated this bug whilst discussing with Wilco from Arm's compiler teams.
This bug would not have been caught in normal execution, but only when multiple implementations are mixed together.