[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

@lukeg101

Description

Consider the following litmus test:

C SB

{ int128_t *x = 0; int128_t *y = 0; }

P0 (_Atomic __int128 *x, _Atomic __int128 *y) {
  atomic_store_explicit(x,1,memory_order_seq_cst);
  __int128 r0 = atomic_load_explicit(y,memory_order_seq_cst);
}

P1 (_Atomic __int128 *x, _Atomic __int128 *y) {
  atomic_store_explicit(y,1,memory_order_seq_cst);
  __int128 r0 = atomic_load_explicit(x,memory_order_seq_cst);
}

exists (P0:r0=0 /\ P1:r0=0)

where P0:r0=0 means that thread P0's local variable r0 has the value 0.
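
For reference, here is a minimal sketch of the same store-buffering test as a standalone C11 program (the pthread harness and the names p0/p1/r0_p0/r0_p1 are illustrative and not part of the original litmus file; building it may require libatomic depending on the toolchain):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic __int128 x, y;   /* both zero-initialised, matching the litmus initial state */
static __int128 r0_p0, r0_p1;   /* per-thread observations */

static void *p0(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_seq_cst);
    r0_p0 = atomic_load_explicit(&y, memory_order_seq_cst);
    return arg;
}

static void *p1(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_seq_cst);
    r0_p1 = atomic_load_explicit(&x, memory_order_seq_cst);
    return arg;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, p0, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* Under the C/C++ model the outcome r0_p0 == 0 && r0_p1 == 0 is forbidden. */
    printf("P0:r0=%d P1:r0=%d\n", (int)r0_p0, (int)r0_p1);
    return 0;
}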

Compiling either P0 or P1 for armv8-a or armv8.4-a works as expected. When this test is simulated under the C/C++ memory model from its initial state, the outcome in the exists clause is forbidden by the source model. The allowed outcomes are:

{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

However, consider compiling P0 to target armv8.4-a (https://godbolt.org/z/dxTrbGxoG) using clang trunk (dmb ish; stp; dmb ish; ldp; dmb ish), the store on P1 to target armv8.0-a using clang (ldaxp; stlxp; cbnz loop), and the load on P1 to target armv8.4-a using clang (ldp; dmb ish). The resulting assembly is as follows:

P0:
  MOV X6, #1
  DMB ISH
  STP X5, X6, [reg_containing_x]
  DMB ISH
  LDP X2, X1, [reg_containing_y] 
  DMB ISH
 
P1:
loop:
  LDAXP X5, X6, [reg_containing_y]
  STLXP W4, X5, X6, [reg_containing_y]
  CBNZ W4, loop
  LDP X2, X1, [reg_containing_x]
  DMB ISH

The compiled program has the following outcomes when simulated under the AArch64 model (renaming P0:X1 to P0:r0 and P1:X1 to P1:r0 to match the source-level outcomes):

{ P0:r0=0; P1:r0=0; } <--- Forbidden by source model, bug!
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }

This is because the effects of the LDP on P1 can be reordered before the effects of the STLXP on P1, since there is no leading DMB barrier to prevent the reordering.

Since there is no barrier, we propose to fix the bug by adding one before the LDP:

DMB ISH; LDP; DMB ISH

This prevents the buggy outcome under the AArch64 memory model.
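
As a sketch, P1 with the proposed fix applied to the SC load would look like the following (register names follow the listing above; this is illustrative, not exact clang output):

P1 (with proposed fix):
loop:
  LDAXP X5, X6, [reg_containing_y]
  STLXP W4, X5, X6, [reg_containing_y]
  CBNZ W4, loop
  DMB ISH                          // added leading barrier before the SC LDP
  LDP X2, X1, [reg_containing_x]
  DMB ISH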

Besides using a DMB, it is feasible to use LDAR, making the SC LDP stronger so that it no longer reorders with the STLXP. Note it is also possible to make the STLXP stronger by adding a DMB after the loop (but that is not what is done for other atomic sizes).

I have validated this bug while discussing it with Wilco from Arm's compiler teams.

This bug would not have been caught in normal execution; it manifests only when code compiled for multiple implementations is mixed together.
