[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed

Consider the following litmus test:
```
C SB

{ int128_t *x = 0; int128_t *y = 0}

P0 (_Atomic __int128 *x, _Atomic __int128 *y) {
  atomic_store_explicit(x,1,memory_order_seq_cst);
  __int128 r0 = atomic_load_explicit(y,memory_order_seq_cst);
}

P1 (_Atomic __int128 *x, _Atomic __int128 *y) {
  atomic_store_explicit(y,1,memory_order_seq_cst);
  __int128 r0 = atomic_load_explicit(x,memory_order_seq_cst);
}

exists (P0:r0=0 /\ P1:r0 = 0)

```
where `P0:r0 = 0` means thread `P0`, local variable `r0` has value `0`.

Building either P0/P1 for v8-a or v8.4 passes as expected. When simulating this test under the C/C++ model from its initial state, the outcome of execution in the exists clause is forbidden by the source model. The allowed outcomes are:
```
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }
```

However when compiling `P0`, to target armv8.4-a (https://godbolt.org/z/dxTrbGxoG) using clang trunk (`dmb ish; stp; dmb ish; ldp; dmb ish`), compiling the `store` on `P1` to target armv8.0-a using clang (`ldaxp;stlxp;cbnz` loop), and the `load` on `P1` to target armv8.4-a (`ldp;dmb ish`) using clang. When compiled the assembly is as follows:
```
P0:
  MOV X6, #1
  DMB ISH
  STP X5, X6 [reg_containing_x]
  DMB ISH
  LDP X2, X1, [reg_containing_y] 
  DMB ISH
 
P1:
loop:
  LDAXP X5, X6, [reg_containing_y]
  STLXP W4, X5, X6, [reg_containing_y]
  CBNZ W4, loop
  LDP X2, X1, [reg_containing_x]
  DMB ISH
```

The compiled program has the following outcomes when simulated under the AArch64 model (rename P0:X1 to P0:r0 and P1:X1 to P1:r0 to match with source outcomes):

```
{ P0:r0=0; P1:r0=0; } <--- Forbidden by source model, bug!
{ P0:r0=0; P1:r0=1; }
{ P0:r0=1; P1:r0=0; }
{ P0:r0=1; P1:r0=1; }
```
which is due to the fact the effects of `LDP` on `P1` can be reordered before the effects of `STLXP` on `P1` since there is no leading `DMB` barrier to prevent the reordering.

Since there is no barrier, we propose to fix the bug by adding said barrier before `LDP`:
```
DMB ISH; LDP; DMB ISH
```
Which prevents the buggy outcome under the AArch64 memory model.

Besides using a `DMB`, it is feasible to use `LDAR` - making the SC `LDP` stronger so it no longer reorders with `STLXP.` Note it is also possible to make `STLXP` stronger by adding a `DMB` after the loop (but that's not what is done for other atomic sizes)

I have validated this bug whilst discussing with Wilco from Arm's compiler teams.

This bug would not have been caught in normal execution, but only when multiple implementations are mixed together.




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[AArch64]: 128-bit Sequentially Consistent load allows reordering before prior store when armv8 and armv8.4 implementations are Mixed #81978

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions