`cortex_m::asm::delay` is wrong on anything more complex than an M4 #236

cbiffle · 2020-07-05T18:00:58Z

The contract for delay currently states:

Blocks the program for at least n instruction cycles

The current implementation is:

            llvm_asm!("1:
                  nop
                  subs $0, $$1
                  bne.n 1b"
                 : "+r"(_n / 4 + 1)
                 :
                 : "cpsr"
                 : "volatile");

This is approximately an assembly gloss of the code

let mut i = n / 4 + 1;
loop {
    i -= 1;
    if i == 0 { break }
}

Why the divide-by-four? Well, the code appears to be assuming that the minimum amount of time that the instruction sequence could take is four cycles, and so it is taking that into account to minimize the amount of over-delaying. That cycle count is approximately accurate on M3/M4 (it's an underestimate in practice, even in zero-wait-state RAM), but is wrong on M7 or any other part with any combination of nop-folding, branch prediction, and multiple issue. As a result, on such CPUs, this function delays for substantially fewer cycles than advertised -- by roughly a factor of 2, depending on cache and memory system configuration, branch alignment, etc. (It also appears to be wrong on M33, but M33 is not superscalar, so it's probably just an instruction timing issue.)

As CPU core complexity grows, any hand-tuned cycle delay loops need to be specialized for the particular microarchitecture being used, much like in the 486/Pentium era. Alternatively, you could use a conservative implementation that, when asked to loop for at least n cycles, repeats the loop n times -- that should be safe for now, but I wouldn't recommend it, particularly because we've got the zero-overhead-loop extension to ARMv8-M on the horizon.

Shout out to @labbott for noticing this.

The text was updated successfully, but these errors were encountered:

adamgreig · 2020-07-05T21:11:11Z

Even beyond the microarchitecture or CPU model (e.g. CPUID.PartNo), if the current cache configuration, what memory the code is being read from, branch prediction, etc, all have a strong effect on cycles-per-loop it seems we can't hope to get a good bound here.

Perhaps the most correct solution is to update the documentation to indicate that the method delays for O(n) cycles, or allow user customisation of the scale factor?

therealprof · 2020-07-05T21:45:14Z

Hm, 4 is indeed an odd choice. I'd probably left the nop out and gone for 2. I don't think you can execute the loop in less than 2 cycles on any current MCU, can you?

cbiffle · 2020-07-05T22:32:32Z

Even beyond the microarchitecture or CPU model (e.g. CPUID.PartNo), if the current cache configuration, what memory the code is being read from, branch prediction, etc, all have a strong effect on cycles-per-loop it seems we can't hope to get a good bound here.

Can't get a good upper bound; with knowledge of the microarchitecture you can get a lower bound.

Perhaps the most correct solution is to update the documentation to indicate that the method delays for O(n) cycles

I'm not sure the method would be useful then; if I specify 5000 and it delays for 5000*0.01 cycles, I won't get the results I want. It's currently specified as "at least n" which can be sufficient for a lot of applications (e.g. waiting for an oscillator to stabilize).

(I tried to use this to wait the 100us my SDRAM requires, since I need to start that in pre_init and don't have timers etc. yet. But I'm on an M7, so it's a non-starter.)

Hm, 4 is indeed an odd choice. I'd probably left the nop out and gone for 2. I don't think you can execute the loop in less than 2 cycles on any current MCU, can you?

On M3 with zero wait states this loop gets you 3 + 5*(n-1) cycles for n > 0, since bne is cheaper when not taken on the final iteration than when taken. (I believe it's the same on M4 but I'd need to check my notes.) I'm not sure about the rationale for the nop, but it may have been someone attempting to pad execution time up to 4 cycles for a cheaper divide (since there's the comment about 4 being cheap to divide-by). Unfortunately, that seems to assume that b is a single cycle, which it isn't on M3/4.

The loop as written probably won't take fewer than 2 cycles on any existing Cortex-M CPU. The ARMv8-M zero-overhead loop extension requires slightly different instruction generation (otherwise, it could do this in 1-2 cycle per iteration).

Leaving the nop out would get you 3+4*(n-1) cycles on M3.

Counting cycles is difficult on more complex machines; multiple issue in particular is sensitive to instruction alignment in many cases.

The most portable way to get a cycle delay on Cortex-M is to appropriate the SysTick timer for the purpose, but (1) cortex-m should not assume it has control of that peripheral, and (2) it's not accessible to unprivileged code. The cycle counter in the DWT could also be of use, but it's optional hardware.

To continue supporting this feature, you're likely to need loops that have been measured on various M implementations, and you will likely need to introduce features to indicate which you're using, like the current feature for cm7 r0p1 (runtime detection with CPUID is needless waste in firmware that isn't meant to be binary-portable...plus I'm also 70% certain CPUID isn't visible to unprivileged code, but I'd have to check).

adamgreig · 2020-07-05T22:55:36Z

I'm not sure the method would be useful then; if I specify 5000 and it delays for 5000*0.01 cycles, I won't get the results I want. It's currently specified as "at least n" which can be sufficient for a lot of applications (e.g. waiting for an oscillator to stabilize).

Fair point.

To continue supporting this feature, you're likely to need loops that have been measured on various M implementations, and you will likely need to introduce features to indicate which you're using, like the current feature for cm7 r0p1 (runtime detection with CPUID is needless waste in firmware that isn't meant to be binary-portable...plus I'm also 70% certain CPUID isn't visible to unprivileged code, but I'd have to check).

CPUID is privileged unfortunately, so that's probably not an option. We don't currently use features for each CPU with the sole exception of cm7-r0p1 for the BASEPRI workaround; we do have flags for each architecture which I suppose we could use for lower bounds, but they would not be especially tight just based on architecture. Adding new features for every CPU just to get a better bound on delay doesn't seem like a worthwhile complexity tradeoff. On the other hand, using architecture alone puts Cortex-M3 and M4 and M7 all in the same class, which is a bit annoying from this point of view.

Edit: perhaps a simpler option would be some a new delay_cm7() method and friends, where the existing delay() is a worst case lower bound using the architecture and delay_cm7() can be an improved bound for CM7-specific code, without having to add lots of new features. Of course, non-CM7 code might call the CM7 version, which would result in shorter-than-expected delays.

therealprof · 2020-07-05T23:45:19Z

The critical part here is the "at least". The only point of this delay is to provide a hardware independent and timer less delay to be used for initialisation and I think we can easily fulfil that requirement by providing an useful lower bound valid for all MCUs. If accuracy is important use a timer. I was under the impression that 3 cycles per loop iteration was this lower bound (although adding a nop and calling it 4 seems super silly), removing the nopand assuming there're MCUs which can execeute the loop in 2 cycles seems like the most promising way to go forward to me.

cbiffle · 2020-07-08T21:34:58Z

The only point of this delay is to provide a hardware independent and timer less delay to be used for initialisation

The API docs on delay do not describe it this way. Might be worth updating.

Point taken. What do you think about #240 to rectify that?

CC #236 Signed-off-by: Daniel Egger <[email protected]>

240: Better delay description r=jonas-schievink a=therealprof CC #236 Signed-off-by: Daniel Egger <[email protected]> Co-authored-by: Daniel Egger <[email protected]>

jonas-schievink added the bug label Jul 5, 2020

cbiffle mentioned this issue Jul 6, 2020

Proposal for separating cortex-m into smaller parts #239

Open

therealprof added a commit that referenced this issue Jul 8, 2020

Better delay description

ac64691

CC #236 Signed-off-by: Daniel Egger <[email protected]>

therealprof mentioned this issue Jul 8, 2020

Better delay description #240

Merged

bors bot added a commit that referenced this issue Jul 8, 2020

Merge #240

32634e4

240: Better delay description r=jonas-schievink a=therealprof CC #236 Signed-off-by: Daniel Egger <[email protected]> Co-authored-by: Daniel Egger <[email protected]>

glindstedt mentioned this issue Nov 14, 2020

Alternative Delay implementation stm32-rs/stm32h7xx-hal#169

Closed

jonas-schievink mentioned this issue Jan 7, 2021

Fix timing of asm-based delay implementation #312

Merged

bors bot closed this as completed in 86cd463 Jan 7, 2021

adamgreig mentioned this issue Jan 28, 2021

asm::delay blocks for 1.5 times longer on 0.6.5 #325

Open

jordens mentioned this issue Apr 8, 2022

AsmDelay delays are about 1/4 the time they should be quartiq/thermostat-eem#27

Closed

adamgreig mentioned this issue Apr 14, 2022

asm::delay() only blocks half the time intended on an STM32H7 #430

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`cortex_m::asm::delay` is wrong on anything more complex than an M4 #236

`cortex_m::asm::delay` is wrong on anything more complex than an M4 #236

cbiffle commented Jul 5, 2020

adamgreig commented Jul 5, 2020

Uh oh!

therealprof commented Jul 5, 2020

Uh oh!

cbiffle commented Jul 5, 2020 •

edited

Loading

Uh oh!

adamgreig commented Jul 5, 2020 •

edited

Loading

Uh oh!

therealprof commented Jul 5, 2020

Uh oh!

cbiffle commented Jul 8, 2020 •

edited by therealprof

Loading

Uh oh!

cortex_m::asm::delay is wrong on anything more complex than an M4 #236

cortex_m::asm::delay is wrong on anything more complex than an M4 #236

Comments

cbiffle commented Jul 5, 2020

adamgreig commented Jul 5, 2020

Uh oh!

therealprof commented Jul 5, 2020

Uh oh!

cbiffle commented Jul 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adamgreig commented Jul 5, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

therealprof commented Jul 5, 2020

Uh oh!

cbiffle commented Jul 8, 2020 • edited by therealprof Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

`cortex_m::asm::delay` is wrong on anything more complex than an M4 #236

`cortex_m::asm::delay` is wrong on anything more complex than an M4 #236

cbiffle commented Jul 5, 2020 •

edited

Loading

adamgreig commented Jul 5, 2020 •

edited

Loading

cbiffle commented Jul 8, 2020 •

edited by therealprof

Loading