Skip to content

cortex_m::asm::delay is wrong on anything more complex than an M4 #236

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
cbiffle opened this issue Jul 5, 2020 · 6 comments · Fixed by #312
Closed

cortex_m::asm::delay is wrong on anything more complex than an M4 #236

cbiffle opened this issue Jul 5, 2020 · 6 comments · Fixed by #312
Labels

Comments

@cbiffle
Copy link
Contributor

cbiffle commented Jul 5, 2020

The contract for delay currently states:

Blocks the program for at least n instruction cycles

The current implementation is:

            llvm_asm!("1:
                  nop
                  subs $0, $$1
                  bne.n 1b"
                 : "+r"(_n / 4 + 1)
                 :
                 : "cpsr"
                 : "volatile");

This is approximately an assembly gloss of the code

let mut i = n / 4 + 1;
loop {
    i -= 1;
    if i == 0 { break }
}

Why the divide-by-four? Well, the code appears to be assuming that the minimum amount of time that the instruction sequence could take is four cycles, and so it is taking that into account to minimize the amount of over-delaying. That cycle count is approximately accurate on M3/M4 (it's an underestimate in practice, even in zero-wait-state RAM), but is wrong on M7 or any other part with any combination of nop-folding, branch prediction, and multiple issue. As a result, on such CPUs, this function delays for substantially fewer cycles than advertised -- by roughly a factor of 2, depending on cache and memory system configuration, branch alignment, etc. (It also appears to be wrong on M33, but M33 is not superscalar, so it's probably just an instruction timing issue.)

As CPU core complexity grows, any hand-tuned cycle delay loops need to be specialized for the particular microarchitecture being used, much like in the 486/Pentium era. Alternatively, you could use a conservative implementation that, when asked to loop for at least n cycles, repeats the loop n times -- that should be safe for now, but I wouldn't recommend it, particularly because we've got the zero-overhead-loop extension to ARMv8-M on the horizon.

Shout out to @labbott for noticing this.

@adamgreig
Copy link
Member

Even beyond the microarchitecture or CPU model (e.g. CPUID.PartNo), if the current cache configuration, what memory the code is being read from, branch prediction, etc, all have a strong effect on cycles-per-loop it seems we can't hope to get a good bound here.

Perhaps the most correct solution is to update the documentation to indicate that the method delays for O(n) cycles, or allow user customisation of the scale factor?

@therealprof
Copy link
Contributor

Hm, 4 is indeed an odd choice. I'd probably left the nop out and gone for 2. I don't think you can execute the loop in less than 2 cycles on any current MCU, can you?

@cbiffle
Copy link
Contributor Author

cbiffle commented Jul 5, 2020

Even beyond the microarchitecture or CPU model (e.g. CPUID.PartNo), if the current cache configuration, what memory the code is being read from, branch prediction, etc, all have a strong effect on cycles-per-loop it seems we can't hope to get a good bound here.

Can't get a good upper bound; with knowledge of the microarchitecture you can get a lower bound.

Perhaps the most correct solution is to update the documentation to indicate that the method delays for O(n) cycles

I'm not sure the method would be useful then; if I specify 5000 and it delays for 5000*0.01 cycles, I won't get the results I want. It's currently specified as "at least n" which can be sufficient for a lot of applications (e.g. waiting for an oscillator to stabilize).

(I tried to use this to wait the 100us my SDRAM requires, since I need to start that in pre_init and don't have timers etc. yet. But I'm on an M7, so it's a non-starter.)

Hm, 4 is indeed an odd choice. I'd probably left the nop out and gone for 2. I don't think you can execute the loop in less than 2 cycles on any current MCU, can you?

On M3 with zero wait states this loop gets you 3 + 5*(n-1) cycles for n > 0, since bne is cheaper when not taken on the final iteration than when taken. (I believe it's the same on M4 but I'd need to check my notes.) I'm not sure about the rationale for the nop, but it may have been someone attempting to pad execution time up to 4 cycles for a cheaper divide (since there's the comment about 4 being cheap to divide-by). Unfortunately, that seems to assume that b is a single cycle, which it isn't on M3/4.

The loop as written probably won't take fewer than 2 cycles on any existing Cortex-M CPU. The ARMv8-M zero-overhead loop extension requires slightly different instruction generation (otherwise, it could do this in 1-2 cycle per iteration).

Leaving the nop out would get you 3+4*(n-1) cycles on M3.

Counting cycles is difficult on more complex machines; multiple issue in particular is sensitive to instruction alignment in many cases.

The most portable way to get a cycle delay on Cortex-M is to appropriate the SysTick timer for the purpose, but (1) cortex-m should not assume it has control of that peripheral, and (2) it's not accessible to unprivileged code. The cycle counter in the DWT could also be of use, but it's optional hardware.

To continue supporting this feature, you're likely to need loops that have been measured on various M implementations, and you will likely need to introduce features to indicate which you're using, like the current feature for cm7 r0p1 (runtime detection with CPUID is needless waste in firmware that isn't meant to be binary-portable...plus I'm also 70% certain CPUID isn't visible to unprivileged code, but I'd have to check).

@adamgreig
Copy link
Member

adamgreig commented Jul 5, 2020

I'm not sure the method would be useful then; if I specify 5000 and it delays for 5000*0.01 cycles, I won't get the results I want. It's currently specified as "at least n" which can be sufficient for a lot of applications (e.g. waiting for an oscillator to stabilize).

Fair point.

To continue supporting this feature, you're likely to need loops that have been measured on various M implementations, and you will likely need to introduce features to indicate which you're using, like the current feature for cm7 r0p1 (runtime detection with CPUID is needless waste in firmware that isn't meant to be binary-portable...plus I'm also 70% certain CPUID isn't visible to unprivileged code, but I'd have to check).

CPUID is privileged unfortunately, so that's probably not an option. We don't currently use features for each CPU with the sole exception of cm7-r0p1 for the BASEPRI workaround; we do have flags for each architecture which I suppose we could use for lower bounds, but they would not be especially tight just based on architecture. Adding new features for every CPU just to get a better bound on delay doesn't seem like a worthwhile complexity tradeoff. On the other hand, using architecture alone puts Cortex-M3 and M4 and M7 all in the same class, which is a bit annoying from this point of view.

Edit: perhaps a simpler option would be some a new delay_cm7() method and friends, where the existing delay() is a worst case lower bound using the architecture and delay_cm7() can be an improved bound for CM7-specific code, without having to add lots of new features. Of course, non-CM7 code might call the CM7 version, which would result in shorter-than-expected delays.

@therealprof
Copy link
Contributor

The critical part here is the "at least". The only point of this delay is to provide a hardware independent and timer less delay to be used for initialisation and I think we can easily fulfil that requirement by providing an useful lower bound valid for all MCUs. If accuracy is important use a timer. I was under the impression that 3 cycles per loop iteration was this lower bound (although adding a nop and calling it 4 seems super silly), removing the nopand assuming there're MCUs which can execeute the loop in 2 cycles seems like the most promising way to go forward to me.

@cbiffle
Copy link
Contributor Author

cbiffle commented Jul 8, 2020

The only point of this delay is to provide a hardware independent and timer less delay to be used for initialisation

The API docs on delay do not describe it this way. Might be worth updating.

Point taken. What do you think about #240 to rectify that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants