-
Notifications
You must be signed in to change notification settings - Fork 168
cortex_m::asm::delay
is wrong on anything more complex than an M4
#236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Even beyond the microarchitecture or CPU model (e.g. Perhaps the most correct solution is to update the documentation to indicate that the method delays for O(n) cycles, or allow user customisation of the scale factor? |
Hm, 4 is indeed an odd choice. I'd probably left the |
Can't get a good upper bound; with knowledge of the microarchitecture you can get a lower bound.
I'm not sure the method would be useful then; if I specify 5000 and it delays for 5000*0.01 cycles, I won't get the results I want. It's currently specified as "at least n" which can be sufficient for a lot of applications (e.g. waiting for an oscillator to stabilize). (I tried to use this to wait the 100us my SDRAM requires, since I need to start that in
On M3 with zero wait states this loop gets you The loop as written probably won't take fewer than 2 cycles on any existing Cortex-M CPU. The ARMv8-M zero-overhead loop extension requires slightly different instruction generation (otherwise, it could do this in 1-2 cycle per iteration). Leaving the nop out would get you Counting cycles is difficult on more complex machines; multiple issue in particular is sensitive to instruction alignment in many cases. The most portable way to get a cycle delay on Cortex-M is to appropriate the SysTick timer for the purpose, but (1) To continue supporting this feature, you're likely to need loops that have been measured on various M implementations, and you will likely need to introduce features to indicate which you're using, like the current feature for cm7 r0p1 (runtime detection with CPUID is needless waste in firmware that isn't meant to be binary-portable...plus I'm also 70% certain CPUID isn't visible to unprivileged code, but I'd have to check). |
Fair point.
CPUID is privileged unfortunately, so that's probably not an option. We don't currently use features for each CPU with the sole exception of cm7-r0p1 for the BASEPRI workaround; we do have flags for each architecture which I suppose we could use for lower bounds, but they would not be especially tight just based on architecture. Adding new features for every CPU just to get a better bound on Edit: perhaps a simpler option would be some a new |
The critical part here is the "at least". The only point of this |
Point taken. What do you think about #240 to rectify that? |
CC #236 Signed-off-by: Daniel Egger <[email protected]>
240: Better delay description r=jonas-schievink a=therealprof CC #236 Signed-off-by: Daniel Egger <[email protected]> Co-authored-by: Daniel Egger <[email protected]>
The contract for
delay
currently states:The current implementation is:
This is approximately an assembly gloss of the code
Why the divide-by-four? Well, the code appears to be assuming that the minimum amount of time that the instruction sequence could take is four cycles, and so it is taking that into account to minimize the amount of over-delaying. That cycle count is approximately accurate on M3/M4 (it's an underestimate in practice, even in zero-wait-state RAM), but is wrong on M7 or any other part with any combination of nop-folding, branch prediction, and multiple issue. As a result, on such CPUs, this function delays for substantially fewer cycles than advertised -- by roughly a factor of 2, depending on cache and memory system configuration, branch alignment, etc. (It also appears to be wrong on M33, but M33 is not superscalar, so it's probably just an instruction timing issue.)
As CPU core complexity grows, any hand-tuned cycle delay loops need to be specialized for the particular microarchitecture being used, much like in the 486/Pentium era. Alternatively, you could use a conservative implementation that, when asked to loop for at least
n
cycles, repeats the loopn
times -- that should be safe for now, but I wouldn't recommend it, particularly because we've got the zero-overhead-loop extension to ARMv8-M on the horizon.Shout out to @labbott for noticing this.
The text was updated successfully, but these errors were encountered: