`enable_icache` is wrong on M7 without `inline-asm` feature #232
+1 for rewriting these routines (or at least the critical parts of them) in assembly; I don't think we can rely on LLVM to always produce the right instruction sequence here.
Ah, as soon as I saw the email subject I realised exactly what had happened. Thanks for the detailed report. On the whole you're right, and this will need to be fixed by moving the routine to assembly. We're really desperate for proper Thumb intrinsics! Or I guess the new inline assembly would also suffice.
https://github.com/rust-embedded/cortex-m/blob/master/src/asm.rs#L221 ;) As it happens the `extern "C"` does add something with the effect of a memory clobber, so it should prevent reordering of memory accesses in the output, but an explicit fence wouldn't hurt. We had much the same issue recently around ordering when entering/leaving a CriticalSection.
For the record, this is the same problem people hit more often when using the externally linked 'intrinsics' to modify the stack pointer (e.g. to bootload), where the …
Any comment asking if a barrier is required should be replaced by a barrier. :-) You might lose some performance, but you're less likely to lose correctness. I largely agree with @jonas-schievink, though I would go one step farther and suggest that … But if that seems too disruptive, then yes, going through every use and auditing why it's using a barrier, which ARM documentation it was following, whether the instruction order is actually significant, whether it's using the correct combination of machine and compiler fences, and then whether it needs to be rewritten in assembly anyway, would also work.
There are definitely use cases where … I suspect every case using those barriers needs to prevent memory reordering across the barrier, but at present either the function call is an extern function call (which implies a memory clobber)* or has an explicit memory clobber in the inline asm, which will prevent memory reordering around the call*.

\* gosh, I hope so, anyway
I'm pretty sure that's not a thing, except for code that is cargo-culting …
Okay, I admit that I have found a case where … What about marking …
"memory trouble? apply patented tincture …"
I think that's a bug in the … How about:
sgtm.
I'm concerned that this doesn't proactively alert any current users that their code is wrong, the way a deprecation might. But, yes. This would be good.
Yes, if you can cite docs, the barriers should be there. It looks like the cortex-m crate is using these barriers primarily in cache cases, so they are probably fine. (I'll file a separate bug on how …)
I have definitely had intermittent failures when enabling the dcache (in fact, merely enabling the icache was not enough to trigger this in my code). I'm not sure if it was the dcache sequence or the interaction of the two, though. I'd use the asm snippet and the barriers in both cases.

In reading these cache maintenance functions -- are they written assuming they are executing in critical sections? Currently you could take an interrupt after enabling/disabling the icache and before the barriers, which would be bad. (Worst case, the interrupt could have occurred shortly before the operation, and so the next instruction in the pipeline begins the interrupt handler.) Likewise between disabling the dcache and cleaning. These are privileged operations, so the standard …
Yeah, it really wants its …
What would you replace all the intrinsics with, if deprecating them? Or are you only thinking about ISB/DSB (which we'd still need to offer somehow, of course)?
It would be interesting to work out what's causing the failure here in case we need to put more things into the asm blob. Is it some interaction with the stack when dcache is getting turned on/off? Do we end up needing to do cache cleaning+disable all in asm in case a call hits the stack between disable and clean?
The ARM CMSIS cache management methods do not enter a critical section or talk about them in the documentation, but clearly taking an interrupt in the middle of this sequence would be ruinous, so I wonder what's up with that. We could consider adding a critical section to all the cache management operations, but the overhead might be annoying in the more frequently executed methods.
Oh, sorry, I brought up deprecation as an example of how to communicate an issue to existing users. I don't know that you want to deprecate the existing intrinsics. Unless they prove unsalvageable of course.
Probably. Any sequence provided as a literal assembly listing in either the v7-M ARM or the M4/M7 TRMs needs to be read very carefully before being rewritten in a higher-level language. I'd have to squint at cleaning and invalidation particularly.
Vendor code is often wrong. :-) I'm glad you are also seeing the issue there; I was concerned that I had missed something. In a higher-level crate, if being in a critical section gives you a witness type, you could require that type to be passed in, proving a critical section exists, to avoid the overhead. (I felt like the …
The CMSIS routines ARM recommends for cache management are fully in C, so at least in principle I feel like we should be able to get Rust to do the right thing; the only caveat is the DSB/ISB intrinsics. I wonder if we can come to a clearer understanding of what precisely is causing the fault behaviour and so where a branch must be avoided and where it's acceptable; I'd rather not just rewrite the entire cache management in assembler just to be sure.
That's still the case; …
This is a case where the correctness of the code, if written in a higher-level language, will require some understanding of compiler behavior.
I agree that you could, in principle, write C code in Rust (i.e. avoiding register abstractions, using raw pointer writes) that would compile similarly[1].

The lack of a critical section requirement in the CMSIS code still appears to be a bug/omission in the CMSIS code. Based on past (and recent) experience I'd caution against assuming CMSIS code is correct without a deep understanding of how ARMCC processes it -- particularly in cases like this where the ARM recommendation interacts with vendor-implemented memory hierarchies and the like.

[1]: note that I'm pretty sure I have observed the compiler out-lining …
I'll have a go at writing a reduction. But in the case of the basic cache maintenance operations, the ARM docs (which you excerpted) are pretty clear, and I'd caution against assuming that you can control instruction generation too precisely through several levels of indirection/compiler. It seems potentially fragile. (For instance, my …)

Re: witness types, …
Oh good! So that would be an opportunity for a low-overhead safe API for unprivileged/monolithic apps.
We cannot currently write the equivalent to the C code in Rust due to lack of stable ARM intrinsics and inline assembly - either of the two would be needed to guarantee that …
I definitely agree we need to at least write the parts of the sequence that must not have a branch in the middle in assembly (e.g. enabling icache followed by ISB); I'm just wondering if we will also have to write all the rest of the cache maintenance (e.g. the clean and invalidate loops and so on) in asm too. Some of it's at least marginally nontrivial (e.g. invalidate_dcache_by_address).
I've got an intermittent repro case, but it's intermittent. I've gotten it to crash something like 1% of the time. Machine state at crash looks suspiciously similar to this instability in Linux early in its M7 port. I'm reviewing the patches that followed that to see what they changed. (Note that Linux's cache maintenance operations are all in assembly.) Instruction sequence around the crash is …
The word where the fault occurs seems to read as zero when the fault happens, rather than the correct instruction.

Edit: Ah, the fault is reproducible by running a different non-cached binary that does not have …
Are you testing with dcache on/off/both, and also is the branch predictor enabled?
I've tested with both, but in this case it is crashing immediately before attempting to enable the dcache. Branch predictor is on, though I didn't request it; perhaps the runtime is enabling it? Dump of the M7's IEBR registers shows …
Only interesting contents are in IEBR1 there, but IEBR1 sez: …
So there has been a cache tag ECC error. If I'm doing my set-associative cache math right, that looks like it's on the line immediately after where the dsb function should be, which (given that the branch went to the last two bytes of the line, containing the first two bytes of a four-byte instruction) isn't totally shocking.

CFSR says usage error (invalid instruction). HFSR says forced (usage fault escalated). Not super useful there.

So it seems like a branch racing instruction cache initialization is bad. It would be harder to demonstrate similar issues on dcache initialization, because you'd have to get a load to issue at the right time, but with the M7's ability to dual-issue loads it ought to be doable if you think it's important. (Me, I say let's take ARM's advice and not put instructions in between cache enable and barrier.) I have no data on other cache operations yet.

As for the invalidate-by-address operation, as long as none of the functions it calls are transitively referencing any …
See rust-embedded#232, which this partially fixes -- there's still the question of taking an interrupt in the midst of these sequences.
Got a draft PR up -- this doesn't address the need for a critical section. It would be safe to simply use cps in the assembly sequences, because the SCB is only accessible to privileged code anyway. If you have no objections I'll amend the PR.
Thanks! Do you think those are the only two methods which require writing in assembly, then? Re the critical section, if you did just use cps in the assembly you'd still need to store the current state before restoring, in case we're already operating with interrupts off; maybe better to just put a …
No, I think those are the only two that we appear to agree need to be in assembly. :-) I'd do the other cache maintenance functions as I've said.
Yes, of course. The normal pattern is …
Advantage: assembly routine becomes more clearly correct; interrupts are disabled for a shorter and more predictable period. Disadvantage: critical section hidden in assembly. (Edit: additional disadvantage: y'all appear to be replicating all assembly sequences in two places without mechanisms to prevent divergence(?), so I guess any longer sequence is a risk.) I could go either way. Got a preference?
234: Use assembly sequences to enable caches. r=adamgreig a=cbiffle See #232, which this partially fixes -- there's still the question of taking an interrupt in the midst of these sequences. Co-authored-by: Cliff L. Biffle <[email protected]>
(for future reference, further discussion was in #234). Thanks very much for reporting this and getting the fix in; I know the crate hasn't seen enough use with priv/unpriv use cases and I'm glad to get fixes merged to help that.
Hi all, I found this thread while looking for related topics. Have not tested, only looked.

I think the actual problem was with the code that was being inserted before the … The inserted code was wrong to have been placed there (adding compiler barriers to …). The code as-is today could use additional memory barriers per Arm references, and the exception masking is likely unnecessary. This code could still crash if an exception were to be taken after …

Details

Cortex-M7 does not have automatic cache invalidation, so the cache contents are undefined after reset. Before enabling the instruction cache or data cache, they must be invalidated. The code must wait for this invalidation to complete by using a … CMSIS 4 had this same missing … Refer to CMSIS 5 or 6 for examples on both topics.

I do not think disabling exceptions is required for cache maintenance. If lookup is enabled by … An application is still ultimately responsible for ensuring cache maintenance is complete, whenever the application expects it to be completed.

Code

…
This has been a fun one to track down.
tl;dr: if enabling instruction and/or data caches on your M7 is causing weird crashes, turn on the cortex_m `inline-asm` feature and make sure you're compiling at `--release`. Oh, and you're on nightly now. Sorry.

Details
I've got an STM32H7B3 here, and shortly after enabling the icache I would sometimes get hard faults that looked suspiciously like it trying to execute random data. (I'm doing this early enough in boot that I don't have configurable faults broken out, but they were either BF or MMF.)
The key here is sometimes -- some builds worked okay. Until I tried turning on dcache. Then things went squirrely.
It came down to: how many instructions did the compiler decide to insert between turning on the cache and the `dsb`/`isb` sequence? If the answer is not zero and contains a branch or memory access, weird things will happen.

The current implementation of `enable_icache` reads as follows: …

The tail end of that function is almost right. `modify` will end in a `str`. However,

- there's no `compiler_fence` or equivalent to prevent instructions from being rearranged above the barriers. (Perhaps we can assume this won't happen because they're `extern "C"`? I'm fairly certain I just saw it happen but I didn't save the listing.)
- `dsb` and `isb` default to being function calls

This means the actual dynamic instruction trace here is: …
In other words, we've got one branch in the pipeline before the write even takes effect, and no fewer than three before the `isb` causes a prefetch/pipeline flush. These instructions often execute incorrectly. (On a reasonably fast M7, "often" becomes "always" in my testing.)

Enabling the `inline-asm` feature fixes this by changing the dynamic trace to be what you'd expect: …

This is probably an issue with any existing code attempting to use `isb` to synchronize with memory map or cache changes. To do this correctly, one might consider making `enable_icache` (and `enable_dcache`, though it crashes less reliably) contingent on the `inline-asm` feature, but that only produces the right instruction sequence at `--release`.

These routines probably need to be written in assembly to be correct on stable toolchains (and on debug builds on nightly toolchains, for that matter, to bind the compiler's hands). I haven't thought about the equivalent `disable` cases in detail, but they and the clean/invalidate routines probably bear review as well.