SPI with the R4 is slower than with the R3 #28
Comments
Fix: SPI not re-initialized when `end()` being called due to SPI timeout.
Hi @RudolphRiedel ☕ 👋 Thank you for taking the time to take those measurements and report them 👍 . The main motivation for using the FSP layer is to achieve maximum portability across various Renesas platforms. That being said, the performance penalty incurred simply sucks. We'll be looking into ways to enhance SPI performance. |
I am afraid this is not the only bottleneck that FSP presents but SPI is kind of "my thing" right now and in general something like this really is not unique to FSP. So I sure will be having some fun going around the SPI class as I did before with a number of other controllers. On the Arduino side it really would be nice to have a common API that supports SPI transmits over DMA. |
No comments on DMA, but if you'd be willing to work on a PR improving the performance of the SPI module I'd be happy to check and merge it. Certainly we - Arduino - don't want to turn away valuable community contributions. If SPI is your thing, I kindly invite you to bring it on! 🚀 |
I can try, I wanted to play with the RA4M1 anyways. |
I'm looking forward to seeing what you're coming up with 😉 |
I posted a comparison of Uno R3 vs R4 SdFat performance on the Arduino forum. Also Saleae capture of the SPI signals here: https://forum.arduino.cc/t/uno-r4-poor-spi-performance/1143935 Summary: the R3 is about eight times faster than the R4 for the current version of SdFat.
The clock rate for the R4 is 24 MHz but there is a huge gap of about 11 μs between bytes. The Arduino SD.h library which is based on a modified 2009 version of SdFat is also very slow on the R4. I looked at the Renesas FSP SPI implementation. I suspect there is little hope of decent performance at 24 MHz without DMA. For fast SPI, it would be nice if Arduino added this member function to the SPI API. It is in many third party board support packages. The current
If txBuf is nullptr, 0XFF is sent. If rxBuf is nullptr, receive data is discarded. I looked at the RA4M1 SPI hardware. It has 32-bit RX and TX buffers so it may be possible to get close to 24 MHz with programmed I/O. The hardware has a feature that prevents overruns by clock stretching. See my above post for cases where the clock is stopped in FSP and the eighth bit is delayed for about 2 μs. This is great to avoid overruns when using both transmit and receive buffering. I have a mod for the Arduino SAMD21 core that gets close to full speed at 12 MHz using 8-bit SPI I/O. It is about three times faster than the Arduino core functions with SdFat. Here is the small but tricky implementation that takes advantage of buffer registers in the SAMD21/SAMD51.
Arduino Core performance:
With the above function:
It would be nice to have something like this for Uno R4. |
I have a couple ideas that I need to try out and I will start with the standard API before venturing into non standard API options. I am actually quite familiar with the SERCOMs from the ATSAM but this will not be of much help here. :-) |
The point of the SAMD code was that the RA4M1 has similar RX/TX buffer registers so it should be possible to do array send/receive in a loop at fairly high speed. Even more interesting is the fact that the RA4M1 has a 32 bit mode so you should be able to transfer the bulk of an array using 32-bit mode. Looks like you are using 8MHz clock to send a 228 byte array in 634µs. What rate do you get for 24MHz? I tried the R4 unmodified array transfer at 24 MHz and get about 650µs. This is for both send and receive. Here is my test program. I looked at it with Saleae and the array takes 635µs but the print is 656µs. The clock is 24 MHz.
It has the strange clock stretching with seven pulses at 24 MHz then the final pulse much later. I ran it at 8MHz and the time is the same as at 24 MHz so it is totally CPU limited. The clock looks like your clock. |
It looks vaguely like things would get better if the fsp libraries were compiled to use DMA for SPI, even if they continue to block until the end of the transfer... |
DMA is the best answer. I did DMA for a number of MCUs for use in my Arduino SdFat library. With my Due DMA function I get:
Due with the standard Arduino library using
Due with the standard Arduino library single byte
The RA4M1 seems to support SPI reliably at 24MHz clock so I would hope to see something close to half the Due rate with SdFat. This is the result for Uno R4 with the current Arduino library array transfer function at 24 MHz:
I would hope to get over 2,000 KB/sec with DMA on Uno R4. Even if DMA is not used, it should be possible to optimize array transfer by making better use of the buffering in the SPI controller. Here is the improvement I got in SAMD21 at 12 MHz: MKR ZERO Arduino standard library array transfer:
MKR ZERO with the optimized array transfer function in the above post:
|
The comment by WestfW about blocking with the current SPI API, even if you are using DMA, convinced me DMA is necessary and an improved API is needed. I ran the Uno R4 FreeRTOS-Blink example. It is wonderful to see Arduino support an RTOS on Uno R4. Next the libraries need to be RTOS friendly. The SPI API needs a clean way to implement non-blocking transfers. Most RTOS APIs support a more general array transfer. This is becoming common, sometimes with a callback parameter:
|
I decided to try RudolphRiedel's program at 24 MHz. Looks like it is fairly good. About 109µs to transfer a 228 byte buffer. That's over 2000KB/sec. This is way faster than an Uno R3. I am now going to try using 32-bit transfers for all except mod 4 of the count. That should allow fast combined write/read transfers. |
I also discovered the datasheet is incomplete and often wrong. There are even whole registers, not just bit fields, that are not documented. I am reduced to searching all of the FSP source for register names/addresses and bit fields. I wrote a program to dump all the registers and use it to see what bits are used in the current SPI library. I fought with 32-bit transfers because the bytes were not sent in the right order. I found an undocumented register, SPDCR2, that has an endian selection bit. I also can find some definitions of fields by looking at datasheets for newer members of the RA family. UGH! I now have array transfer working with 32-bit transfers. I can send a 228 byte array in 84µs, or about 2700 KB/sec. Close enough to the 3000 KB/sec max for 24 MHz clock. I should be able to do array send/receive since the time for one 32-bit transfer at 24MHz is 64 CPU cycles. Much better than 16 CPU cycles for 8-bit transfers. |
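The numbers in the comment above can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not anything from the SPI library: it assumes a 48 MHz core clock for the Uno R4 (an assumption, not measured here), counts 1 KB as 1000 bytes, and uses made-up function names.

```cpp
#include <cassert>
#include <cstdint>

// 24 MHz SPI clock as used in this thread; 48 MHz CPU clock is an assumption.
constexpr uint32_t kSpiHz = 24000000;
constexpr uint32_t kCpuHz = 48000000;

// Upper bound on throughput in KB/s (1 KB = 1000 bytes): one bit per SPI clock.
constexpr uint32_t maxKBperSec(uint32_t spiHz) { return spiHz / 8 / 1000; }

// CPU cycles that pass while one frame of `bits` bits is shifted out.
constexpr uint32_t cpuCyclesPerFrame(uint32_t bits) {
  return static_cast<uint32_t>(static_cast<uint64_t>(bits) * kCpuHz / kSpiHz);
}

// Measured throughput in KB/s for `bytes` transferred in `micros` microseconds.
constexpr uint32_t measuredKBperSec(uint32_t bytes, uint32_t micros) {
  return bytes * 1000 / micros;
}

// Compare against the figures quoted in the comment.
inline void checkThreadNumbers() {
  assert(maxKBperSec(kSpiHz) == 3000);        // "3000 KB/sec max"
  assert(cpuCyclesPerFrame(32) == 64);        // "64 CPU cycles"
  assert(cpuCyclesPerFrame(8) == 16);         // "16 CPU cycles"
  assert(measuredKBperSec(228, 84) >= 2700);  // "about 2700 KB/sec"
}
```

With four times as many CPU cycles per frame, a polling loop in 32-bit mode has far more slack to refill the TX buffer before the shifter runs dry.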
Unbelievable. This really makes you wonder why Renesas is sending developers on a scavenger hunt instead of fixing the documents. This is not really stable though so far. Oh, one other thing, at least the original API needs to use 8 bit mode. I am using buffer transfers for 32 bit and this is faster than single byte transfers, even with 8 bit transfers. DMA transfers also need to use 8 bit to support odd numbers of bytes and unaligned buffers, but then there shouldn't be any pauses left between the bytes. |
It is implemented. I found where it is disabled in the current SPI library. When I enabled it the bytes were in the correct order.
The API is void*. You just say how many bytes to transfer. I just send as much as possible as 32-bit transfers. The Cortex-M4 processor supports ARMv7 unaligned accesses so it works fine in non-DMA mode. I have done this on other MCUs. I could transfer only the part that is on word boundaries but my plan is to transfer the max amount in 32-bit mode and remaining bytes in 8-bit mode. I am implementing Either txBuf or rxBuf may be nullptr to implement pure read or write. When you write to a card you don't want to write over the sent data. When you read from an SD card you must send 0XFF bytes since the card looks for a possible command. For the Arduino array transfer I must copy data to a tmp buffer for write and fill the buffer with 0XFF for read.
void transfer(void* buf, size_t count) { transfer(buf, buf, count); } |
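To make the plan described above concrete, here is a minimal host-runnable sketch of a three-argument transfer that sends the bulk as 32-bit frames and the remaining 0 to 3 bytes as 8-bit frames. It is not the code from this thread: `exchange32()` and `exchange8()` are hypothetical stand-ins for the SPI data register access, the loopback "bus" simply returns the inverted data, and the parameter types are an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical hardware stand-ins: a real driver would write the SPI data
// register and read it back. The loopback returns the bitwise inverse.
static uint32_t exchange32(uint32_t tx) { return ~tx; }
static uint8_t exchange8(uint8_t tx) { return static_cast<uint8_t>(~tx); }

// txBuf == nullptr: 0xFF is sent. rxBuf == nullptr: received data is discarded.
void transfer(const void* txBuf, void* rxBuf, size_t count) {
  const uint8_t* tx = static_cast<const uint8_t*>(txBuf);
  uint8_t* rx = static_cast<uint8_t*>(rxBuf);
  size_t i = 0;
  // Bulk of the data as 32-bit frames; memcpy avoids alignment assumptions.
  for (; i + 4 <= count; i += 4) {
    uint32_t out = 0xFFFFFFFFu;
    if (tx) std::memcpy(&out, tx + i, 4);
    uint32_t in = exchange32(out);
    if (rx) std::memcpy(rx + i, &in, 4);
  }
  // Remaining 0 to 3 bytes as 8-bit frames.
  for (; i < count; ++i) {
    uint8_t in = exchange8(tx ? tx[i] : 0xFF);
    if (rx) rx[i] = in;
  }
}

// In-place form of the standard Arduino API, expressed via the call above.
void transfer(void* buf, size_t count) { transfer(buf, buf, count); }

// Usage: pure read sends 0xFF, so the inverting loopback returns 0x00.
inline void demoTransfer() {
  uint8_t rx[5];
  transfer(nullptr, rx, sizeof(rx));
  for (uint8_t b : rx) assert(b == 0x00);
}
```

On real hardware the byte order of a 32-bit frame depends on the controller's endian setting, which is exactly the issue discussed above; the loopback here is byte-wise and hides that.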
This is not what I meant, it is possible that the register is not implemented in all revisions of the controller and even if we happen to have the most current one it may be gone in the next.
This is something to explore, for short buffers the overhead might be too much.
I have no idea why there even is transfer(void *buf, size_t count) in the standard API. Which reminds me that I need to change transfer(void *buf, size_t count) from write-only to write/read. |
I pushed out a first draft version: This only works with SPI for now, not SCI. |
Oh, as a side note, there is PlatformIO package now: https://github.com/maxgerhardt/platform-renesas.git |
And here is the SPI trace for my first SPI.cpp draft: This is now down to 337µs for the buffer of 3+228 bytes and this is with no change to the API. How open is Arduino anyways in regards of additional functions? |
Changing the standard API is probably a big thing. Adding a function for just the R4 core probably won't fly. That's why I haven't pushed for the one I like. I got a first cut of my SdFat bench result:
The problem now is I am using the original single byte transfer to send commands and every byte takes 13µs. A 512 byte block takes 190µs or almost 2700 KB/sec. I guess I need to grab your version of This is already more than a factor of four improvement over the standard library array transfer.
Here is the code. I have only tested it with SdFat.
|
I picked up your idea of switching to 32 bit transfers and put it in transfer(void *buf, size_t count). My buffer is transferred in 266µs now, that is another 20% faster. SPI_UNO_R4_spi_class_draft.zip This is a combination of a buffer transfer and two single byte transfers, this takes less time in total now than the pause between two single bytes transfers with the original SPI.cpp. |
How fast is your array transfer at 24 MHz? For SD cards the time for 512 bytes is the key number. That will be what I will need to use in SdFat if Arduino accepts your mods. Unfortunately I need to copy data to a dummy buffer for write and fill the buffer with 0XFF for read so I will never get the max possible speed. I am not even going to try getting my mods accepted. I have a number of mods for various boards that I provide to users who need high performance. I am going to try making the 1-3 bytes a single transfer. The controller can transfer almost any number of bits. I will need to use memcpy to fetch and store the bytes. |
I have it running for about three hours now at 24 MHz and the bigger block needs about 117µs,
My block ends on three bytes, there is a pause of 680ns between 32 bit transfers, 920ns between the last 32 bit transfer and the first 8 bit transfer and 520ns between the 8 bit transfers. So, about 240ns for the switch back from 32 to 8, perhaps 320ns with extra logic. At some point one needs to give up on optimizing further. :-) Switching over to DMA though would reduce the transfer time overall by about 39µs for the buffer I am transferring. Even switching to write-only without DMA should make this about 13µs faster. |
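One way to arrive at roughly the 39µs figure is to add up the pauses between the 32-bit transfers. This is a guess at the reasoning, not something stated in the comment; the helper name is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Idle time between 32-bit transfers: one pause of gapNs per full word.
constexpr uint32_t idleBetweenWordsNs(uint32_t totalBytes, uint32_t gapNs) {
  return (totalBytes / 4) * gapNs;
}

inline void checkIdleEstimate() {
  // 3 + 228 = 231 bytes: 57 words plus 3 trailing bytes.
  assert(231 / 4 == 57 && 231 % 4 == 3);
  // 57 * 680 ns = 38760 ns, i.e. close to 39 µs.
  assert(idleBetweenWordsNs(231, 680) == 38760);
}
```

The pauses around the last three 8-bit transfers are not included, so the real saving from gap-free DMA could be slightly larger.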
What is the time for the 228 byte transfer in full duplex? My function does the transfer in about 85µs. With all overhead and the calls to micros() this code prints 96µs:
With 512 bytes it prints 202µs. |
The 117µs is in full duplex of 3+228 bytes and with the default -Os. This prints 118: uint32_t m = micros(); And 120 for txBuf[231]. A quick mod to half-duplex brought this down to 94 but it stopped working correctly. Interesting, why is my code using more time? |
What happens when you send 4 bytes? I tried something like this and ended up sending two 32 bit words instead of just the one when sending only 4 bytes. I'll give this another try tomorrow. |
Every 32-bit transfer has the same timing. 1.31µs to send then a 167ns gap. I guess it is less than 167ns since clock needs to be low for about 21ns. Here it is with markers: There are 57 transfers like this. They all have the 167ns gap. The gap varies a tiny bit since the Saleae has 500 MS/s so 2ns jitter is to be expected. |
You may have caught this. I get a warning rxbuf may be used uninitialized here. I noticed two of your comments that hit home with me.
I wrote a SPI DMA driver for Due more like 10 years ago and it was not accepted. For Due I managed to slip DMA into SdFat in a way compatible with the Standard SPI Library. Here is the difference for SD card I/O: Arduino SPI lib:
With my DMA for Due:
Hope your RA4M1 driver makes it. |
I saw that and dismissed it but your comment made me look again.
That really is odd.
I am implementing DMA left and right in my EVE library and also for Arduino targets. :-)
You really do not need to convince me. :-)
Well, the SPI class for the R4 is so slow now, something needs to be done about this. My hope would be that vendor libraries get more useful beyond proof of concept. |
This was one of the first things I noticed. I did a bit of study of the SPI controller architecture and the FSP implementation with interrupts. The TX buffer empty interrupt is faster than the RX complete interrupt. You would expect an overrun. However the config is set so the SPI clock is stretched, seven pulses then finally the eighth when the RX buffer is cleared. This holds off the TX empty interrupt since the RX buffer is full. This bit is set in the SPI.begin() call.
Look carefully at one of my earlier posts for the clock stretching. The clue was the SPI analyzer showed the time for a byte is really long because of clock stretching. So the time only depends on interrupt times because of the TX and RX buffers plus clock stretching. This was my clue I could load two TX transfers in the first loop of my optimized function. I put a delay between the first loop and the second loop and got clock stretching, not an RX buffer overrun. |
I only saw that when I still tried to figure out what FSP is configuring differently to allow the SPI to work at all. |
NXP, yeah, I had some fun with S32K. And with ST I had the least issues, I am mostly using the LL HAL functions and that stuff is on point: __STATIC_INLINE void LL_SPI_TransmitData8(SPI_TypeDef *SPIx, uint8_t TxData) The only "hiccup" I found so far is that for the STM32H7 it is I do not actually use STM32 controllers though as these are not automotive qualified. |
Too bad you don't like RTOSs. A guy who works or did work for ST did an RTOS and HAL for many STM32 processors. DMA is available for almost everything but you have a choice. Here is the ChibiOS site. The tiny kernel I ported to AVR Arduinos, Due and Teensy is his. I have a huge collection of the ST NUCLEO boards. I evaluate processors on these then find the best board for my project. Unlike Arduino, he encourages users to contribute. He has a section for community supported processors from other vendors. I did a data logger that logged data to an SD at 2 million samples per second using his DMA ADC and DMA SPI. The STM32 ADC is amazing. You can set up a list of channels and the DMA controller automatically processes it. Edit: Here is what he supports in the HAL for popular STM32 chips:
|
Here is a list of boards with ChibiOS examples:
|
Well, it's not so much that I do not like RTOS, it is more that I have no use for an extra software layer, especially not on a single core and relatively slow microcontroller. |
Guess I worked in a different world. Fast for this experiment means almost a billion events per second with a combined data volume of more than 60 million megabytes per second from a million channels filtered by hardware triggers to where a 40,000 core processor farm can select 1,000 interesting events per second. In this world RTOSs have been key since the late 1980s. Even for "slow single core MCUs". I did early work on the architecture of the network between the detector and the farm by simulation of a huge Clos network of Ethernet switches. This idea was finally used. I am a physicist, not an engineer but I like to play with fast MCUs in my hobby. |
Yes, I would say so, I am a bit pusher in the embedded world of automotive and for the most part did small CAN and LIN nodes with rather limited resources. :-) |
You may see an RTOS I was involved with in the auto world. In about 1983 Physics funding dried up and I left the lab to lead a project moving American Express from paper to electronic data. We needed to image and OCR paper until terminals for credit cards were installed. I wanted an RTOS to control cameras installed on fast card sorters. Two other lab employees, Dave Wilner and Jerry Fiddler, founded Wind River Systems. Each contributed $3,000 and a desk to the business and took the public RTOS code from the lab and developed VxWorks. I took a chance and funded development. They became very wealthy and I went back to physics. VxWorks went to Mars in the 1990s and is now in aerospace and many other safety critical systems. I can't use it now, $18,000 a seat. Intel owns Wind River now so knowing Jerry and Dave doesn't help. |
Dear @greiman and @RudolphRiedel ☕ 👋 Fascinating as I may find to observe your spirited discussion, may I ask you to confine this conversation to the issue at hand? Perhaps you could exchange email addresses and discuss your non-SPI-related topics somewhere else🙏 😉 🙇 . |
I may not have the latest version so ignore this if this is not a problem. The SPCR2_SCKASE bit seems to no longer be set so transfer(buf, count) now hangs if an interrupt happens and takes too long. To see this, add a delayMicroseconds() like this. I used 24MHz clock.
It will hang for an array transfer long enough to cause an RX overrun. Set SPCR2_SCKASE like this and it will work even with the delayMicroseconds().
Too bad the API that is needed for fast SD read/write is not allowed. SD on R4 boards are really fast with it. Write: 2243 KB/sec I will use transfer(buf, count) but it requires memcpy, tmp buffers and filling arrays with 0XFF. Much of the speed is lost. I will provide users mods to use it if they need speed. I already do it for Due, SAMD and other boards.
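The cost mentioned above comes from wrapping the in-place call. A minimal sketch of what such wrappers might look like follows; it is an illustration only, not SdFat code. `transferInPlace()` stands in for the standard `SPI.transfer(buf, count)`, the `wire` array records what went out so the sketch can run on a PC, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr size_t kBlock = 512;   // SD block size
static uint8_t wire[kBlock];     // what was last sent (test aid only)

// Stand-in for the in-place SPI.transfer(buf, count): records the outgoing
// bytes, then overwrites buf with the "received" data (0x00 here).
static void transferInPlace(void* buf, size_t count) {
  assert(count <= kBlock);
  std::memcpy(wire, buf, count);
  std::memset(buf, 0x00, count);
}

// Write: copy to a scratch buffer first so the caller's data is not clobbered.
void sdWriteBlock(const uint8_t* src) {
  uint8_t tmp[kBlock];
  std::memcpy(tmp, src, kBlock);
  transferInPlace(tmp, kBlock);
}

// Read: pre-fill with 0xFF so the card sees idle bytes, not a command.
void sdReadBlock(uint8_t* dst) {
  std::memset(dst, 0xFF, kBlock);
  transferInPlace(dst, kBlock);
}
```

Both wrappers touch every byte once more than a transfer with separate TX and RX pointers would, which is where the speed is lost.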
Sorry about going off topic. |
I am not sure of the accuracy of any of the above measurements using micros(). micros() seems to have a bug. Here is the problem:
Here is the print out for the first few cases with bgn > end.
|
Yes, SPCR2_SCKASE is no longer set. I even went as far as putting a delay(5); in the (index_tx < n32) loop and it works fine, although very slow. Very nice, thank you for spotting this! |
Hmm. Has anyone looked at the "Data Transfer Controller" ("DTC") ? |
The chip has the DWT ARM component, so there's a cycle counter that is pretty easy to access... |
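For reference, a minimal sketch of using the DWT cycle counter for timing. The register addresses are the architectural ARMv7-M ones; the 48 MHz core clock and all helper names are assumptions, and off-target the sketch falls back to stubs so it still compiles.

```cpp
#include <cassert>
#include <cstdint>

// Assumed core clock for the Uno R4 (48 MHz); adjust if the clock tree differs.
constexpr uint32_t kCpuHz = 48000000;

#if defined(__arm__)
// Architectural Cortex-M debug registers (ARMv7-M).
static volatile uint32_t& regDemcr  = *reinterpret_cast<volatile uint32_t*>(0xE000EDFC);
static volatile uint32_t& regCtrl   = *reinterpret_cast<volatile uint32_t*>(0xE0001000);
static volatile uint32_t& regCyccnt = *reinterpret_cast<volatile uint32_t*>(0xE0001004);

inline void cycleCounterBegin() {
  regDemcr |= (1u << 24);  // TRCENA: enable the DWT block
  regCyccnt = 0;
  regCtrl |= 1u;           // CYCCNTENA: start counting CPU cycles
}
inline uint32_t cycleCounterRead() { return regCyccnt; }
#else
// Host fallback so the sketch compiles off-target; always reads 0.
inline void cycleCounterBegin() {}
inline uint32_t cycleCounterRead() { return 0; }
#endif

// Unsigned subtraction stays correct across one 32-bit wrap of the counter.
inline uint32_t elapsedCycles(uint32_t start, uint32_t end) { return end - start; }

// Convert cycles to nanoseconds at kCpuHz.
inline uint32_t cyclesToNanos(uint32_t cycles) {
  return static_cast<uint32_t>(static_cast<uint64_t>(cycles) * 1000000000ull / kCpuHz);
}

inline void checkConversion() { assert(cyclesToNanos(48) == 1000); }
```

Typical use would be to call `cycleCounterBegin()` once, read the counter before and after the code under test, and convert `elapsedCycles()` with `cyclesToNanos()`.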
I have now and I would not call it "simplified", at least not in regards of using the unit with registers not directly accessible. I am looking at the DMAC now and it can only transfer blocks of 1 to 1024 data units. |
Thanks, I know how to use it. I will also use it a lot on R4 boards since a call to micros() currently takes 8-9µs. That's why the difference above is 992µs. If you look at micros() you will see an interrupt protection bug that results in one extra tick of the millis clock in the beginning call to micros(). So off by an extra 1000µs happens maybe about once in 10000-20000 times. Of course you could have an extra tick in the end call. I didn't test for that. Edit: See this. There is an extra 1000µs in both calls. It also happens far more often than I suspected - surprised it was not found by Arduino. So intervals measured by micros are +- more than 1000µs. More because of time for call. |
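The kind of bug described above can be shown on a simulated clock. The sketch below is a generic illustration of a torn read between a millisecond tick count and a sub-millisecond counter, together with the usual re-read guard. It is not the core's actual micros() code, and every name in it is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Simulated timebase: `ticks` counts milliseconds (bumped by a timer ISR),
// `sub` is the 0..999 µs position inside the current millisecond.
struct FakeClock {
  uint32_t ticks = 0;
  uint32_t sub = 0;
  int isrAfterReads = -1;  // fire the simulated tick ISR after this many reads
  int reads = 0;

  void maybeTick() {
    if (++reads == isrAfterReads) { ++ticks; sub = 0; }
  }
  uint32_t readTicks() { uint32_t v = ticks; maybeTick(); return v; }
  uint32_t readSub()   { uint32_t v = sub;   maybeTick(); return v; }
};

// Unguarded: the two reads can straddle a tick and combine stale and new values.
uint32_t microsNaive(FakeClock& c) {
  uint32_t sub = c.readSub();
  uint32_t ms = c.readTicks();
  return ms * 1000 + sub;
}

// Guarded: re-read the tick count and retry if a tick slipped in between.
uint32_t microsGuarded(FakeClock& c) {
  uint32_t ms, sub;
  do {
    ms = c.readTicks();
    sub = c.readSub();
  } while (ms != c.readTicks());
  return ms * 1000 + sub;
}

// At 5.999 ms with a tick arriving mid-read, the naive version is ~1000 µs off.
inline void demoTornRead() {
  FakeClock a; a.ticks = 5; a.sub = 999; a.isrAfterReads = 1;
  assert(microsNaive(a) == 6999);
  FakeClock b; b.ticks = 5; b.sub = 999; b.isrAfterReads = 1;
  assert(microsGuarded(b) == 6000);
}
```

On real hardware the same effect is usually avoided either with a retry loop like this or by checking the timer's pending-overflow flag inside a critical section.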
This could be useful in traces of SPI performance. Here is an interesting post for fast bare-metal GPIO on RA4M1 to put markers on a scope or logic analyzer channel. I was about to work this out but she has already done it. 83 ns or four 20.83 ns clocks to set or clear a pin value. Arduino post: GitHub file: Looks like she is testing the ADC on R4. A big interest for me. |
That is one odd GPIO module after working with ATSAM for a while now. |
I just remembered this open point, any news from Renesas regarding this? |
Hi @RudolphRiedel ☕ 👋 I've relayed your request (that I happen to share) to our contacts at Renesas already in July, when the topic first came up. The request was received positively but I was informed that delivering an updated datasheet may be delayed due to the holiday season. I have just restated my request to Renesas (the initial CC'd request had never left my inbox - so I'm not going to forget this) and hope I'll get a positive reply sometime soonish. All the best, Alex |
Fix: SPI not re-initialized when `end()` being called due to SPI timeout. Former-commit-id: 2fb696c
The SPI with the R4 is really slow, way slower than with the R3 when running the same clock and the same code.
SPI_Arduino_Uno.zip
This has three files for use with Saleae Logic 2.
SPI_Arduino_Uno_R3_SPI_Transfer.sal - single byte SPI.transfer()
SPI_Arduino_Uno_R4_SPI_Transfer.sal - single byte SPI.transfer()
SPI_Arduino_Uno_R4_SPI_Transfer_Buffer.sal - buffer transmit, for 4 bytes and for the whole large buffer
The "large" transfer has 3 bytes address + 228 bytes data.
R3: 524µs including all calculations necessary for that data
R4: 2.52ms including all calculations necessary for that data - so this takes almost five times longer
R4 buffer: 634µs without any calculations, only transferring the buffer
Is this something that is actively looked into so far?
The underlying FSP library from Renesas is rather slow.
It configures the SPI on every single transfer, even going as far as activating and deactivating the SPI thru the enable bit.
On the R3 I get <2µs pauses between two transfers.
On the R4 there are 11µs between two transfers - at three times the core clock.
I modified SPI.cpp to use a busy loop with TX only for the buffer transfer:
This brought down the time to transfer the buffer to 310µs with 400ns pauses.
Still not what the RA4M1 could do but at least it beats the AVR on the R3.
So this is entirely a software issue.
Following a couple of function calls into R_SPI_WriteRead() to check out what is going on there, I found that the buffer is sent with an interrupt function.
So essentially it looks like the function is spending more time going into and out of that interrupt than it takes to transfer a byte.
And on top the RA4M1 has DMA, please make use of it.
Preferably take inspiration from the Teensy 4 SPI class that has DMA support baked in to allow this:
This "event" is for a callback function, a custom one that can be attached to the event and allows code to be executed when the DMA is done like setting CS to high again.
And setting "dest" to NULL makes the operation write only.