Skip to content

SPI with the R4 is slower than with the R3 #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
RudolphRiedel opened this issue Jul 3, 2023 · 75 comments
Closed

SPI with the R4 is slower than with the R3 #28

RudolphRiedel opened this issue Jul 3, 2023 · 75 comments

Comments

@RudolphRiedel
Copy link
Contributor

RudolphRiedel commented Jul 3, 2023

The SPI with the R4 is really slow, way slower than with the R3 when running the same clock and the same code.

SPI_Arduino_Uno.zip

This has three files for use with Saleae Logic 2.

  • SPI_Arduino_Uno_R3_SPI_Transfer.sal - single byte SPI.transfer()
  • SPI_Arduino_Uno_R4_SPI_Transfer.sal - single byte SPI.transfer()
  • SPI_Arduino_Uno_R4_SPI_Transfer_Buffer.sal - buffer transmit, for 4 bytes and for the whole large buffer

The "large" transfer has 3 bytes address + 228 bytes data.
R3: 524µs including all calculations necessary for that data
R4: 2.52ms including all calculations necessary for that data - so this takes almost five times longer
R4 buffer: 634µs without any calculations, only transferring the buffer

Is this something that is actively looked into so far?

The underlying FSP library from Renesas is rather slow.
It configures the SPI on every single transfer, even going as far as activating and deactivating the SPI thru the enable bit.

On the R3 I get <2µs pauses between two transfers.
On the R4 there are 11µs between two transfers - at three times the core clock.

I modified SPI.cpp to use a busy loop for with TX only for the buffer transfer:

uint8_t *buffer = (uint8_t *) buf;

_spi_ctrl.p_regs->SPCR = 0x4a; /* enable, no irq, master, tx only */

for(size_t index = 0; index < count; index++)
{
    _spi_ctrl.p_regs->SPDR_BY = buffer[index];
    while (0 == _spi_ctrl.p_regs->SPSR_b.SPTEF) {}
}
    while (_spi_ctrl.p_regs->SPSR_b.IDLNF) {}

_spi_ctrl.p_regs->SPCR = 0xb9; /* off, interrupts, master, full duplex */

This brought down the time to transfer the buffer to 310us with 400ns pauses.
Still not what the RA4M1 could do but at least it beats the AVR on the R3.
So this is entirely an software issue.

A couple of function calls into R_SPI_WriteRead() to check out what is going on there I found that
the buffer is send with an interrupt function.
So essentially it looks like the function is spending more time going into and out of that interrupt than it takes to transfer a byte.

And on top the RA4M1 has DMA, please make use of it.
Preferably take inspiration from the Teensy 4 SPI class that has DMA support baked in to allow this:

SPI.transfer(source, dest, count, event);
SPI.transfer( ((uint8_t *) &EVE_dma_buffer[0]) + 1U, NULL, (((EVE_dma_buffer_index) * 4U) - 1U), EVE_spi_event);

This "event" is for a callback function, a custom one that can be attached to the event and allows code to be executed when the DMA is done like setting CS to high again.
And setting "dest" to NULL makes the operation write only.

maidnl pushed a commit that referenced this issue Jul 4, 2023
Fix: SPI not re-initialized when `end()` being called due to SPI timeout.
@aentinger
Copy link
Contributor

Hi @RudolphRiedel ☕ 👋

Thank you for taking the time to take those measurements and report them 👍 .

The main motivation for using the FSP layer is to achieve maximum portability across various Renesas platforms. That being said, the performance penalty incurred simply sucks.

We'll be looking into ways to enhance SPI performance.

@RudolphRiedel
Copy link
Contributor Author

I am afraid the this is not the only bottleneck that FSP presents but SPI is kind of "my thing" right now and in general something like this really is not unique to FSP.

So I sure will be having some fun going around the SPI class as I did before with a number of other controllers.
But while this works it really reduces the compatibility, playing nice with other SPI nodes is more difficult to achieve then.

On the Arduino side it really would be nice to have a common API that supports SPI transmits over DMA.

@aentinger
Copy link
Contributor

On the Arduino side it really would be nice to have a common API that supports SPI transmits over DMA.

No comments on DMA but if you'd be willing to work on PR improving the performance for the SPI module I'd be happy to check and merge it. Certainly we - Arduino - don't want stay of valueable community contributions. If SPI is your thing, I kindly invite you to bring it on! 🚀

@RudolphRiedel
Copy link
Contributor Author

I can try, I wanted to play with the RA4M1 anyways.
And even if my pull request gets rejected I still will have had some fun with it and code that I can use for myself. :-)
Don't hold your breath though. :-)

@aentinger
Copy link
Contributor

I'm looking forward to see what you're coming up with 😉

@greiman
Copy link

greiman commented Jul 6, 2023

I posted a comparison of Uno R3 vs R4 SdFat performance on the Arduino form. Also Saleae capture of the SPI signals here:

https://forum.arduino.cc/t/uno-r4-poor-spi-performance/1143935

Summary: the R3 is about eight times faster than the R4 for the current version of SdFat..

R4 single bytes transfer standard SPI library:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
82.98,7164,5160,6162
82.98,7176,5160,6163

R3 array with my custom AVR SPI functions:
write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
689.18,760,728,736
689.18,760,728,736

The clock rate for the R4 is 24 MHz but there is a huge gap of about 11 μs between bytes.

The Arduino SD.h library which is based on a modified 2009 version of SdFat is also very slow on the R4.

I looked at the Renesas FSP SPI implementation. I suspect there is little hope of decent performance at 24 MHz without DMA.

For fast SPI, it would be nice if Arduino added this member function to the SPI API. It is in many third party board support packages. The current void transfer(void *buf, size_t count) is difficult to use since it requires a temporary buffer in many cases.

void transfer(const void* txBuf, void* rxBuf, size_t count);

If txBuf is nullptr, 0XFF is sent. If rxBuf is nullptr receive data is discarded.

I looked at the RA4M1 SPI hardware. It has 32-bit RX and TX buffers so it may be possible to get close to 24 MHz with programmed I/O. The Hardware has a feature that prevents overruns by clock stretching. See my above post for cases where the clock is stopped in FSP and the eighth bit is delayed for about 2 μs. This is great to avoid overruns when using both transmit and receive buffering.

I have a mod for the Arduino SAMD21 core that gets close to full speed at 12 MHz using 8-bit SPI I/O. It is about three times faster than the Arduino core functions with SdFat.

Here is the small but tricky implementation that takes advantage of buffer registers in the SAMD21/SAMD51.

void SERCOM::transferDataSPI(const void *txBuf, void *rxBuf, size_t count) {
  const uint8_t *tx = reinterpret_cast<const uint8_t *>(txBuf);
  uint8_t *rx = reinterpret_cast<uint8_t *>(rxBuf);
  size_t ir = 0;
  size_t it = 0;
  if (rx) {
    while (it < 2 && it < count) {
      if (sercom->SPI.INTFLAG.bit.DRE) {
        sercom->SPI.DATA.reg = tx ? tx[it] : 0XFF;
        it++;
      }
    }
    while (it < count) {
      if (sercom->SPI.INTFLAG.bit.RXC) {
        rx[ir++] = sercom->SPI.DATA.reg;
        sercom->SPI.DATA.reg = tx ? tx[it] : 0XFF;
        it++;
      }
    }
    while (ir < count) {
      if (sercom->SPI.INTFLAG.bit.RXC) {
        rx[ir++] = sercom->SPI.DATA.reg;
      }
    }
  } else if (tx && count) {  // might hang if count == 0
    // Writing '0' to this bit will disable the SPI receiver immediately.
    // The receive buffer will be flushed, data from ongoing receptions
    // will be lost and STATUS.BUFOVF will be cleared.
    sercom->SPI.CTRLB.bit.RXEN = 0;
    while (it < count) {
      if (sercom->SPI.INTFLAG.bit.DRE) {
        sercom->SPI.DATA.reg = tx[it++];
      }
    }
    // wait till all data sent
    while (sercom->SPI.INTFLAG.bit.TXC == 0) {
    }
    // Writing '1' to CTRLB.RXEN when the SPI is enabled will set
    // SYNCBUSY.CTRLB, which will remain set until the receiver is
    // enabled, and CTRLB.RXEN will read back as '1'.
    sercom->SPI.CTRLB.bit.RXEN = 1;
    while (sercom->SPI.CTRLB.bit.RXEN == 0) {
    }
  } 
} 

Arduino Core performance:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
468.78,1111,1087,1089

With the above function:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1359.80,17535,368,374

It would be nice to have something like this for Uno R4.

@RudolphRiedel
Copy link
Contributor Author

I have a couple ideas that I need to try out and I will start with the standard API before venturing into non standard API options.
The challenge is of course to make this robust and non breaking as well. :-)
Well, I already sped up things significantly here.

I am actually quite familiar with the SERCOMs from the ATSAM but this will not be of much help here. :-)

@greiman
Copy link

greiman commented Jul 6, 2023

RudolphRiedel

The point of the SAMD code was that the RA4M1 has similar RX/TX buffer registers so it should be possible to do array send/receive in a loop at fairly high speed.

Even more interesting is the fact that the RA4M1 has a 32 bit mode so you should be able to transfer the bulk of an array using 32-bit mode.

Looks like you are using 8MHz clock to send a 228 byte array in 634µs.

What rate do you get for 24MHz?

I tried the R4 unmodified array transfer at 24 MHz and get about 650µs. This is for both send and receive.

Here is my test program. I looked at it with Salese and the array takes 635µs but the print is 656µs. The clock is 24 MHz.

#define CS_PIN 10
#define SPI_CLOCK 24000000
void test() {
  uint8_t buf[228];
  SPI.beginTransaction(SPISettings(SPI_CLOCK, MSBFIRST, SPI_MODE0));
  uint32_t m = micros();
  SPI.transfer(buf, sizeof(buf));
  m = micros() - m;
  Serial.println(m);
}

It has the strange clock stretching with seven pulses at 24 MHz then the final pulse much later.

Jul6

I ran it at 8MHz and the time is the same as at 24 MHz so it is totally CPU limited. The clock looks like your clock.

@WestfW
Copy link

WestfW commented Jul 7, 2023

It looks vaguely like things would get better if the fsp libraries were compiled to use DMA for SPI, even if they continue to block until the end of the transfer...

@greiman
Copy link

greiman commented Jul 7, 2023

It looks vaguely like things would get better if the fsp libraries were compiled to use DMA for SPI, even if they continue to block until the end of the transfer...

DMA is the best answer. I did DMA for a number of MCUs for use in my Arduino SdFat library.

With my Due DMA function I get:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
4533.09,2313,110,111
4537.21,127,110,111

Due with the standard Arduino library using void transfer(void* buf, size_t count) array transfer:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1598.98,336,317,318
1597.95,336,317,318

Due with the standard Arduino library single byte uint8_t transfer(uint8_t data) in a loop:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
693.39,754,735,736
693.19,754,735,736

The RA4M1 seems to support SPI reliably at 24MHz clock so I would hope to see something close to half the Due rate with SdFat.

This is the result for Uno R4 with the current Arduino library array transfer function at 24 MHz:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
430.33,2638,634,1181
429.85,2663,634,1183

I would hope to get over 2,000 KB/sec with DMA on Uno R4.

Even if DMA is not used, it should be possible to optimize array transfer using the buffering in the SPI controller better.

Here is the improvement I got in SAMD21 at 12 MHz:

MRK ZERO Arduino standard library array transfer:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
425.97,1220,1196,1199
425.89,1221,1196,1199

MRK ZERO with the optimized array transfer function in the above post:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1365.37,396,368,372
1365.37,394,368,372

@greiman
Copy link

greiman commented Jul 7, 2023

The comment by WestfW about blocking with the current SPI API, even if you are using DMA, convinced me DMA is necessary and an improved API is needed.

I ran the Uno R4 FreeRTOS-Blink example. It is wonderful to see Arduino support an RTOS on Uno R4. Next the libraries need to be RTOS friendly.

The SPI API need a clean way to implement non-blocking transfers.

Most RTOS APIs support a more general array transfer. This is becoming common, sometimes with a callback parameter:

void transfer(const void* txBuf, void* rxBuf, size_t count, callback_t callback = nullptr);

@greiman
Copy link

greiman commented Jul 7, 2023

I decided to try RudolphRiedel's program at 24 MHz.

Looks like it is fairly good. About 109µs to transfer a 228 byte buffer. That's over 2000KB/sec. This is way faster than an Uno R3.

jul7

I am now going to try using 32-bit transfers for all except mod 4 of the count. That should allow fast combined write/read transfers.

@RudolphRiedel
Copy link
Contributor Author

Hello,

I was fighting for the last couple of days to speed up single byte transfers, baby steps, understanding the unit before going further.
And it did not work at all.

Right now I have it up and running:
grafik

However, on the way I found out that the datasheet is not correct and that I need to set SPDCR to 0x040.

According to the datasheet though, b7 and b6 are "Reserved": "These bits are read as 0. The write value should be 0."
grafik

This is from R7FA4M1AB.h:
union
{
__IOM uint8_t SPDCR; /*!< (@ 0x0000000B) SPI Data Control Register */

struct
{
    __IOM uint8_t SPFC   : 2;  /*!< [1..0] Number of Frames Specification                     */
    __IOM uint8_t SLSEL  : 2;  /*!< [3..2] SSL Pin Output Select                                    */
    __IOM uint8_t SPRDTD : 1;  /*!< [4..4] SPI Receive/Transmit Data Selection             */
    __IOM uint8_t SPLW   : 1;  /*!< [5..5] SPI Word Access/Halfword Access Specification    */
    __IOM uint8_t SPBYT  : 1;  /*!< [6..6] SPI Byte Access Specification                        */
    uint8_t              : 1;
} SPDCR_b;

};

So SPFC, SLSEL and SPBYT are missing from the datasheet.
What the heck Renesas?

The question really is, what else is missing or wrong?
I am going on a fishing trip for more information now, perhaps there is a newer datasheet for a different controller of the RA family that is more accurate.

@greiman
Copy link

greiman commented Jul 8, 2023

However, on the way I found out that the datasheet is not correct and that I need to set SPDCR to 0x040.

I also discovered the datasheet is incomplete and often wrong. There are even whole registers, not just bit fields that are not documented.

I am reduced to searching all of the FSP source for register names/addresses and bit fields.

I wrote a program to dump all the registers and use it to see what bits are used in the current SPI library.

I fought with 32-bit transfers because the bytes were not sent in the right order. I found an undocumented register, SPDCR2, that has an endian selection bit.

I also can find some definitions of fields by looking at datasheets for newer members of the RA family. UGH!

I now have array transfer working with 32-bit transfers. I can send a 228 byte array in 84µs, or about 2700 KB/Sec. Close enough to the 3000 KB/sec max for 24 MHz clock.

jul8

I should be able to do array send/receive since the time for one 32-bit transfer at 24MHz is 64 CPU cycles. Much better than 16 CPU cycle for 8-bit transfers.

@RudolphRiedel
Copy link
Contributor Author

I found an undocumented register, SPDCR2, that has an endian selection bit.

Unbelieveable.
The question with this one though then becomes if it actually exists.
I found a "Technical Update" for SPDCR that adds SPBYT while still not documenting the other two.
But for SPDCR2 there is not even a "Technical Update".
So it may or may not be implemented.

This really makes you wonder why Renesas is sending developers on a scavenger hunt instead of fixing the documents.
It's not like there hasn't been enough time since october 2019...

I am cleaning up now:
grafik

This is not really stable though so far.

Oh, one other thing, at least the original API needs to use 8 bit mode.
The reason is that for a buffer transfer that buffer might use a number of bytes that is not a multiple of 4 and the buffer might not even be aligned to 32 bit.
My main application is one of these examples, the EVE chips use a 3 byte address followed by different amounts of bytes.
So that 6 bytes from the trace is a 16 bit read operation.

I am using buffer transfers for 32 bit and this is faster over single byte transfers, even with 8 bit transfers.

DMA transfers also need to use 8 bit to support odd number of bytes and not aligned buffers, but then there shouldn't be any pauses left between the bytes.
But well, DMA is something for another day.

@greiman
Copy link

greiman commented Jul 8, 2023

But for SPDCR2 there is not even a "Technical Update".
So it may or may not be implemented.

It is implemented. I found where is is disabled in the current SPI library. When I enabled it the bytes were in the correct order.

Oh, one other thing, at least the original API needs to use 8 bit mode.

The API is void*. You just say how many bytes to transfer. I just send as much as possible as 32-bit transfers. The Cortex-M4 processor supports ARMv7 unaligned accesses so it works fine in non-DMA mode. I have done this on other MCUs.

I could transfer only the part that is on word boundaries but my plan is to transfer the max amount in 32-bit mode and remaining bytes in 8-bit mode.

I am implementing transfer(const void* txBuf, void* rxBuf, size_t count). This is best for SD cards.

Either txBuf or rxBuf may be nullptr to implement pure read or write.

When you write to a card you don't want to write over the sent data. When you read from a SD card you must send 0XFF bytes since the card looks for a possible command. For the Arduino array transfer I must copy data to a tmp buffer for write and fill the buffer with 0XFF for read.

transfer(void* buf, size_t count) is a simple call to the above.

void transfer(void* buf, size_t count) { transfer(buf, buf, count); }

@RudolphRiedel
Copy link
Contributor Author

But for SPDCR2 there is not even a "Technical Update".
So it may or may not be implemented.

It is implemented. I found where is is disabled in the current SPI library. When I enabled it the bytes were in the correct order.

This is not what I meant, it is possible that the register is not implemented in all revisions of the controller and even if we happen to have the most current one it may be gone in the next.

Oh, one other thing, at least the original API needs to use 8 bit mode.

The API is void*. You just say how many bytes to transfer. I just send as much as possible as 32-bit transfers. The Cortex-M4 processor supports ARMv7 unaligned accesses so it works fine in non-DMA mode. I have done this on other MCUs.
I could transfer only the part that is on word boundaries but my plan is to transfer the max amount in 32-bit mode and remaining bytes in 8-bit mode.

This is something to explore, for short buffers the overhead might be too much.
And the function needs to switch back to 8 bit or there will be extra overhead with every transfer(uint8_t data).

I am implementing transfer(const void* txBuf, void* rxBuf, size_t count). This is best for SD cards.

I have no idea why there even is transfer(void *buf, size_t count) in the standard API.
Setting the rxBuf to NULL is what I would want to do.

Which reminds me that I need to change transfer(void *buf, size_t count) from write-only to write/read.

@RudolphRiedel
Copy link
Contributor Author

I pushed out a first draft version:
https://github.com/RudolphRiedel/ArduinoCore-renesas/tree/main/libraries/SPI

This only works with SPI for now, not SCI.
And there are likely kinks, I only tested master operation for example.

@RudolphRiedel
Copy link
Contributor Author

RudolphRiedel commented Jul 8, 2023

Oh, as a side note, there is PlatformIO package now: https://github.com/maxgerhardt/platform-renesas.git
And "official": https://github.com/platformio/platform-renesas-ra

@RudolphRiedel
Copy link
Contributor Author

And here is the SPI trace for my first SPI.cpp draft:
EVE_SPI_Arduino_Uno_R4_buffer_transfer_spi_class_draft.zip

This is now down to 337µs for the buffer of 3+228 bytes and this is with no change to the API.
So transfer(void *buf, size_t count) is still blocking read/write but using direct register access and no interrupts.

How open is Arduino anyways in regards of additional functions?

@greiman
Copy link

greiman commented Jul 8, 2023

How open is Arduino anyways in regards of additional functions

Changing the standard API is probably a big thing. Adding a function for just the R4 core probable won't fly. That's why I haven't pushed for the one I like.

I got a first cut of my transfer(const void* txBuf, void* rxBuf, size_t count) working.

SdFat bench result:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1763.05,4294966590,280,282
1763.05,4294966577,280,280

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
1840.27,4294966565,268,271
1839.59,4294966565,268,271

The problem now is I am using the original single byte transfer to send commands and every byte takes 13µs. A 512 byte block takes 190µs or almost 2700 KB/sec.

I guess I need to grab your version of transfer(uint8_t data);

This already more than a factor of four improvement over the Standard library array transfer.

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
430.40,2650,634,1182
430.51,2638,634,1182

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
437.98,2615,611,1159
438.10,2615,611,1159

Here is the code. I have only tested it with SdFat.

void ArduinoSPI::transfer(const void* txBuf, void* rxBuf, size_t count) {
  size_t n32 = count / 4;
  if (n32) {
   // Save regs
    uint32_t spcr = _spi_ctrl.p_regs->SPCR;
    uint32_t spdcr = _spi_ctrl.p_regs->SPDCR;
    uint32_t spcmd0 = _spi_ctrl.p_regs->SPCMD[0];
    uint32_t spdcr2 = _spi_ctrl.p_regs->SPDCR2;

    _spi_ctrl.p_regs->SPCR = R_SPI0_SPCR_SPE_Msk | R_SPI0_SPCR_MSTR_Msk;
    _spi_ctrl.p_regs->SPDCR = R_SPI0_SPDCR_SPLW_Msk;
    _spi_ctrl.p_regs->SPDCR2 = 1;
    // 32-bit transfer - can't find a symbol for 2
    _spi_ctrl.p_regs->SPCMD[0] = (spcmd0 & ~R_SPI0_SPCMD_SPB_Msk) | (2 << R_SPI0_SPCMD_SPB_Pos);
    const uint32_t* tx32 = (const uint32_t*)txBuf;
    uint32_t* rx32 = (uint32_t*)rxBuf;
    size_t ir = 0;
    size_t it = 0;
    while (it < 2 && it < n32) {
      if (_spi_ctrl.p_regs->SPSR_b.SPTEF) {
        _spi_ctrl.p_regs->SPDR = txBuf ? tx32[it] : 0XFFFFFFFF;
        it++;
      }
    }
    while (it < n32) {
      if (_spi_ctrl.p_regs->SPSR_b.SPRF) {
        uint32_t spdr = _spi_ctrl.p_regs->SPDR;
        _spi_ctrl.p_regs->SPDR = txBuf ? tx32[it] : 0XFFFFFFFF;
        if (rxBuf) {
          rx32[ir] = spdr;
        }
        ir++;
        it++;
      }
    }
    while (ir < n32) {
      if (_spi_ctrl.p_regs->SPSR_b.SPRF) {
        uint32_t spdr = _spi_ctrl.p_regs->SPDR;
        if (rxBuf) {
          rx32[ir] = spdr;
        }
        ir++;
      }
    }
    _spi_ctrl.p_regs->SPCR = spcr;
    _spi_ctrl.p_regs->SPDCR = spdcr;
    _spi_ctrl.p_regs->SPCMD[0] = spcmd0;
    _spi_ctrl.p_regs->SPDCR2 = spdcr2;
  }
  if (count != 4 * n32) {
    uint8_t* rx = (uint8_t*)rxBuf;
    const uint8_t* tx = (const uint8_t*)txBuf;
    for (size_t i = 4 * n32; i < count; i++) {
      uint8_t tmp = transfer(txBuf ? tx[i] : 0XFF);
      if (rxBuf) {
        rx[i] = tmp;
      }
    }
  }
}

@RudolphRiedel
Copy link
Contributor Author

RudolphRiedel commented Jul 9, 2023

I picked up your idea of switching to 32 bit transfers and put it in transfer(void *buf, size_t count).
The overhead for switching to 32 bit transfers and back to 8 bit transfers is so low compared to the speed increase that even a 4 byte transfer is faster.

My buffer is transferred in 266µs now, that is another 20% faster.
And this is still with 8MHz to make things compareable.

SPI_UNO_R4_spi_class_draft.zip

grafik

grafik

This is a combination of a buffer transfer and two single byte transfers, this takes less time in total now than the pause between two single bytes transfers with the original SPI.cpp.
With the original SPI.cpp the complete transfer needs 72µs, this is a factor of 6.6 more time.

grafik

@greiman
Copy link

greiman commented Jul 9, 2023

How fast is your array transfer at 24 MHz. For SD cards the time for 512 bytes is the key number.

That will be what I will need to use in SdFat if Arduino accepts your mods. Unfortunately I need to copy data to a dummy buffer for write and fill the buffer with 0XFF for read so I will never get the max possible speed.

I am not even going to try getting my mods accepted. I have a number of mods for various boards that I provide to users who need high performance.

I am going to try making the 1-3 bytes a single transfer. The controller can transfer almost any number of bits.

I will need to use memcpy to fetch and store the bytes.

@RudolphRiedel
Copy link
Contributor Author

How fast is your array transfer at 24 MHz.

I have it running for about three hours now at 24 MHz and the bigger block needs about 117µs,
It would be faster in single duplex, but well.

I am going to try making the 1-3 bytes a single transfer. The controller can transfer almost any number of bits.

My block ends on three bytes, there is a pause of 680ns between 32 bit transfers, 920ns between the last 32 bit transfer and the first 8 bit transfer and 520ns between the 8 bit transfers.

So, about 240ns for the switch back from 32 to 8, perhaps 320ns with extra logic.
On 1 byte changing to 1-3 bytes will be a little slower.
On 2 bytes this may shave off 200ns.
And on 3 bytes this may shave off 720ns.

At some point one needs to give up on optimizing further. :-)

Switching over to DMA though would reduce the transfer time overall by about 39µs for the buffer I am transferring.
And of course the controller core would be free to do other things during the transfer time, that is a whole other dimension.

Even switching to write-only without DMA should make this about 13µs faster.

@greiman
Copy link

greiman commented Jul 9, 2023

I have it running for about three hours now at 24 MHz and the bigger block needs about 117µs,
It would be faster in single duplex, but well.

What is the time for the 228 byte transfer in full duplex? My function does the transfer in about 85µs.

With all overhead and the calls to micros() this code prints 96µs:

 uint8_t txBuf[228];
 uint8_t rxBuf[228];
...
  uint32_t m = micros();
  SPI.transfer(txBuf, rxBuf, sizeof(txBuf));
  m = micros() - m;
  Serial.println(m);

With 512 bytes it prints 202µs.

@RudolphRiedel
Copy link
Contributor Author

The 117µs is in full duplex of 3+228 bytes and with the default -Os.
But I am measuring with the logic-analyzer from CS low to CS high.

This prints 118:
uint8_t txBuf[228];

uint32_t m = micros();
SPI.transfer(txBuf, sizeof(txBuf));
m = micros() - m;
Serial.println(m);

And 120 for txBuf[231].

A quick mod to half-duplex brought this down to 94 but it stopped working correctly.

Interesting, why is my code using more time?

@greiman
Copy link

greiman commented Jul 9, 2023

Here are two shots from the 228 transfer:

The entire 228 bytes take 89.3µs CS low to CS high
jul9_all

Here are the first two 32-bit transfers. The time from the start of 4-byte transfers is 1.479µs. 228 bytes is 57 transfers.

1.479*57 = 84.3µs

jul9_bgn

I take advantage of the RX/TX double buffering by loading two 32-bit chunks in this loop:

    while (it < 2 && it < n32) {
      if (_spi_ctrl.p_regs->SPSR_b.SPTEF) {
        _spi_ctrl.p_regs->SPDR = txBuf ? tx32[it] : 0XFF;
        it++;
      }
    }

This means the shift register is loaded as soon as I read the data register since 32-bits are already in the TX buffer. I then load the TX buffer again.

    while (it < n32) {
      while (!_spi_ctrl.p_regs->SPSR_b.SPRF) {}
      uint32_t spdr = _spi_ctrl.p_regs->SPDR;   // Shift Register loads from TX buffer while this happens.
      _spi_ctrl.p_regs->SPDR = txBuf ? tx32[it] : 0XFF;  // Put next 4-bytes in TX buffer.
      if (rxBuf) {
        rx32[ir] = spdr;
      }
      ir++;
      it++;
    }

@RudolphRiedel
Copy link
Contributor Author

What happens when you send 4 bytes?
And how long are the pauses between the 32 bit transfers when sending several?

I tried something like this and ended up sending two 32 bit words instead of just the one when sending only 4 bytes.

I'll give this another try tomorrow.

@greiman
Copy link

greiman commented Jul 9, 2023

Every 32-bit transfer has the same timing. 1.31µs to send then a 167ns gap. I guess it is less than 167ns since clock needs to be low for about 21ns.

Here it is with markers:

jul9_time

There are 57 transfers like this. They all have the 167ns gap. The gap varies a tinny bit since the Salese has 500 MS/s so 2ns jitter is to be expected.

@greiman
Copy link

greiman commented Jul 11, 2023

You may have caught this. I get a warning rxbuf may be used uninitialized here.

I noticed two of your comments that hit home with me.

Who is actually driving Arduino forward and when will the 32 bit hardware that is beeing pushed for over 11 years (DUE) be actually utilized as intended?

There is #93 for example to provide DMA support and this one stopped getting comments almost 4 years ago.

I wrote a SPI DMA driver for Due more like 10 years ago and it was not accepted.

For Due I managed to slip DMA into SdFat in a way compatible with the Standard SPI Library.

Here is the difference for SD card I/O:

Arduino SPI lib:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
693.39,754,735,736

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
689.18,743,739,741

With my DMA for Due:

write speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
4533.09,2282,110,111

read speed and latency
speed,max,min,avg
KB/Sec,usec,usec,usec
4512.64,113,110,111

Hope your RA4M1 driver makes it.

@RudolphRiedel
Copy link
Contributor Author

You may have caught this. I get a warning rxbuf may be used uninitialized

I saw that and dismissed it but your comment made me look again.
And the issue was that a line from the SCI code was missing, so thank you for spotting this!

I noticed two of your comments that hit home with me.

Who is actually driving Arduino forward and when will the 32 bit hardware that is beeing pushed for over 11 years (DUE) be actually utilized as intended?

There is #93 for example to provide DMA support and this one stopped getting comments almost 4 years ago.

I wrote a SPI DMA driver for Due more like 10 years ago and it was not accepted.

That really is odd.
But by now there really should be an established API on what to do with DMA in Arduino.
Instead every third party implementation is doing it differently.

For Due I managed to slip DMA into SdFat in a way compatible with the Standard SPI Library.

I am implementing DMA left and right in my EVE library and also for Arduino targets. :-)
Mostly with the help of already existing functions to make things more compatible, like for the RP2040.
Or ESP-IDF for the ESP32.

Here is the difference for SD card I/O:

You really do not need to convince me. :-)
Although for me this is not about raw output but more about using the controller for whatever needs to done
instead of waiting for a buffer to be finished sending.
With DMA I can use a much lower SPI clock, prepare the data, put it up for sending, let the core do something else.
That the transfer is faster is a nice side effect. :-)

Hope your RA4M1 driver makes it.

Well, the SPI class for the R4 is so slow now, something needs to be done about this.
And the FSP library only is supplied as binary, so changing that is even less fun.

My hope would be that vendor libraries get more usefull beyond proof of concept.

@RudolphRiedel
Copy link
Contributor Author

I just did before and after measurements at 8MHz.

grafik
grafik

And out of curiosity I switched the SPI clock to 24MHz with the original FSP based class.
To my suprise it needs exactly the same time:
grafik

Yes, this is at the same scale of 100px = 50µs, the buffer is transferred in 652µs at 8MHz and 24MHz.
Same at 12MHz and 6MHz, 4MHz is a little slower with 654µs.
And yes, I checked the resulting SPI clock.

@greiman
Copy link

greiman commented Jul 12, 2023

And out of curiosity I switched the SPI clock to 24MHz with the original FSP based class.
To my suprise it needs exactly the same time:

This was one of the first things I noticed.

I did a bit of study of the SPI controller architecture and the FSP implementation with interrupts.

The TX buffer empty interrupt is faster than the RX complete interrupt. You would expect an overrun. However the config is set so the SPI clock is stretched, seven pulses then finally the eighth when the RX buffer is cleared. This holds off the TX empty interrupt since the RX buffer is full.

This bit is set in the SPI.begin() call.

   /* Enable SCK Auto Stop setting in order to prevent RX Overflow in Master Mode */
    spcr2 |= (uint32_t) (SPI_MODE_MASTER == p_ctrl->p_cfg->operating_mode) << R_SPI0_SPCR2_SCKASE_Pos;

Look carefully at on of my earlier posts for the clock stretching.

The clue was the SPI analyzer showed the time for a byte is really long because of clock stretching.

So the time only depends on interrupt times because of the TX and RX buffers plus clock stretching.

This was my clue I could load two TX transfers in the first loop of my optimized function. I put a delay between the first loop and the second loop and got clock stretching, not an RX buffer overrun.

@RudolphRiedel
Copy link
Contributor Author

I only saw that when I still tried to figure out what FSP is configuring differently to allow the SPI to work at all.
Renesas really needs to put some more effort into FSP.

@greiman
Copy link

greiman commented Jul 12, 2023

Here is the place where the the clock stretching bit allows FSP to work with interrupts.
It provides back pressure in TX empty.

RenesasSPI

Edit: I was working on this. didn't see you post above.

@greiman
Copy link

greiman commented Jul 12, 2023

My hope would be that vendor libraries get more useful beyond proof of concept.

Boy this is universal I have worked with the top four of these and their software is unusable.

Top 5 MCU

That together with Arduino's policy means lot of talent that could make Arduino better goes to waste.

@RudolphRiedel
Copy link
Contributor Author

NXP, yeah, I had some fun with S32K.
Infineon has an Arduino core for the XMC1100 and I am not a fan.
Microchip, I have no real hands-on experience but I kind of like their Harmony library, at least when compared to ASF from Atmel, unfortunately they decided to make the ATSAM includes incompatible when moving to MPLAB-X...
https://github.com/Microchip-MPLAB-Harmony/csp_apps_sam_d5x_e5x

And with ST I had the least issues, I am mostly using the LL HAL functions and that stuff is on point:

__STATIC_INLINE void LL_SPI_TransmitData8(SPI_TypeDef *SPIx, uint8_t TxData)
{
#if defined (GNUC)
__IO uint8_t *spidr = ((__IO uint8_t *)&SPIx->DR);
*spidr = TxData;
#else
*((__IO uint8_t )&SPIx->DR) = TxData;
#endif /
GNUC */
}

The only "hickup" I found so far is that for the STM32H7 it is
LL_SPI_IsActiveFlag_RXWNE()
while for every other family it is
LL_SPI_IsActiveFlag_RXNE()

I do not actually use STM32 controllers though as these are not automotive qualified.
But I was planning to play more with STM32 and then the UNO R4 was released. :-)

@greiman
Copy link

greiman commented Jul 12, 2023

I do not actually use STM32 controllers though as these are not automotive qualified.
But I was planning to play more with STM32 and then the UNO R4 was released. :-)

Too bad you don't like RTOSs. I guy that works or did work for ST did an RTOS and HAL for many STM32 processors. DMA is available for almost everything but you have a choice.

Here is the ChibOS site. The tiny kernel I ported to AVR Arduinos, Due and Teensy is his.

I have a huge collections of the ST NUCLEO boards. I evaluate processors on these then find the best board for my project. Unlike Arduino, he encourages users to contribute. He has a section for community supported processors from other vendors.

I did a data logger that logged data to a SD at 2 million samples per second using his DMA ADC and DMA SPI. The STM32 ADC is amazing. you can setup a list of channel and the DMA controller automatically processes it.

Edit: Here is what he supports in the HAL for popular STM32 chips:

Driver models for: ADC, CAN, DAC, EXT, GPT, I2C, I2S, ICU, MAC, MMC, PAL, PWM, RTC, SDC, Serial, UART, USB, USB-CDC, SPI, ST, WDG.

@greiman
Copy link

greiman commented Jul 12, 2023

Here is a list of boards with ChibiOS examples:

CMSIS-STM32F407-DISCOVERY
HAL-STM32F407-DISCOVERY
NASA-OSAL-STM32F407-DISCOVERY
NASA-OSAL-STM32F746G-DISCOVERY
NIL-STM32F051-DISCOVERY
NIL-STM32F100-DISCOVERY
NIL-STM32F303-DISCOVERY
NIL-STM32F373-STM32373C_EVAL
NIL-STM32F746G-DISCOVERY
NIL-STM32G071RB-NUCLEO64
NIL-STM32G474RE-NUCLEO64
NIL-STM32H755ZI-NUCLEO144
NIL-STM32L011K4-NUCLEO32
NIL-STM32L152-DISCOVERY
NIL-STM32L476-DISCOVERY
RT-STM32-LWIP-FATFS-USB
RT-STM32-LWIP-FATFS-USB-HTTPS
RT-STM32F030R8-NUCLEO64
RT-STM32F031K6-NUCLEO32
RT-STM32F042K6-NUCLEO32
RT-STM32F051-DISCOVERY
RT-STM32F070RB-NUCLEO64
RT-STM32F072-DISCOVERY
RT-STM32F072RB-NUCLEO64
RT-STM32F091RC-NUCLEO64
RT-STM32F100-DISCOVERY
RT-STM32F103-MAPLEMINI
RT-STM32F103-OLIMEX_STM32_P103
RT-STM32F103-OLIMEX_STM32_P103-FATFS
RT-STM32F103-STM3210E_EVAL-FATFS-USB
RT-STM32F103RB-NUCLEO64
RT-STM32F107-OLIMEX_P107-LWIP
RT-STM32F302R8-NUCLEO64
RT-STM32F303-DISCOVERY
RT-STM32F303K8-NUCLEO32
RT-STM32F303RE-NUCLEO64
RT-STM32F303ZE-NUCLEO144
RT-STM32F334-DISCOVERY
RT-STM32F334R8-NUCLEO64
RT-STM32F373-STM32373C_EVAL
RT-STM32F401C-DISCOVERY
RT-STM32F401RE-NUCLEO64
RT-STM32F407-DISCOVERY
RT-STM32F407-DISCOVERY-G++
RT-STM32F410RB-NUCLEO64
RT-STM32F411RE-NUCLEO64
RT-STM32F412ZG-NUCLEO144
RT-STM32F413ZH-NUCLEO144
RT-STM32F429-DISCOVERY
RT-STM32F429ZI-NUCLEO144
RT-STM32F446RE-NUCLEO64
RT-STM32F446ZE-NUCLEO144
RT-STM32F469I-DISCOVERY
RT-STM32F469I-EVAL-SDP-CK1Z
RT-STM32F722ZE-NUCLEO144
RT-STM32F746G-DISCOVERY
RT-STM32F746ZG-NUCLEO144
RT-STM32F756ZG-NUCLEO144
RT-STM32F767ZI-NUCLEO144
RT-STM32F769I-DISCOVERY
RT-STM32G031K8-NUCLEO32
RT-STM32G071RB-NUCLEO64
RT-STM32G0B1RE-NUCLEO64
RT-STM32G431RB-NUCLEO64
RT-STM32G474RE-DISCOVERY-DPOW1
RT-STM32G474RE-NUCLEO64
RT-STM32G474RE-NUCLEO64-SB_HOST_DYNAMIC
RT-STM32G474RE-NUCLEO64-SB_HOST_STATIC
RT-STM32H723ZG-NUCLEO144
RT-STM32H735IG-DISCOVERY
RT-STM32H743ZI_REV_XY-NUCLEO144
RT-STM32H750XB-DISCOVERY
RT-STM32H755ZI-NUCLEO144
RT-STM32H755ZI_M4-NUCLEO144
RT-STM32H7A3ZI_Q-NUCLEO144
RT-STM32L031K6-NUCLEO32
RT-STM32L053-DISCOVERY
RT-STM32L053R8-NUCLEO64
RT-STM32L073RZ-NUCLEO64
RT-STM32L152-DISCOVERY
RT-STM32L152RE-NUCLEO64
RT-STM32L432KC-NUCLEO32
RT-STM32L452RE-NUCLEO64-P
RT-STM32L476-DISCOVERY
RT-STM32L476RG-NUCLEO64
RT-STM32L496ZG-NUCLEO144
RT-STM32L4P5ZG-NUCLEO144
RT-STM32L4R5ZI-NUCLEO144
RT-STM32L4R9-DISCOVERY
RT-STM32L552ZE-NUCLEO144
RT-STM32L552ZE-NUCLEO144-TZ_GUEST
RT-STM32L552ZE-NUCLEO144-TZ_HOST
RT-STM32MP157A-DK1
RT-STM32WB55CG-NUCLEO48_USB
RT-STM32WB55RG-NUCLEO68
RT-STM32WL55JC-NUCLEO64

@RudolphRiedel
Copy link
Contributor Author

Well, it's not so much that I do not like RTOS, it is more that I have no use for an extra software layer, especially not on a single core and relatively slow mikro-controller.
I have not found the problem that I would want to solve with for example ChibiOS.
This is however as fun as it is off-topic. :-)

@greiman
Copy link

greiman commented Jul 12, 2023

I have not found the problem that I would want to solve with for example ChibiOS.

Guess I worked in a different world. Fast for this experiment means almost a billion events per second with a combined data volume of more than 60 million megabytes per second from a million channels filtered by hardware triggers to where a 40,000 core processor farm can select 1,000 interesting events per second.

In this world RTOSs have been key since the late 1980s. Even for "slow single core MCUs".

I did early work on the architecture of the network between the detector and the farm by simulation of a huge Clos network of Ethernet switches. This idea was finally used.

I am a physicist, not an engineer but I like to play with fast MCUs in my hobby.

@RudolphRiedel
Copy link
Contributor Author

Guess I worked in a different world.

Yes, I would say so, I am a bit pusher in the embedded world of automotive and for the most part did small CAN and LIN nodes with rather limited resources. :-)

@greiman
Copy link

greiman commented Jul 12, 2023

You may see an RTOS I was involved with in the auto world. In about 1983 Physics funding dried up and I left the lab to lead a project moving American Express from paper to electronic data. We needed to image and OCR paper until terminals for credit cards were installed. I wanted an RTOS to control cameras installed on fast card sorters.

Two other lab employees, Dave Wilner and Jerry Fiddler, founded Wind River Systems. Each contributed $3,000 and a desk to the business and took the public RTOS code from the lab and developed VxWorks. I took a chance and funded development. They became very wealthy and I went back to physics.

VxWorks went to Mars in the 1990s and is now is in aerospace and and many other safety critical systems.

I can't use it now, $18,000 a seat. Intel owns Wind River now so knowing Jerry and Dave doesn't help.

@aentinger
Copy link
Contributor

Dear @greiman and @RudolphRiedel ☕ 👋

Fascinating as I may find to observe your spirited discussion, may I ask you to confine this conversation to the issue at hand? Perhaps you could exchange email addresses and discuss your non-SPI-related topics somewhere else🙏 😉 🙇 .

@greiman
Copy link

greiman commented Jul 13, 2023

I may not have the latest version so ignore this if this is not a problem.

The SPCR2_SCKASE bit seems to no longer be set so transfer(buf, count) now hangs if a interrupt happens and takes too long.

To see this, add a delayMicroseconds() like this. I used 24MHz clock.

            while ((index_tx < 2U) && (index_tx < n32)) {
                if (_spi_ctrl.p_regs->SPSR_b.SPTEF) {
                    _spi_ctrl.p_regs->SPDR = buffer32[index_tx];
                    index_tx++;
                }
            }
delayMicroseconds(10);   // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
            while (index_tx < n32) {
                if (_spi_ctrl.p_regs->SPSR_b.SPRF) {
                    uint32_t tmp = _spi_ctrl.p_regs->SPDR;
                    _spi_ctrl.p_regs->SPDR = buffer32[index_tx];
                    buffer32[index_rx] = tmp;
                    index_rx++;
                    index_tx++;
                }
            }

It will hang for an array transfer long enough to cause an RX overrun.

Set SPCR2_SCKASE like this and it will work even with the delayMicroseconds().


        if (n32 != 0U) {
            _spi_ctrl.p_regs->SPCR_b.SPE = 0; /* disable SPI unit */
            _spi_ctrl.p_regs->SPDCR = R_SPI0_SPDCR_SPLW_Msk; /* SPI word access */
            _spi_ctrl.p_regs->SPCMD_b[0].SPB = 2; /* spi bit width = 32 */
 _spi_ctrl.p_regs->SPCR2_b.SCKASE = 1;  // <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<        
            _spi_ctrl.p_regs->SPCR_b.SPE = 1; /* enable SPI unit */

Too bad the API that is needed for fast SD read/write is not allowed. SD on R4 boards are really fast with it.

Write: 2243 KB/sec
Read: 2262 KB/sec

I will use transfer(buf, count) but it requires memcpy, tmp buffers and filling arrays with 0XFF. Much of the speed is lost.

I will provide users mods to use it if they need speed. I already do it for Due, SAMD and other boards.


Fascinating as I may find to observe your spirited discussion, may I ask you to confine this conversation to the issue at hand? Perhaps you could exchange email addresses and discuss your non-SPI-related topics somewhere else

Sorry about going off topic.

@greiman
Copy link

greiman commented Jul 13, 2023

I am not sure of the accuracy of any of the above measurements using micros(). micros() seems to have a bug.

Here is the problem:

void setup() {
  Serial.begin(9600);
  while(!Serial) {}
  Serial.println("bgn, end, bgn - end");
  for (uint32_t i = 0; i < 10000; i++) {
    uint32_t bgn = micros();
    uint32_t end = micros();
    if (bgn > end) {
      Serial.print(bgn);
      Serial. print(',');
      Serial.print(end);
      Serial.print(',');
      Serial.println(bgn - end);
    }
  }
}
void loop() {
}

Here is the print out for the first few cases with bgn > end.

bgn, end, bgn - end
25992,25000,992
43992,43000,992
65993,65001,992
83994,83002,992

@RudolphRiedel
Copy link
Contributor Author

The SPCR2_SCKASE bit seems to no longer be set so transfer(buf, count) now hangs if a interrupt happens and takes too long.

Yes, SPCR2_SCKASE is no longer set.
Hmm, I had this up and running for several hours now and the measurements occasionally showed pauses in the middle of transfers which I attributed to interrupts.
I added the delayMicroseconds(10); and it stopped working.
I added
_spi_ctrl.p_regs->SPCR2 = R_SPI0_SPCR2_SCKASE_Msk; in configSpi() and it works even with the delay.

I even went as far as putting a delay(5); in the (index_tx < n32) loop and it works fine, although very slow.

Very nice, thank you for spotting this!
And the performance with this bit is the same as without and expected depending on the SPI clock.

@WestfW
Copy link

WestfW commented Jul 14, 2023

Hmm. Has anyone looked at the "Data Transfer Controller" ("DTC") ?
It seems to be a sort of "simplified DMA controller" that might work nicely for SPI.

@WestfW
Copy link

WestfW commented Jul 14, 2023

"accuracy of any of the above measurements using micros(). micros() seems to have a bug."

The chip has the DWT ARM component, so there's a cycle counter that is pretty easy to access...

@RudolphRiedel
Copy link
Contributor Author

Hmm. Has anyone looked at the "Data Transfer Controller" ("DTC") ? It seems to be a sort of "simplified DMA controller" that might work nicely for SPI.

I have now and I would not call it "simplified", at least not in regards of using the unit with registers not directly accessible.
If the DTC actually can work with SPI is something I do not want to explore as it's limitation to 256 data units would mean that my application has to transfer multiple blocks.

I am looking at the DMAC now and it can only transfer blocks of 1 to 1024 data units.
This means I could barely make this work for my application, sending the first three bytes without DMA and then a block of up to 1022 32 bit words with DMA.
Well, maybe the overhead for sending a buffer in blocks of 1024 bytes + n<1024 bytes is not so bad, this is something to explore.
This won't be happening in SPI.cpp though, at least not in the official Arduino core, unfortunately.

@greiman
Copy link

greiman commented Jul 14, 2023

@WestfW

The chip has the DWT ARM component, so there's a cycle counter that is pretty easy to access...

Thanks, I know how to use it. I will also use it a lot on R4 boards since a call to micros() currently takes 8-9µs

That's why the difference above is 992µs . If you look at micros() you will see an interrupt protection bug that results in one extra tick of the millis clock in the beginning call to micros(). So off by an extra 1000µs happens maybe about once in 10000-20000 times.

Of course you could have an extra tick in the end call. I didn't test for that.

Edit: See this. There is an extra 1000µs in both calls. It also happens far more often than I suspected - surprised it was not found by Arduino.

So intervals measured by micros are +- more than 1000µs. More because of time for call.

@greiman
Copy link

greiman commented Jul 14, 2023

This could be useful in traces of SPI performance.

Here is an interesting post for fast bare-metal GPIO on RA4M1 to put markers on a scope or logic analyzer channel. I was about to work this out but she has already done it. 83 ns or four 20.83 ns clocks to set or clear a pin value.

Arduino post:
https://forum.arduino.cc/t/digitalwritefast-with-uno-r4/1145206/26

GitHub file:
https://github.com/TriodeGirl/Arduino_Direct_Register_Operations

Looks like she is testing the ADC on R4. A big interest for me.

@RudolphRiedel
Copy link
Contributor Author

That is one odd GPIO module after working with ATSAM for a while now.

@RudolphRiedel
Copy link
Contributor Author

@aentinger

I've verified your findings and raised this issue (missing documentation in datasheet vis-a-vis FSP driver) with Renesas.
Hopefully there'll be an update or at least an errata.

I just remembered this open point, any news from Renesas regarding this?

@aentinger
Copy link
Contributor

Hi @RudolphRiedel ☕ 👋

I've relayed your request (that I happen to share) to our contacts at Renesas already in July, when the topic first came up. The request was received positively but I was informed that delivering an updated datasheet may be delayed due to the holiday season.

I have just restated my request to Renesas (the initial CC'd request had never left my inbox - so I'm not going to forget this) and hope I'll get a positive reply sometime soonish.

All the best, Alex

cristidragomir97 pushed a commit to cristidragomir97/ArduinoCore-renesas that referenced this issue May 20, 2024
Fix: SPI not re-initialized when `end()` being called due to SPI timeout.
Former-commit-id: 2fb696c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants