Skip to content

Serial[x].print/write - Could use some major performance enhancement. #44

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
KurtE opened this issue Jul 10, 2023 · 13 comments
Open

Serial[x].print/write - Could use some major performance enhancement. #44

KurtE opened this issue Jul 10, 2023 · 13 comments
Labels
topic: code Related to content of the project itself type: enhancement Proposed improvement

Comments

@KurtE
Copy link
Contributor

KurtE commented Jul 10, 2023

It appears like any calls to Serial.print(), including Serial.println and likewise Serial.write(buffer, cnt), will remain in these calls, until the last byte is queued to the underlying UART.

Here is a simple sketch:

void setup() {
  Serial1.begin(115200); 
  pinMode(2, OUTPUT);
  pinMode(3, OUTPUT);
  pinMode(4, OUTPUT);
  pinMode(5, OUTPUT);
}

uint32_t loop_count = 0;
void loop() {
  digitalWrite(5, HIGH);
  Serial1.print("Hello World...");
  digitalWrite(5, LOW);
  digitalWrite(5, HIGH);
  Serial1.println(loop_count++, DEC);
  digitalWrite(5, LOW);
  delay(10);
  digitalWrite(5, HIGH);
  Serial1.write("Buffer\n", 7);
  digitalWrite(5, LOW);
  delay(1000);  
}

Note, I instrumented the UART::write methods in Serial.cpp:

/* -------------------------------------------------------------------------- */
size_t UART::write(uint8_t c) {
/* -------------------------------------------------------------------------- */  
  if(init_ok) {
    tx_done = false;
    digitalWrite(2, HIGH);
    R_SCI_UART_Write(&uart_ctrl, &c, 1);
    digitalWrite(3, HIGH);
    while (!tx_done) {}
    digitalWrite(3, LOW);
    digitalWrite(2, LOW);  
    return 1;
  }
  else {
    return 0;
  }
}

size_t  UART::write(uint8_t* c, size_t len) {
  if(init_ok) {
    tx_done = false;
    digitalWrite(2, HIGH);
    R_SCI_UART_Write(&uart_ctrl, c, len);
    digitalWrite(3, HIGH);
    while (!tx_done) {}
    digitalWrite(3, LOW);
    digitalWrite(2, LOW);
    return len;
  }
  else {
    return 0;
  }
}

Likewise the callback method:

/* -------------------------------------------------------------------------- */
void UART::WrapperCallback(uart_callback_args_t *p_args) {
/* -------------------------------------------------------------------------- */  

  uint32_t channel = p_args->channel;
  
  UART *uart_ptr = UART::g_uarts[channel];

  if(uart_ptr == nullptr) {
    return;
  }
  

  digitalWrite(4, HIGH);

  switch (p_args->event){
      case UART_EVENT_ERR_PARITY:
      case UART_EVENT_ERR_FRAMING:
      case UART_EVENT_ERR_OVERFLOW:
      case UART_EVENT_RX_COMPLETE: // This is called when all the "expected" data are received
      {
          break;
      }
      case UART_EVENT_TX_COMPLETE:
      case UART_EVENT_TX_DATA_EMPTY:
      {
        //uint8_t to_enqueue = uart_ptr->txBuffer.available() < uart_ptr->uart_ctrl.fifo_depth ? uart_ptr->txBuffer.available() : uart_ptr->uart_ctrl.fifo_depth;
        //while (to_enqueue) {
        uart_ptr->tx_done = true;
        break;
      }
      case UART_EVENT_RX_CHAR:
      {
        if (uart_ptr->rxBuffer.availableForStore()) {
          uart_ptr->rxBuffer.store_char(p_args->data);
        }
        break;
      }
      case UART_EVENT_BREAK_DETECT:
      {
          break;
      }
  }

  digitalWrite(4, LOW);
}

Here are some images of the Logic analyzer output.
image

image

Observations:

  1. it appears like in all cases, the write method: size_t UART::write(uint8_t c)
    is being called, I would have expected the version: size_t UART::write(uint8_t* c, size_t len)
    to be called for the print of a string as well as the simple write of a buffer.

  2. I would have expected that there would be some form of software FIFO queue, like is on most of the other platforms, and only have to wait if you are writing more data than will fit. Would also expect that availableForWrite)_ would be implemented to give you a clue of how much I can write.

  3. Hardware FIFO? I believe the first two UARTS have 16 byte queues? If so why does it look like only one bytes at a type is sent...
    Maybe part of 1).

  4. Why are the write methods waiting at the end at all? Especially when they are waiting for an ISR to be called to set the tx_done?
    The only wait that should happen is if the hardware and/or software queues are full at the start of the call, should they wait for space to become available. And this waiting code should understand and deal with if the code calling write has a higher priority than the Serial UART ISR handler, but that is a different issue Serial.print() - from Interrupt callbacks hangs the processor.  #38

Maybe more as I try to better understand all of the layers of this code.

@KurtE
Copy link
Contributor Author

KurtE commented Jul 11, 2023

  1. it appears like in all cases, the write method: size_t UART::write(uint8_t c)
    is being called, I would have expected the version: size_t UART::write(uint8_t* c, size_t len)
    to be called for the print of a string as well as the simple write of a buffer.
    Did a quick check. This appears to be because the methods parameters are not specified correctly.
    image
    ditto for the .cpp file.
    image
    And it looks like write is only called once. And the ISR is only called twice, so it looks like it probably did store everything into the FIFO. Although I would have expected the call to have returned quickly... So looks like the next rabbit hole to go down.

@aentinger
Copy link
Contributor

Well, that's a bummer. Thank you @KurtE for finding and reporting this.

@facchinm take a look here. There is in fact a safe ring buffer object for transmission but it's not used inside the driver ...

@facchinm
Copy link
Member

Mmmh I don't think I'm understanding the issue... write() is a blocking API, so it must wait the last byte being sent before returning. Did I lose anything on this thread?

@aentinger
Copy link
Contributor

aentinger commented Jul 11, 2023

Basically any call to print/println ends up calling write (see here), so if write is blocking, print is blocking too.

If that's the desired behaviour, fine by me (possibly not fine for our performance-hungry customers though? 😉 ), but then let's remove the unused txBuffer object.

@KurtE
Copy link
Contributor Author

KurtE commented Jul 11, 2023

@facchinm,

Mmmh I don't think I'm understanding the issue... write() is a blocking API, so it must wait the last byte being sent before returning. Did I lose anything on this thread?

Sorry, but I believe that is completely wrong. All of the output functions, like Serial1.write(char), Serial1.write(buffer, cnt), Serial1.print(), Serial1.println(), all get funneled to the two virtual methods. That is:

class UART : public arduino::HardwareSerial
class HardwareSerial : public Stream
class Stream : public Print

Where Print class has:

    virtual size_t write(uint8_t) = 0;
    size_t write(const char *str) {
      if (str == NULL) return 0;
      return write((const uint8_t *)str, strlen(str));
    }
    virtual size_t write(const uint8_t *buffer, size_t size);
    size_t write(const char *buffer, size_t size) {
      return write((const uint8_t *)buffer, size);
    }

If you look at your AVR version of the simple virtual function that outputs one char, it starts off like:

size_t HardwareSerial::write(uint8_t c)
{
  _written = true;
  // If the buffer and the data register is empty, just write the byte
  // to the data register and be done. This shortcut helps
  // significantly improve the effective datarate at high (>
  // 500kbit/s) bitrates, where interrupt overhead becomes a slowdown.
  if (_tx_buffer_head == _tx_buffer_tail && bit_is_set(*_ucsra, UDRE0)) {
    // If TXC is cleared before writing UDR and the previous byte

This method, more or less just stuffs the character into software fifo queue... It has some extra stuff, that if the queue is empty and the hardware port output register is empty, it will simply output the data directly to the register... Plus this code will detect if the queue is full and wait until there is room to put that character into the queue. Also it includes the detection that the call was done while interrupts were disabled and deal with it... Last part is another open issue I mentioned earlier.

Note: to help a sketch to not have the SerialX.write call to not have to wait, the Stream class has another virtual method:
availableForWrite(), which should return how many bytes I can output without having to wait.
Note: This is not implemented in your UART class, so I think resorts to default implementation which returns 0.

If the user wanted to wait until the output was complete, there is another method: flush(), which should do that.
I am also pretty sure this does not work for a couple of reasons:

/* -------------------------------------------------------------------------- */
void UART::flush() {
/* -------------------------------------------------------------------------- */  
  while(txBuffer.available());
}

The first is that the current code is not using the txBuffer.
The second is, that this code would return as soon as your txBuffer was empty. Where as the flush is not supposed to return until all of the data has been shifted out, and if I understand correctly you might have up to 16 bytes of data still in the hardware FIFO queue.

Sorry I am just learning this code base and hardware. But I believe that with all of these layers, you would need to know if you are then into the TX_COMPLETE state...

/* -------------------------------------------------------------------------- */
void UART::WrapperCallback(uart_callback_args_t *p_args) {
/* -------------------------------------------------------------------------- */  
...
  switch (p_args->event){
...
      case UART_EVENT_TX_COMPLETE:
      case UART_EVENT_TX_DATA_EMPTY:
      {
        //uint8_t to_enqueue = uart_ptr->txBuffer.available() < uart_ptr->uart_ctrl.fifo_depth ? uart_ptr->txBuffer.available() : uart_ptr->uart_ctrl.fifo_depth;
        //while (to_enqueue) {
        uart_ptr->tx_done = true;
        break;
      }
      case UART_EVENT_RX_CHAR:
...

Note: I am not sure you are going to want to process those two states in the same way. That is maybe in both cases if your FIFO is not empty you would want to queue up new data to the hardware. But for Flush you only want to return after the
UART_EVENT_TX_COMPLETE has been received... Otherwise code like:

  digitalWrite(cs_pin, LOW);
  Serial1.write(...); // output message to serial1
  ...
  Serial1.flush();
  digitalWrite(cs_pin, HIGH);

will break. Currently some of these issues are masked by other issues.

Hope this makes sense, and that I was not too verbose.
Kurt

@KurtE
Copy link
Contributor Author

KurtE commented Jul 12, 2023

Quick update: I did some more hacking on the Serial.h/cpp files to get a better understanding of what is going on. Also learned a few things:

For example: I naively thought that UART1 would be SCI0 and UART2 would be SCI1 and as such would have FIFO queues. Nope: found out that they use SCI2 and SCI9... So no FIFO...

Also found that my fix for UART::Write(const ... really showed up problems..
Like I temporarily put in method to get the pointer to the SCI registers... And then tried printing some of them:

void PrintSerialInfo(const char *title, UART &uart) {
  R_SCI0_Type *pregs = uart.get_p_reg();
  Serial.println(title); 
  
  uart.printDebugInfo(&Serial);
  Serial.print("Pregs: "); Serial.println((uint32_t)pregs, HEX); 
  Serial.print("SMR: "); Serial.println(pregs->SMR, HEX); //Serial.flush();
  Serial.print("SCR: "); Serial.println(pregs->SCR, HEX); //Serial.flush();
  Serial.print("SSR: "); Serial.println(pregs->SSR, HEX); //Serial.flush();
  Serial.print("SSR_FIFO: "); Serial.println(pregs->SSR_FIFO, HEX); //Serial.flush();
  Serial.print("FCR: "); Serial.println(pregs->FCR, HEX); //Serial.flush();
  Serial.print("FDR: "); Serial.println(pregs->FDR, HEX); //Serial.flush();
  Serial.print("SPTR: "); Serial.println(pregs->SPTR, HEX); //Serial.flush();

}

Characters were being tossed left and right. Or more particularly that a character or two might be lossed per Serial.print/write..
Including for example if you did Serial.println(), the ln turn into another call and so you might lose the the last character(s)...

For my current running, I have completely neutered the UART::write calls (at least for these). And did similar to your writeRaw call, except instead of wait for each write to complete, I wait at the start for the TDRE condition...

/* -------------------------------------------------------------------------- */
size_t UART::write(uint8_t c) {
/* -------------------------------------------------------------------------- */  
  if(init_ok) {
    // KurtE Hack start
    if (channel > 1) return write(&c, 1);
    // KurtE Hack end
    tx_done = false;
    digitalWrite(2, HIGH);
    R_SCI_UART_Write(&uart_ctrl, &c, 1);
    digitalWrite(3, HIGH);
    while (!tx_done) {}
    digitalWrite(3, LOW);
    digitalWrite(2, LOW);  
    return 1;
  }
  else {
    return 0;
  }
}

size_t  UART::write(const uint8_t* c, size_t len) {
  if(init_ok) {
    // KurtE Hack start
    if (channel > 1) {
      // like raw but instead look for room before storing.
      digitalWrite(2, HIGH);
      size_t i = 0;
      while (i < len) {
        while (uart_ctrl.p_reg->SSR_b.TDRE == 0) {} // wait until transmit data register is not empty
        uart_ctrl.p_reg->TDR = *c++;
        i++;
      }
      digitalWrite(2, LOW);
    } else {
    // KurtE Hack end

      tx_done = false;
      digitalWrite(2, HIGH);
      R_SCI_UART_Write(&uart_ctrl, c, len);
      digitalWrite(3, HIGH);
      while (!tx_done) {}
      digitalWrite(3, LOW);
      digitalWrite(2, LOW);
    } // Kurte
    return len;
  }
  else {
    return 0;
  }
}

Flush was still broken, so I hacked that as well:

/* -------------------------------------------------------------------------- */
void UART::flush() {
/* -------------------------------------------------------------------------- */  
  // KurtE
  if (channel > 1) {
    // wait for the tend condition
    while ((uart_ctrl.p_reg->SSR_b.TEND == 0) || (uart_ctrl.p_reg->SSR_b.TDRE == 0)) {}
  } else {
    while(txBuffer.available());
  }
}

Note the code changes completely don't use interrupts, so won't hang if called in an ISR.

The loop code is only slightly changed to add a flush()

void loop() {
  digitalWrite(5, HIGH);
  Serial1.print("Hello World...");
  digitalWrite(5, LOW);
  digitalWrite(5, HIGH);
  Serial1.println(loop_count++, DEC);
  digitalWrite(5, LOW);
  delay(2);
  digitalWrite(5, HIGH);
  Serial1.write("Buffer\n", 7);
  digitalWrite(5, LOW);
  digitalWrite(5, HIGH);
  Serial1.flush();  // tell it to wait. 
  digitalWrite(5, LOW);
  delay(1000);  
}

Which appears to work.

image

These current things are up in my own fork/branch(SerialX_fixes) if anyone wishes to play along.

Next up see how to use your TX Queue.

And then decide if I will do a PR... Still uncomfortable to sign any legal document, which states that this is my Original work... As none of this is overly original... But that is a different topic.

@paulvha
Copy link
Contributor

paulvha commented Jul 22, 2023

KurtE makes very good points about the UART performance. Having worked on many different boards, keep it simple, I would neglect the availability of the FIFO in SC0 and SC1 (just disable it) and always use a TX ring buffer approach. That is what I see many other board libraries do as well: keep clean, easy to maintain, and straightforward.

@KurtE
Copy link
Contributor Author

KurtE commented Jul 23, 2023

@paulvha - thanks for the feedback.

I also have worked on lots of different boards as well and have implemented or updated or suggested updates, to the Serial/UART processing in several of them including Teensy, Trossen, Robotis...

I have a Draft PR (#59) which so far, I have only touched the TX side, and it is now using the ring buffer that was already built into the __ UART __ objects.

I still try to make use of the FIFO if available. In the right places, it should considerably cut down on the number of interrupts needed by the device. That is for example, if the user code outputs, several lines of output, like 256 characters, without the FIFO, the device will generate something like 256 interrupts. With the fifo maybe only something like 16, which is a good win.

And on the RX side, I like having FIFO in place as to reduce the likelihood of losing RX data. On the WIFI board only the connection between the board and the ESP32 for Serial2 (Wifi and BT) is on SCI1, which has the FIFO, and hard to experiment as much on that connection. But I can create a UART object on pins 18,19, which is on SCI0. Which so far my changes appear to be working.

I have put some of my work(play) on hold, waiting for some feedback on things like is it allowed to bypass the FSP code? If so should I try to totally bypass it for the UART code, or is it OK to pick and choose.

But i the mean time I might experiment, with talking to Dynamixel servos at 1MBS maybe higher if supported.

Thanks again for you input.

@paulvha
Copy link
Contributor

paulvha commented Jul 24, 2023

Fully agree that a TX-buffer approach should be applied. I understand that you prefer to continue to use the HW FIFO, still I wonder how much time benefit that will bring and rather go for simplicity and a common UART approach... but it is a choice :-)

Changing Serial (D1 = P302 & D0 = P301) to use SCI0 instead of SCI2 does not require a change in FSP code.

It is defined in the file : variants/UNOWIFIR4/pinmux.inc
line 105 change

SCI_CHANNEL|PIN_TX_MOSI_SDA|CHANNEL_2|SCI_EVEN_CFG|LAST_ITEM_GUARD
to
SCI_CHANNEL|PIN_TX_MOSI_SDA|CHANNEL_0|SCI_EVEN_CFG|LAST_ITEM_GUARD

and line 111 change

SCI_CHANNEL|PIN_RX_MISO_SCL|CHANNEL_2|SCI_EVEN_CFG,
to
SCI_CHANNEL|PIN_RX_MISO_SCL|CHANNEL_0|SCI_EVEN_CFG,

@KurtE
Copy link
Contributor Author

KurtE commented Jul 24, 2023

Thanks for the feedback:

Fully agree that a TX-buffer approach should be applied. I understand that you prefer to continue to use the HW FIFO, still I >wonder how much time benefit that will bring and rather go for simplicity and a common UART approach... but it is a choice :-)

To me this is similar to the Teensy 3.x boards, for example T3.2. The first two USARTS have fifo capability, others do not. These are used for Serial1 and Serial2 (USB is its own sub-system). And it is far better to use these two Serial objects for high speed data. Like the Robotis servos which I was testing yesterday at 1MBS. In some cases, I have bumped up the servo baud rates to 2 or 3mbs, I have tried up to 3.5.

Changing Serial (D1 = P302 & D0 = P301) to use SCI0 instead of SCI2 does not require a change in FSP code.

It is defined in the file : variants/UNOWIFIR4/pinmux.inc
line 105 change

SCI_CHANNEL|PIN_TX_MOSI_SDA|CHANNEL_2|SCI_EVEN_CFG|LAST_ITEM_GUARD
to
SCI_CHANNEL|PIN_TX_MOSI_SDA|CHANNEL_0|SCI_EVEN_CFG|LAST_ITEM_GUARD

I could be wrong: But I think which SCI channel is used is dictated by the hardware.
That is from my Excel document I have:

<style> </style>
WIFI Main Usage Minima I/O ports LQFP64 Power, System, Clock, Debug, CAC, VBATT Interrupt AGT GPT_OPS, POEG GPT RTC USBFS,CAN SCI IIC SPI SSIE ADC14 DAC12, OPAMP ACMPLP SLCDC CTSU
0 Serial1 0 P301 31   IRQ6 AGTIO0 GTOULO GTIOC4B     RXD2/ MISO2/ SCL2 CTS9_ RTS9/ SS9   SSLB2         SEG01/ COM5 TS09
1 Serial1 1 P302 30   IRQ5   GTOUUP GTIOC4A     TXD2/ MOSI2/ SDA2   SSLB3         SEG02/ COM6 TS08
It implies to me that these pins are tied to a specific SCI object.

For what it is worth, my excel document I mentioned is up at https://github.com/KurtE/UNOR4-stuff/blob/main/arduino_uno_r4.xlsx

Which I used Excel to extract from the Renesas RA4M1 Datasheet Table 1.7 pages 23-26
And then add the mapping to which pins on MINIMA and WIFI and have pages for each.
But my understanding may not be correct.

@paulvha
Copy link
Contributor

paulvha commented Jul 24, 2023

Hi Kurt

I had seen (and downloaded) your spreadsheet already. Thanks for making and sharing that !!!
Indeed, defined by default, it is using SCI-2

The way it works is that in serial.cpp, when you do begin(), it will call for cfg_pins().

Here based on the provided pins (D0, D1), the real processor port & pin will be obtained for a structure defined in variant.cpp in the variants/UNOWIFIR4-folder. It will also obtain a list of pin options (defined in pinmux.inc in the same variants folder).
This list is then parsed to find the common SCI channel (GET_CHANNEL) which is stored as channel. Actually it will set channel twice for no good reason.....

That channel will then be used in begin() to initialize and use as SCI-channel.

regards,
Paul

@paulvha
Copy link
Contributor

paulvha commented Jul 24, 2023

I am going to correct myself. Having studied the datasheet a bit more in detail, SCI0 can only be applied to a limited number of port / pin (PSEL as part of PSF-register)

SCI0 -RX : P101 (D18 / A4/SDA), P104 (D2), P206 (Led Matrix), P410 (NC)
SCIO-TX : P100 (D19 / A5/SCL), P205 (Led Matrix), P411 (D11)

The combination of the pins D18/D19 as you used before will work, as well as using D2/D11. So the earlier mentioned change to pinmux.inc will not work.

regards,
Paul

@per1234 per1234 added type: enhancement Proposed improvement topic: code Related to content of the project itself labels Aug 8, 2023
@bplu4t2f
Copy link

bplu4t2f commented Apr 8, 2024

This issue is a bug, not a potential "enhancement". And a rather hefty one IMO.
Serial.write() is supposed to be asynchronous, not to speak of the various other issues already mentioned by KurtE (many thanks for your work).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic: code Related to content of the project itself type: enhancement Proposed improvement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants