Skip to content

UART RX Buffer Corruption in HardwareSerial #421

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jeffstaley opened this issue Jan 27, 2019 · 13 comments
Closed

UART RX Buffer Corruption in HardwareSerial #421

jeffstaley opened this issue Jan 27, 2019 · 13 comments

Comments

@jeffstaley
Copy link

I'm running Marlin on a 3d printer using a RemRam board (STM32F7) and I'm using the UART for communication to the host. During a 3d print there's a large amount of UART communication. At 115200 baud I am mostly successful but at 250000 I am 100% guaranteed to hang at some point in the print. I was able to trace the problem down to what I strongly believed to be a corrupted HardwareSerial RX buffer.

I looked at the code but I couldn't see anything obviously wrong. But I did notice that you're relying on atomic writes on the buffer indexes for this to work correctly. That always worries me especially with non-32-bit writes and potential alignment issues.

I ended up putting critical sections around all of HardwareSerial's access functions. For instance,

int HardwareSerial::read(void)
{
  int r;
  ATOMIC_BLOCK(ATOMIC_RESTORESTATE) {
    // if the head isn't ahead of the tail, we don't have any characters
    if (_serial.rx_head == _serial.rx_tail) {
      r = -1;
    } else {
      unsigned char c = _serial.rx_buff[_serial.rx_tail];
      _serial.rx_tail = (rx_buffer_index_t)(_serial.rx_tail + 1) % SERIAL_RX_BUFFER_SIZE;
      r = c;
    }
  }
  return r;
}

Once I did this I went from 100% failure to 100% success. I suggest that this is evidence for a race condition in HardwareSerial. I've written similar code for other ARM chips and I've always taken the more conservative approach of wrapping the code in critical regions and not relied on a compiler's atomic write which I consider to be a bit too problematic.

If you do decide to implement critical regions you don't need to do the heavy hammer approach I did by turning off all interrupts, instead you could do a more surgical approach of just turning off the appropriate UART interrupt. Just remember to consider whether interrupts are already off (you don't want to turn them back on if someone has turned them off) and make sure you consider how to stop the compiler optimizer from re-order memory access. I'll be happy to share how I did all that if it's helpful to you.

@jeffstaley jeffstaley changed the title RX Buffer Corruption in HardwareSerial UART RX Buffer Corruption in HardwareSerial Jan 27, 2019
@fpistm
Copy link
Member

fpistm commented Jan 28, 2019

Hi @jeffstaley
Thanks for the feedback.
All input are welcome. Mainly to try reproduce.
Anyway, did you manage to increase RX/TX buffer size ?
By default , the Rx and Tx buffer of the core is 64.

#if !defined(SERIAL_TX_BUFFER_SIZE)

So, I've changed it to 256 and now it work.
You can redefined them by adding a new file (Tab on Arduino IDE) named build_opt.h
Then in this file put:
-DSERIAL_RX_BUFFER_SIZE=256 -DSERIAL_TX_BUFFER_SIZE=256

Increasing only one of both works also but I advise to keep them aligned.

One other thing which could raise error is the use of I/DCache for cortex M7. By default it is activated,

#ifndef I_CACHE_DISABLED

you can disable them by adding
-DI_CACHE_DISABLED -DD_CACHE_DISABLED

@jeffstaley
Copy link
Author

I did try it with various sizes of SERIAL_RX_BUFFER_SIZE without any improvement. But I was only changing the RX buffer size since I didn't see any reason to keep them in sync. I'll run it with them in sync and see if that makes any difference.

I even changed the rx_head and tx_head to uint32_t in case it had something to do with non-native size but no luck.

Cache! I totally forgot that this has a cache. I better go do some reading on that. Meanwhile I'll try disabling it.

And thanks so much for the prompt response.

Jeff

@hasenbanck
Copy link
Contributor

@fpistm I thought the F7 D-cache would only be problematic when using DMA? (I ran into issues with it in other projects where I used DMA and an external FPGA based SRAM interface, so I deactivated it there)

If this problem should be indeed D-cache related, we should disable it for the RemRam variant (because Marlin is the main target anyhow). But lets see if it's really the culprit before deciding to do so.

@jeffstaley
Copy link
Author

BTW, in terms of full disclosure, I am compiling this in PlatformIO. I do see the same problem with the Arduino IDE though. I haven't figured out a clean way getting a complete copy of your latest master into platformIO. But I have made sure that I'm using the latest versions of uart.* and HardwareSerial.*. If you don't think this is enough please let me know and I'll figure something out.

@fpistm
Copy link
Member

fpistm commented Jan 28, 2019

I think you use the lastest version oof HardwareSerial.
Anyway it should be fine to test increasing buffer.
And if you have a test case to reproduce this should be fine.
If you think having a clean solution for that do not hesitate.

@jeffstaley
Copy link
Author

Unfortunately the test case for this involves firing up the printer and doing a half-an-hour print. I've tried to reduce the test case down to something that we pin point the problem, but I haven't be successful. Fortunately, although lengthy, it does fail 100%.

@jeffstaley
Copy link
Author

Adding -DSERIAL_RX_BUFFER_SIZE=256 -DSERIAL_TX_BUFFER_SIZE=256 made no difference. Still have a corrupted buffer.

-DI_CACHE_DISABLED -DD_CACHE_DISABLED made my stepper drivers unstable. Couldn't even home the machine without the steppers going crazy. Not sure what's happening there. Interrupt timing issues? @nils any chance you can try disabling the cache on your printer?

@hasenbanck
Copy link
Contributor

@jeffstaley Disabling the D and I Cache doesn't seem to degrade my stepper performance in any way. I will start a test print to make sure.

@jeffstaley
Copy link
Author

I wonder what's going on with mine? A simple home command starts OK but quickly the steppers start chattering and heading out into left field. Need to think on how to debug this....

Thanks for checking.

@jeffstaley
Copy link
Author

TL;DR Nothing to see here at this point

With PlatformIO being problematic with the disabling the cache causing stepper timing issues. I went back to testing with the Arduino IDE with a fresh rebase of the STM32 Core. (It doesn't have any issues with the cache disabled) And initially found that the buffer problem occurs no matter whether the RX&TX buffer size was increased or whether the cache was enabled or disabled. Only putting the critical sections around the RX access functions would allow it to pass the test case.

And then -- after countless fails -- I can't get it to fail, no matter what.

So, either, there's been a subtle shift in timing and the race condition isn't occurring. Or there's something flaky with my build environment -- which I have to seriously consider especially with PlatformIO's cache issues. Or something else.

No matter what, without a solid 100% test case -- which I don't have anymore -- I don't want to waste anyone's time on this. It's probably best to close this out until I figure something out at this end.

@fpistm
Copy link
Member

fpistm commented Jan 29, 2019

Thanks for the feedback @jeffstaley

@jeffstaley
Copy link
Author

Thanks for helping.

@SegevTech
Copy link

How do you use ATOMIC BLOCK library in STM32?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants