Skip to content

Crash and missing data #41

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pyrog opened this issue Jun 23, 2021 · 10 comments
Closed

Crash and missing data #41

pyrog opened this issue Jun 23, 2021 · 10 comments

Comments

@pyrog
Copy link

pyrog commented Jun 23, 2021

Subject of the issue

During a 1 hour and 45 minutes hiking, got:

  • 1 reboot (during 13 seconds)
  • 3 "holes" (missing data during 19.750, 12 and 14.750 secondes).

Your workbench

  • RTK_Surveyor_Firmware_v14RC_June22

  • Default settings SFE_Surveyor_Settings.txt

  • Device connected over Bluetooth with Lefebure NTRIP Client

  • Device and phone powered with their internal batteries

Feature request

  • Log reboot debug
  • You could use NMEA TXT sentence for that ?
@nseidle
Copy link
Member

nseidle commented Jun 23, 2021

Thanks for the settings file! It's helpful to be able to see things like your current firmware, ZED firmware, and sentences enabled.

Log reboot reason to log file - good idea. But oh man, you're going to make me write a NMEA generator :) I'll see what I can do.

@nseidle
Copy link
Member

nseidle commented Jun 23, 2021

FYI - we were recording a brief message if the log was reused.

nseidle added a commit that referenced this issue Jun 23, 2021
@nseidle
Copy link
Member

nseidle commented Jun 23, 2021

Ok, this is what it looks like:

image

New binary is on the release page.

@pyrog
Copy link
Author

pyrog commented Jun 24, 2021

Update firmware at home this morning:

$GNTXT,01,01,01,ESP_RST_SW*47

First test on the field (20 mn car trip):

$GNTXT,01,01,01,ESP_RST_POWERON*1D

SFE_Surveyor_Settings.txt

Note: the firmware version could be also in this (or another) GNTXT NMEA sentence.

@nseidle
Copy link
Member

nseidle commented Jun 24, 2021

FYI - I have a few leads on the crash issue. I believe it is related to the SPP Bluetooth and too many mallocs that are leading to a loss of heap, that eventually leads to system reset. Below is a log of ~1 hour with Lefebure NTRIP sending data, logging, a handful of extra UBX messages turned on.

FreeHeap: 105796 / MinFreeSize: 336 / LargestBlock: 69088
FreeHeap: 105796 / MinFreeSize: 336 / LargestBlock: 69088
FreeHeap: 105796 / MinFreeSize: 336 / LargestBlock: 69088
FreeHeap: 105796 / MinFreeSize: 336 / LargestBlock: 69088
UBX file size: 36271051 - Generation rate: 12.8kB/s - Write speed: 5374.7kB/s
FreeHeap: 105796 / MinFreeSize: 336 / LargestBlock: 69088
FreeHeap: 97488 / MinFreeSize: 336 / LargestBlock: 58096
FreeHeap: 105796 / MinFreeSize: 336 / LargestBlock: 69088
FreeHeap: 101676 / MinFreeSize: 336 / LargestBlock: 64916
FreeHeap: 101272 / MinFreeSize: 336 / LargestBlock: 64036
UBX file size: 36335663 - Generation rate: 12.9kB/s - Write speed: 5300.4kB/s
FreeHeap: 101676 / MinFreeSize: 336 / LargestBlock: 64916
FreeHeap: 101240 / MinFreeSize: 336 / LargestBlock: 64056
FreeHeap: 101676 / MinFreeSize: 336 / LargestBlock: 64936
FreeHeap: 93560 / MinFreeSize: 336 / LargestBlock: 55000
[E][BluetoothSerial.cpp:184] _spp_send_buffer(): SPP Write Congested!
FreeHeap: 82488 / MinFreeSize: 336 / LargestBlock: 44144
UBX file size: 36400263 - Generation rate: 12.9kB/s - Write speed: 5179.2kB/s
[E][BluetoothSerial.cpp:184] _spp_send_buffer(): SPP Write Congested!
FreeHeap: 72608 / MinFreeSize: 336 / LargestBlock: 31916
FreeHeap: 59316 / MinFreeSize: 336 / LargestBlock: 22104
[E][BluetoothSerial.cpp:184] _spp_send_buffer(): SPP Write Congested!
FreeHeap: 49720 / MinFreeSize: 336 / LargestBlock: 12580
[E][BluetoothSerial.cpp:184] _spp_send_buffer(): SPP Write Congested!
FreeHeap: 49536 / MinFreeSize: 336 / LargestBlock: 5796
FreeHeap: 72780 / MinFreeSize: 336 / LargestBlock: 29096
UBX file size: 36464811 - Generation rate: 12.9kB/s - Write speed: 5496.7kB/s
FreeHeap: 85252 / MinFreeSize: 336 / LargestBlock: 39996
FreeHeap: 101208 / MinFreeSize: 336 / LargestBlock: 64500

You can see a few things:

  1. The FreeHeap takes a nose dive from ~105k to ~50k and this is associated with SPP congestion. Also, the MinFreeSize is showing the low water mark for the heap. Earlier in this log there are more SPP congestions and _spp_queue_packet(): SPP TX Packet Malloc Failed! errors. By the end of the log the MinFreeSize is 140 indicating that at one point the ESP32 had only 140bytes of free RAM which is a hair's breadth away from crashing/restart. Interestingly, this log did not experience a system reset. As shown in the log the system does recover the heap over time but a series of events lead to the heap being pulled down badly.

My current theory is that the Bluetooth link is degrading which leads to backups and additional buffers being malloc'd.

So, @pyrog and @Stefal: When you are testing in the field, how far away is the phone or tablet from the RTK Surveyor? I suspect >10m from the RTK Surveyor will begin to cause these types of SPP issues over a period of minutes.

FYI - The way I am introducing a 'bad link' is by putting my phone in the microwave (and moving the kid's ladder away so that my 1 year old doesn't cook my phone). The microwave is a great 2.4GHz Faraday cage. Fun fact: my phone got a call while it was in the microwave because a microwave is designed to block 2.4GHz, not 1.4GHz or whatever my LTE is operating at (yay EE!).

@pyrog
Copy link
Author

pyrog commented Jun 25, 2021

When you are testing in the field, how far away is the phone or tablet from the RTK Surveyor?

Usually, less than 1 meter.
When the device crashed, the surveyor was in my hand and the phone was over my head.

Initially I thought more of a regression because it happened just after the firmware update?
But I also tested the new version of the Mapillary app that sometimes display error messages.
I guess that it could slow the phone, and send a burst of RTCM over the SPP??

@nseidle
Copy link
Member

nseidle commented Jun 25, 2021

Usually, less than 1 meter.

Ya, I figured it was near by but I was hoping maybe you had the phone far away for some reason. I will keep working on it.

After much searching it does seem that SPP performance varies between phones and manufacturers. I will attempt to make the RTK Surveyor as friendly as possible to all devices.

@pyrog
Copy link
Author

pyrog commented Jun 27, 2021

In all my later tests, I didn't got any crash nor missing data 😃
I set Mapillary app to take pictures every 0.5 s minimum, or use Open Camera with the same settings.

My theory is that if Mapillary take pictures too fast (image min delay=0) and/or the phone memory is nearly full, the SPP consuming app in Android couldn't respond fast enough, then provoque a SPP congestion…
… so the RTK Surveyor alloc memory to store data during the SPP congestion.

So this is mainly a smartphone issue 😅

It could be reproduced this ways?

  • write a BlueTooth app that consumes SPP data too slow
  • stress the smartphone with an app that consume too much CPU and/or memory to slow it

@nseidle
The RTK Surveyor could display a fast blinking red led during SPP congestion to warn user?
And don't alloc to much memory to avoid a reboot.

@nseidle
Copy link
Member

nseidle commented Jun 28, 2021

Ok, I think we're starting to solve this issue!

Is it acceptable to not transmit data when Bluetooth becomes congested? Data logging will remain unaffected? I think so but I'd like to know your thoughts.

@pyrog - I agree. As the link degrades (either because the phone is busy or the RF interference increases) the BluetoothSerial library attempts to buffer more and more data causing heap shortages. To avoid this I've added a isCongested check before adding new NMEA data from the ZED into the SPP buffer. This allows ZED data to always be recorded to SD, but rejected/dropped if the BT SPP is congested. This setting called throttleDuringSPPCongestion is accessed in the debug menu and is true by default (drop NMEA/UBX/etc data if BT is bad).

11:23:48.281 -> [D][BluetoothSerial.cpp:302] esp_spp_cb(): ESP_SPP_CONG_EVT: CONGESTED
11:23:48.376 -> [D][Tasks.ino:65] F9PSerialReadTask(): Dropped SPP Bytes: 1170
11:23:48.609 -> FreeHeap: 104392 / HeapLowestPoint: 59572 / LargestBlock: 67736
11:23:48.609 -> [D][BluetoothSerial.cpp:302] esp_spp_cb(): ESP_SPP_CONG_EVT: CONGESTED
11:23:48.689 -> UBX file size: 14653071 - Generation rate: 18.1kB/s - Write speed: 33381.5kB/s
11:23:48.983 -> Rover Accuracy (m): 0.0141
11:23:49.030 -> [D][BluetoothSerial.cpp:302] esp_spp_cb(): ESP_SPP_CONG_EVT: CONGESTED
11:23:49.313 -> [D][BluetoothSerial.cpp:302] esp_spp_cb(): ESP_SPP_CONG_EVT: CONGESTED
11:23:49.360 -> [D][Tasks.ino:65] F9PSerialReadTask(): Dropped SPP Bytes: 930
11:23:49.595 -> FreeHeap: 104392 / HeapLowestPoint: 59572 / LargestBlock: 67736
11:23:50.065 -> [D][BluetoothSerial.cpp:302] esp_spp_cb(): ESP_SPP_CONG_EVT: CONGESTED
11:23:50.298 -> [D][BluetoothSerial.cpp:302] esp_spp_cb(): ESP_SPP_CONG_EVT: CONGESTED
11:23:50.627 -> FreeHeap: 104392 / HeapLowestPoint: 59572 / LargestBlock: 67736
11:23:50.627 -> Rover Accuracy (m): 0.0141

Above you can see the F9PSerialReadTask dropping bytes (1170, 930, etc) but the heap is maintaining. I've been running a harsh test and so far, so good: My phone is in a metal drawer. RTK is covered by an ESD bin. This is enough to cause a lot of RF problems, but not drop the link entirely. I've got a lot of messages turned on to increase the TX to ~18k/s.

The downside to throttleDuringSPPCongestion is that bytes may be dropped (obviously). But testing with SW Maps it's not noticeable to the end user because the app checks for valid checksums before consuming the NMEA data. If a user is logging the data over Bluetooth then they will be most affected by the dropped data. But local logging should be uncorrupted.

I'll add the June28 RC binary to the release page. Please give it a try and let me know what you think.

@nseidle
Copy link
Member

nseidle commented Jul 22, 2021

I believe we've reduced the system crashes due to BT congestion quite a bit (I haven't seen a crash in ages) but there will be dropped data if the link becomes congested. Closing. Please open a new issue if you discover anything new.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants