Skip to content

Unexpected reboot #54

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Eric-FR opened this issue Aug 30, 2021 · 35 comments
Closed

Unexpected reboot #54

Eric-FR opened this issue Aug 30, 2021 · 35 comments

Comments

@Eric-FR
Copy link

Eric-FR commented Aug 30, 2021

Express.
RTK_Surveyor_Firmware_v15RC-Aug13

I was doing some testing in static conditions with Lefebure client on my smartphone. Then, I saw in Lefebure "No data" for several seconds followed by GPS, DGPS, FloatRTTK and RTK. This appended several times. Looking at the Express's screen, the Express was restarting each time, first this splash screen, second with "Rover" and back to recording.

image

The attached picture is the ubx file opened in RTKLib/RTKPlot as a solution. One can see missing data.

image

If we convert the ubx file to rinex with RTKConv and look at it with RTKPlot as observation data, data are also missing time to time.

image

Looking at dynamic recording, missing data were also observed (open field road).

UBX file zipped:

SFE_Express_210828_222211.zip

Express settings:

SFE_Express_Settings.txt

Investigation are under way to check if the same problem was present in earlier recording.

@Eric-FR
Copy link
Author

Eric-FR commented Aug 31, 2021

Looking at a former recording (August 8) in static condition on my balcony without NTRIP corrections, I can also find missing data time to time:
image

@nseidle
Copy link
Member

nseidle commented Aug 31, 2021

Thanks for reporting!

Can you interpret the time scale for me - is there five gaps of data between ~22:27 to ~22:38? So 5 reboots in about 11 minutes?

From your settings file: You have SBAS enabled but QZSS disabled (that's interesting). Why?

You have SFRBX and RAWX enabled and you are logging. You are sending NTRIP over Bluetooth. You have the latest ZED firmware (v1.13).

Is there anything else I should know to try to replicate?

@Eric-FR
Copy link
Author

Eric-FR commented Aug 31, 2021

Yes, I had several reboots in a couple of minutes.
I recorded SBAS because because we are receiving EGNOS signal in Europe and I may use it for post-processing.
I'm not supposed to receive QZSS.

After reviewing past data, it seems that the problem appear on August 8th. Looking at my download folder, I downloaded "RTK_Surveyor_Firmware_v15RC-Aug6.bin" on August 7th evening. I found first missing data on August 7th at 22:31. So, I'm quite sure that reboots appear after this firmware update.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 2, 2021

Yesterday, I made measurement WITHOUT external power. I still have reboot:

image

image

Forward and backward travels with NTRIP corrections.
Static stay on parking without correction nor connexion.

Zoom on the previous figure:

image

So external powering is not the source of the trouble. NTRIP/BT connexion seems also not to be the problem.

@nseidle
Copy link
Member

nseidle commented Sep 2, 2021

I also see lots of reboots with RC-Aug13, with RAWX logging, 4Hz updates, and BT connected with NTRIP. While I am not yet certain why Aug13 is such a problematic version, I am having very good stability with Aug23 and Sept2 binaries. I'll keep looking at the file diffs to see if I can see why Aug13 became problematic. In the mean time, please try RC-Sept2.

@nseidle
Copy link
Member

nseidle commented Sep 2, 2021

I did not know, but should have known there is a Lefebure desktop version. This will allow me to run much longer tests (multi-day) with Bluetooth connected.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 2, 2021

I just made a 2 hours trial on my balcony without BT connexion. I still can see missing data. So BT is not the problem.

I think the problem appears with Aug6.

Will make a try with Sept2.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 2, 2021

Update to Sept2 done.

Indeed, as I'm no staying in front of the Express display, I'm not sure about what's happening.

Fore sure, quite often (~1 mn?), I can see the SD icon blinking (instead of the progress bar inside).

On Lefebure client, I can see connection lost:

Screenshot_20210902-224558_Lefebure NTRIP Client

And I didn't see reboot on the Express at the same time. I will look at data later.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 2, 2021

Data are also missing at the same time:

image

@Eric-FR
Copy link
Author

Eric-FR commented Sep 4, 2021

I downgraded to firmware 1.4.
I ran a 8 hours test this night with 4 Hz rate and NMEA+RAWX+NAV messages with BT/NTRIP correction. Data are fine. So, the problem is in the v1.5.

@nseidle
Copy link
Member

nseidle commented Sep 12, 2021

I'm trying hard to reproduce this but I'm having problems. I ran a 4 day log + Lefebure desktop without a single reset. I realize you are logging (and using the logs to detect system resets). Do you have any idea if the system is resetting when then microSD card is not inserted? It would help me pin down where to look.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 14, 2021

I did several runs without µSD. Antenna was installed on top a tripod on my balcony, the receiver being just below. No physical interactions during measurements.
Receiver connected with Lefebure on smartphone.
I usually used wifi connection to get NTRIP correction but I also tried with cellular.
I tried with defaults parameters but also adjustment to my usual parameters using wifi configuration.
Soon or later, I observe restart in the reception on smartphone. See attached picture.

Screenshot_20210914-223705_Lefebure NTRIP Client

At 10:35:28, you see that BT connection with receiver is lost. We get it back several seconds latter.

@nseidle
Copy link
Member

nseidle commented Sep 15, 2021

Good feedback! Thanks. It's helpful to know the reset occurs regardless of microSD presence.

@pyrog
Copy link

pyrog commented Sep 15, 2021

It seem an BT issue?
Is it possible to send RTCM stream by WIFI or USB?

@nseidle
Copy link
Member

nseidle commented Sep 15, 2021

Yes it is possible to send RTCM over USB. WiFi would be far more complicated than BT.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 15, 2021

Considering that the problem also occurs without BT connection (but with SD card), one test would be to completely shut down the BT on the ESP32. Is it possible with the present firmware? Or could you add such an option? Anyhow, it would be interesting for people only willing to do post-processing.

@nseidle
Copy link
Member

nseidle commented Sep 15, 2021

@Eric-FR - I missed your earlier msg about BT being turned off as well.

I can replicate. The reset is very odd in that the ESP32 doesn't report its stack trace. Is the stack trace getting corrupted by a non-thread safe task?:

14:43:29.345 -> [D][BluetoothSerial.cpp:302] esp_spp_cb(): ESP_SPP_CONG_EVT: CONGESTED
14:43:29.486 -> Rover Accuracy (m): 0.0141
14:43:30.140 -> ets Jul 29 2019 12:21:46
14:43:30.140 -> 
14:43:30.140 -> rst:0x8 (TG1WDT_SYS_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
14:43:30.140 -> configsip: 0, SPIWP:0xee
14:43:30.140 -> clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
14:43:30.140 -> mode:DIO, clock div:1
14:43:30.140 -> load:0x3fff0018,len:4
14:43:30.140 -> load:0x3fff001c,len:1216
14:43:30.186 -> ho 0 tail 12 room 4
14:43:30.186 -> load:0x40078000,len:10944
14:43:30.186 -> load:0x40080400,len:6388
14:43:30.186 -> entry 0x400806b4
14:43:30.840 -> ZED-F9P firmware: HPG 1.13
14:43:30.840 -> 
[D][esp32-hal-adc.c:164] __analogReadMilliVolts(): eFuse Vref: Supported
14:43:30.840 -> [I][esp32-hal-adc.c:194] __analogReadMilliVolts(): ADC1: Characterized using eFuse Vref: 1128
14:43:30.840 -> 
14:43:30.840 -> SparkFun RTK Express v1.5-Sep 15 2021
14:43:30.840 -> Reset reason: ESP_RST_INT_WDT

@nseidle
Copy link
Member

nseidle commented Sep 15, 2021

It seems the Lefebure from my Android causes a significant amount of SPP congestion that the desktop Lefebure does not. I suspect it is the different Bluetooth chipsets (USB dongle vs. Phone).

Ah! Ok I think perhaps I have it. The read and write buffers were set to 4k where as the SPP buffer was set to 2k. When transmitting larger data streams (ie, RAWX and SXFRB) the SPP connection would become overwhelmed. By lowering the the UART buffer size to 2k to match SPP buffer, the SPP congestion disappears!

Other changes:

  • Read and Write tasks are now priority 1. This is an increase (originally 0 = lowest, 3 = highest) and should equate to better logging performance but I need to verify this.
  • The ButtonCheck task is now priority 0. This means the buttons are checked slightly less often but I don't notice it. Let me know if the user experience is noticeably worse when using the SETUP button for menu selection.

Please give the latest Sept15 binary a try and let me know if you see any improvement and/or failures.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 17, 2021

So, I made a static test this night with poor sky view. With µSD, with NMEA+RAWX recording at 1 Hz and RTCM form Lefebure android smartphone.

There was a suspicious event with fix lost but looking at ubx log file and converting it to rinex with RTKCONV, no raw data are missing. There was a single (1 s) jump of the instantaneous position 20 km away. So, it seems to be right. I will do dynamic test this week-end.

BTW, one may also reduce the load on the BT by not sending all messages to the smartphone. Indeed, I only need NMEA on the smartphone but all of the messages I asked on the µSD.

@nseidle
Copy link
Member

nseidle commented Sep 17, 2021

Thanks for reporting. I've been running a battery of tests and I'm getting lots of reboots. There's still a problem lurking.

Setup: 4 Hz, 7 messages (NMEA 5 +RAW 2) enabled.

Reboots with:

  • Full firmware: yes
  • Logging removed: yes
  • Button task removed: yes
  • WiFi/BT removed: yes

At this point I'm beginning to think it may be an I2C issue with the ZED-F9P that become worse as more messages are enabled. I've designed in a reset counter that is displayed on the OLED so it is much easier to see/catch reboots (rather than having to look at logs or terminal output).

BT - yes, we could reduce BT load but there's a few issues. 1) That's not currently what is resetting my units so I'm less concerned about BT congestion. 2) What is sent over BT is also what is logged. I can design around this (and log all data over I2C) but I worry it will pose it's own problems. I'll leave the BT traffic issue for another day :)

@Eric-FR
Copy link
Author

Eric-FR commented Sep 17, 2021

My own logger is using the following design (at bottom with I2C):https://github.com/PaulZC/F9P_RAWX_Logger/blob/master/HARDWARE.md

I2C is used for F9P configuration while UART1 is used for data transfer. I added a BT module on UART2 for incoming RTCM and NMEA output. It is very stable but the big drawback is that you can't connect to the adalogger for configuration or whatever you want. (And I found on a forum that PaulZC indeed struggled to implement buffering of incoming data from the F9P, due to the high quantity).

@Eric-FR
Copy link
Author

Eric-FR commented Sep 19, 2021

Yesterday, I did pedestrian mapping. Data were right.

Yesterday evening, I upgraded to stable v1.5.

Today, I drove on the highway and I observed several reboot. Indeed, the behaviour on data is not exactly the same. It looks like F9P didn't restart at the same time.

I will elaborate on it later.

@nseidle
Copy link
Member

nseidle commented Sep 20, 2021

After a weekend filled full of overnight tests, I think I've solved the underlying issue. NMEA should have been turned off of the I2C interface in this commit but COM_TYPE_UBX | COM_TYPE_RTCM3 is not a valid setting for a port. This combination causes the module to go into COM_TYPE_UBX | COM_TYPE_I2C | COM_TYPE_RTCM3 re-enabling NMEA. If there are messages enabled to be sent over I2C (and they are by default) then the I2C bus gets a lot of traffic at 4Hz. The ZED is known to have some odd I2C errors (1, 2). My current theory is these illegal I2C states are causing the ESP32 to lockup without a stack trace. I cannot verify but with the solution below, I have been able to run multiple 20+ hours tests on multiple units without system resets.

The solution was to be purely in UBX protocol on the I2C interface, and to turn off all NMEA messages from the I2C interface. If we need RTCM (for example during WiFi NTRIP broadcasting) then we re-enable NMEA+RTCM on I2C during broadcast. Because broadcast is 1Hz, and NMEA messages are disabled, I have not seen a unit reset during NTRIP broadcast over WiFi either.

I am still seeing bytes dropped during logging. I would have hoped the priority increase on the record task would have aided the logging issue but it has not.

image

I am going to leave this issue open for a bit. Please test with v1.6 of the firmware.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 21, 2021

Should I keep my test sample a couple of more days to check the new firmware? I'm usually testing on week-end.

About my last week-end test, there are two new points:

  • while the ESP32 was rebooting, the F9P was not and I was instantaneously getting RTK fix at the moment recording was resumed. Before, I was getting back to start with first GPS fix, sometime DGPS, then FloatRTK and finally RTK.
  • crashes were likely appearing at high speed or, at least, while moving with car. This may explained why data were OK in static or pedestrian measurements. Also, at the end of travel, I was stuck in traffic jam and I didn't get reboot. I didn't noticed such behavior with earlier versions of firmware. Especially, I was observing reboot in static condition with earlier versions.

@pyrog
Copy link

pyrog commented Sep 22, 2021

Your workbench

  • RTK_Surveyor_Firmware_v16.bin
  • Default settings

Log of 1 hour and 26 mn : a walk along a lake.
4 reboots where visually detected (no measurements during 10-15 seconds).
Got 4 messages ESP_RST_INT_WDT in the log

2021-09-21T15:21:17.250Z
2021-09-21T15:34:34.250Z
2021-09-21T16:10:11.500Z
2021-09-21T16:29:53.750Z


Log file sender in private mail

@nseidle
Copy link
Member

nseidle commented Sep 22, 2021

Arg. This is a hard bug to pin down. Thank you for taking the time to test. A few things:

  • For easy identification of reboots, you will find a new option in the debug menu to enable reset count. This will turn on a small number between the SIV and logging icon that will indicate number of resets since power on. I test the firmware by turning on two units (RTK Express and RTK EXpress Plus) and letting them run for a few hours. This number is much easier and faster for me to monitor than checking logs.

image
(Above image showing '0' resets)

I do not yet have a good explanation as to why the resets are occurring. I will do some more testing and get you a new binary in the coming days.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 22, 2021

I'm also getting numerous:
$GNTXT,01,01,01,ESP_RST_INT_WDT*08

I will look at the reset counter.

@pyrog
Copy link

pyrog commented Sep 23, 2021

Two more tests to go to work this night
No reboot in the outward journey but corrupted data at the beginning of the log file

$GNTXT,01,01,01,ESP_RST_POWERON*1D
”�aC �L∞�Ç����(
���� ����“îRìîUí‘�ë§_Ù�≤˘ú¨�ÊÕ≠�ËÕ�Ö,Ëÿl�ß∏n<áûŒ'~‹Ü�¥ò�p�EŸ�≠�ÙZ¿�î˛:€‘’EUP�À∞∫⁄n∑�h!<”�äC¿�quŸB��`∏`����� Ä���}¢¶¶&"ß*©ç�ã∏‚�‚»�O�±<„˚�Ú†à0ÈA´F“åq!2@Wõ^;Ït¡Ôã�bŸ˛√�˘≠‡à
�ʶ�ôp3f@∆À��Ñ�ˆ◊Œdæ7�y
=UUU5SMT�%ÛcVU]ï∏SM≥������������� Ëc”�7D`�L∞�Ç���Ñ�@���� ���zöã�Æ:��Ÿ0Ωëw$aª¸�Ôº�àpGS�{0’=�ˆUhI»-”�?F@�LØ9¿��� 	@Ä��� ���~�e9I;˚�™-)‘ÊJ�†õ@¥Õ�?2$�Í‚�iÄ-�^øïƒ‘»)≤˚/øW±$GNRMC,172034.00,V,,,,,,,220921,,,N,V*10
$GNGGA,172034.00,,,,,0,00,99.99,,,,,,*7B
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,1*33
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,2*30
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,3*31
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,4*36
$GNGST,172034.00,0.0000,2546276,2546276,0.0000,1024246,1024246,1024246*53
”�aC �L∞$"����(
���� ����“îRìîUí‘�ë§gÛ�rŸú≠�fÕ£ª1Iw·@Ï∻ᖃìá^…3˚%(�rflÇ·?cÅ�Y˙�W`1&·∫…T’EUP�À∞∫⁄ƪ�µ>a”�äC¿�quË‚��`∏`����� Ä���}¢¶¶&"ß*©ç0{º„�‚∏�Oi´–Δ+¢¿[ é•˝¸ywΩŸ
∞�í'¬ÔÉ~è�˘‡∑Ù…/—∆†ZŒÅ1¥�‡∑‡�fl{/˚��·g �_æêˇ+ùUUU5UKT�%ÛcV5]ï∏S-≥�������������◊r¢”�7D`�L∞$"���Ñ�@���� ���zöã�Æ<�É�6ÇêÛª"tóz�Ï�kÔ±Î~é∫U=�ˆMp�¿∂”�?F@�LØI`��� 	@Ä��� ���~�e9I;˚�*-1”÷™;ß|>aì(æµ9Éπ©ÔäÔòD�ìÚD‘»)≤˚/‹û $GNRMC,172035.00,V,,,,,,,220921,,,N,V*11
$GNGGA,172035.00,,,,,0,00,99.99,,,,,,*7A
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,1*33
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,2*30
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,3*31
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,4*36
$GNGST,172035.00,0.0000,2546601,2545950,173,1024246,1024246,1024246*71
”�aC �L∞3¬����(
���� ����“îRìîUí‘�ë§kÚ�ÚπúØ�Ê›ô�yE«}ñÔ’Äìh¡�"áfi√√�mÇ��ÁÙ6ˇ~�˜˜mÏu0GΩa;�T’EUP�ÀØ∫⁄n∑�=.µ”�äC¿�qu¯Ç��`∏`����� Ä���}¢¶¶&"ß*©ç0k¿‰ù‚†�Os∞‰˜™WP¯1Ä:∫M2(c��* kÖYFíäfl/�|a
ÍîO®Û@1��äŸÈ��çÒ~0˘Ç�
�Ùfl≈�·BS�:ßUUU5UMT�%ÛaVU_ï∏R›s�������������4។�7D`�L∞3¬���Ñ�@���� ���zöã�Æ<��9(Hk!∞
BÅ�ò�'¯�Ï!§�’E�ˆUp�B<”�?F@�LØY���� 	@Ä��� ���~�e9I;˚�™�5”«�gå^��¿–û8{ÜyF�ƒ»�á g‚L‘»)≤˚/ Ñ$GNRMC,172036.00,V,,,,,,,220921,,,N,V*12
$GNGGA,172036.00,,,,,0,00,99.99,,,,,,*79
$GNGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99,1*33

For the return journey, 2 ESP_RST_INT_WDT in the file.
I saw F9P leds blinking normally, but RTK leds off during several seconds.

@pyrog
Copy link

pyrog commented Sep 24, 2021

Another test in v1.6

Trip Length Duration  Reboot  Coordinates imported Malformed sentences Checksum errors Zero Coordinates
Outward 54.4 km  59 mn 4 14 056 217 282 124
Return 57.8 km 1h 16 mn 4 18 275  310 595 34

Some corrupted sentences (with 2 $)

Trip Count
Outward 246
Return 380
$GNGSA,A,3$GNRMC,044802.00,A,4655.9178588,N,00553.0148696,E,0.006,,240921,,,A,V*13
$GN$GNRMC,052656.75,A,4648.5594416,N,00554.7458518,E,44.251,204.93,240921,,,R,V*29
$GNGSA,A,3,65,66,76,77,67$GNTXT,01,01,01,DGNSS long baseline (100 km)*3A
$GNGSA,A,3,66,76,77,67,,,$GNTXT,01,01,01,DGNSS long baseline (102 km)*38
$GNGSA,A,3,66,76,77,67,,,,,$GNTXT,01,01,01,DGNSS long baseline (102 km)*38
$GNGST$GNTXT,01,01,01,DGNSS long baseline (103 km)*39
$GNGST,180251.50,16,0.17,0.11$GNTXT,01,01,01,DGNSS long baseline (106 km)*3C

$GNRMC and $GNTXT seem overlaps previous sentences

@nseidle
Copy link
Member

nseidle commented Sep 24, 2021

Be careful not to mix issues. We are working on unexpected reboots on this issue, not necessarily corrupt sentences.

That said, your feedback pyrog is excellent. Please keep it coming! I recommend you report your findings on this issue if it is similar enough, or start a new one. Please include your settings file or a note about 'same settings as xyz' so that I can try to replicate.

Wrt unexpected reboots, I think I have a way forward. I've been testing v1.7 binaries for the last few days and have had good success. Two things changed: the way SPI was initiated and the ESP32 Arduino core was updated from v1.0.6 to v2.0.0. There was quite a few changes to the core that may help us but I suspect the real fix was at the IDF level.

For this reboot issue, please begin testing with v1.7RC and let me know how you fare.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 24, 2021

The starting point is that some data corruptions correspond to reboot. So, it is easier to find reboot bu looking at corrupted data.

Will upgrade to v1.7RC.

@Eric-FR
Copy link
Author

Eric-FR commented Sep 26, 2021

After firmware upgrade, I made two tests with NMEA+RAWX logging:

  • static overnight test with 1/3 with RTCM correction and 2/3 without correction
  • dynamic/bicycle test for 40 mn.
    No apparent reboot (on Express screen or in Lefebure). No trace of reboot in log files. No gap in data in log files.
    Still need to perform driving test.

@pyrog
Copy link

pyrog commented Sep 26, 2021

I made several tests with v1.7RC and the default settings (NMEA at 4Hz).

Description Duration  Length Reboot Malformed sentences Comments
Static test without BT 1h45 - 0 1 Did i play with WIFI AP during this test ?
Static test with BT 1h57 - 0 0
Car trip 0h44 42 km 0 0
Trek 2h35 8  km 0 0
Car return trip 0h38 27 km 0 4 Data corruption happened when I tried to setup WIFI AP (without success)

@pyrog
Copy link

pyrog commented Sep 26, 2021

Several small test this afternoon with same settings

Description Duration  Length Reboot Malformed sentences Comments
walk 0h15 872m 0 35 Errors during the first minute
walk 0h43 530m 0 0
walk 0h21 1838m 0 0

@nseidle
Copy link
Member

nseidle commented Sep 27, 2021

Thank you for testing the latest RC. It looks like we've pinned down the reboot issue. There's still work to fix the malformed sentences but I'm putting this bug to bed. Please open a new issue if the issue persists.

@nseidle nseidle closed this as completed Sep 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants