Unexpected reboot #54
Comments
Thanks for reporting! Can you interpret the time scale for me - are there five gaps of data between ~22:27 and ~22:38? So 5 reboots in about 11 minutes? From your settings file: You have SBAS enabled but QZSS disabled (that's interesting). Why? You have SFRBX and RAWX enabled and you are logging. You are sending NTRIP over Bluetooth. You have the latest ZED firmware (v1.13). Is there anything else I should know to try to replicate? |
Yes, I had several reboots in a couple of minutes. After reviewing past data, it seems that the problem appeared on August 8th. Looking at my download folder, I downloaded "RTK_Surveyor_Firmware_v15RC-Aug6.bin" on the evening of August 7th. I found the first missing data on August 7th at 22:31. So, I'm quite sure that the reboots appeared after this firmware update. |
Yesterday, I took measurements WITHOUT external power. I still get reboots: Forward and backward travels with NTRIP corrections. Zoom on the previous figure: So external power is not the source of the trouble. The NTRIP/BT connection also seems not to be the problem. |
I also see lots of reboots with RC-Aug13, with RAWX logging, 4Hz updates, and BT connected with NTRIP. While I am not yet certain why Aug13 is such a problematic version, I am having very good stability with Aug23 and Sept2 binaries. I'll keep looking at the file diffs to see if I can see why Aug13 became problematic. In the meantime, please try RC-Sept2. |
I did not know, but should have known there is a Lefebure desktop version. This will allow me to run much longer tests (multi-day) with Bluetooth connected. |
I just ran a 2-hour trial on my balcony without a BT connection. I can still see missing data. So BT is not the problem. I think the problem appeared with Aug6. I will try Sept2. |
Update to Sept2 done. Indeed, as I'm not staying in front of the Express display, I'm not sure about what's happening. For sure, quite often (~1 min?), I can see the SD icon blinking (instead of the progress bar inside). On the Lefebure client, I can see the connection lost: And I didn't see a reboot on the Express at the same time. I will look at the data later. |
I downgraded to firmware 1.4. |
I'm trying hard to reproduce this but I'm having problems. I ran a 4 day log + Lefebure desktop without a single reset. I realize you are logging (and using the logs to detect system resets). Do you have any idea if the system is resetting when the microSD card is not inserted? It would help me pin down where to look. |
I did several runs without µSD. The antenna was installed on top of a tripod on my balcony, the receiver being just below. No physical interactions during measurements. At 10:35:28, you see that the BT connection with the receiver is lost. We get it back several seconds later. |
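For anyone wanting to count resets from a log rather than from the display, the gap-hunting can be sketched roughly as below. This is only an illustration, not part of the firmware or of RTKLIB: `find_gaps`, the 1 Hz interval and the tolerance factor are all assumptions, and the epochs are made up. In practice the timestamps would come from the parsed log (for example the RINEX epochs produced by RTKCONV).

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, expected_interval_s=1.0, tolerance=1.5):
    """Return (last_epoch_before, first_epoch_after, seconds) for each hole.

    A hole is any step between consecutive epochs that is longer than
    expected_interval_s * tolerance. Both parameters are assumptions.
    """
    limit = timedelta(seconds=expected_interval_s * tolerance)
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > limit:
            gaps.append((prev, curr, (curr - prev).total_seconds()))
    return gaps

# Made-up 1 Hz epochs: seconds 0..9, then nothing until second 30.
start = datetime(2021, 8, 28, 22, 27, 0)
seconds = list(range(0, 10)) + list(range(30, 40))
epochs = [start + timedelta(seconds=s) for s in seconds]

gaps = find_gaps(epochs)
for before, after, length in gaps:
    print(f"no data from {before.time()} to {after.time()} ({length:.0f} s)")
```

A gap only shows that data is missing; it does not prove a reboot, so it is worth cross-checking against what the display or the terminal output shows.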
Good feedback! Thanks. It's helpful to know the reset occurs regardless of microSD presence. |
It seems to be a BT issue? |
Yes it is possible to send RTCM over USB. WiFi would be far more complicated than BT. |
Considering that the problem also occurs without BT connection (but with SD card), one test would be to completely shut down the BT on the ESP32. Is it possible with the present firmware? Or could you add such an option? Anyhow, it would be interesting for people only willing to do post-processing. |
@Eric-FR - I missed your earlier msg about BT being turned off as well. I can replicate. The reset is very odd in that the ESP32 doesn't report its stack trace. Is the stack trace getting corrupted by a non-thread-safe task?:
|
It seems the Lefebure from my Android causes a significant amount of SPP congestion that the desktop Lefebure does not. I suspect it is the different Bluetooth chipsets (USB dongle vs. Phone). Ah! Ok I think perhaps I have it. The read and write buffers were set to 4k whereas the SPP buffer was set to 2k. When transmitting larger data streams (i.e., RAWX and SFRBX) the SPP connection would become overwhelmed. By lowering the UART buffer size to 2k to match the SPP buffer, the SPP congestion disappears! Other changes:
Please give the latest |
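To make the 4k vs 2k mismatch concrete, here is a toy model, not the firmware code. Everything in it is an assumption for illustration: the function names, the idea that one whole UART buffer is handed over per tick, the drain rate, and the choice to count overflow as dropped bytes (the real SPP stack may stall or report congestion instead of dropping).

```python
def split_bursts(total_bytes, uart_buffer):
    """Split a stream into the bursts a UART buffer of this size would hand over."""
    full, rest = divmod(total_bytes, uart_buffer)
    return [uart_buffer] * full + ([rest] if rest else [])

def simulate(bursts, spp_capacity, drain_per_tick):
    """Offer one burst per tick to a bounded SPP buffer; return bytes that did not fit."""
    queued = 0
    dropped = 0
    for burst in bursts:
        accepted = min(burst, spp_capacity - queued)
        dropped += burst - accepted
        queued += accepted
        queued -= min(queued, drain_per_tick)  # radio sends what it can this tick
    return dropped

TOTAL = 8192        # bytes of RAWX/SFRBX-like traffic (made-up figure)
SPP_BUFFER = 2048   # the 2k SPP buffer mentioned above
DRAIN = 2048        # assumed bytes sent over the air per tick

overflow_4k = simulate(split_bursts(TOTAL, 4096), SPP_BUFFER, DRAIN)
overflow_2k = simulate(split_bursts(TOTAL, 2048), SPP_BUFFER, DRAIN)
print("4k UART buffer:", overflow_4k, "bytes did not fit")
print("2k UART buffer:", overflow_2k, "bytes did not fit")
```

The only point of the model is that a burst larger than the receiving buffer can never fit in one go, whatever the average data rate is, which matches the observation that sizing the UART buffer to the SPP buffer removes the congestion.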
So, I ran a static test last night with a poor sky view. With µSD, with NMEA+RAWX recording at 1 Hz and RTCM from Lefebure on an Android smartphone. There was a suspicious event with the fix lost, but looking at the ubx log file and converting it to RINEX with RTKCONV, no raw data are missing. There was a single (1 s) jump of the instantaneous position 20 km away. So, it seems to be right. I will do a dynamic test this weekend. BTW, one may also reduce the load on the BT by not sending all messages to the smartphone. Indeed, I only need NMEA on the smartphone but all of the messages I asked for on the µSD. |
Thanks for reporting. I've been running a battery of tests and I'm getting lots of reboots. There's still a problem lurking. Setup: 4 Hz, 7 messages (NMEA 5 +RAW 2) enabled. Reboots with:
At this point I'm beginning to think it may be an I2C issue with the ZED-F9P that becomes worse as more messages are enabled. I've designed in a reset counter that is displayed on the OLED so it is much easier to see/catch reboots (rather than having to look at logs or terminal output). BT - yes, we could reduce BT load but there are a few issues. 1) That's not currently what is resetting my units so I'm less concerned about BT congestion. 2) What is sent over BT is also what is logged. I can design around this (and log all data over I2C) but I worry it will pose its own problems. I'll leave the BT traffic issue for another day :) |
My own logger is using the following design (at bottom with I2C): https://github.com/PaulZC/F9P_RAWX_Logger/blob/master/HARDWARE.md I2C is used for F9P configuration while UART1 is used for data transfer. I added a BT module on UART2 for incoming RTCM and NMEA output. It is very stable but the big drawback is that you can't connect to the Adalogger for configuration or whatever you want. (And I found on a forum that PaulZC indeed struggled to implement buffering of incoming data from the F9P, due to the high quantity). |
Yesterday, I did pedestrian mapping. The data were right. Yesterday evening, I upgraded to stable v1.5. Today, I drove on the highway and I observed several reboots. Indeed, the behaviour in the data is not exactly the same. It looks like the F9P didn't restart at the same time. I will elaborate on it later. |
After a weekend full of overnight tests, I think I've solved the underlying issue. NMEA should have been turned off on the I2C interface in this commit but was not. The solution was to be purely in UBX protocol on the I2C interface, and to turn off all NMEA messages from the I2C interface. If we need RTCM (for example during WiFi NTRIP broadcasting) then we re-enable NMEA+RTCM on I2C during broadcast. Because broadcast is 1Hz, and NMEA messages are disabled, I have not seen a unit reset during NTRIP broadcast over WiFi either. I am still seeing bytes dropped during logging. I would have hoped the priority increase on the record task would have aided the logging issue but it has not. I am going to leave this issue open for a bit. Please test with v1.6 of the firmware. |
Should I keep my test sample a couple more days to check the new firmware? I usually test on weekends. About my test last weekend, there are two new points:
|
Your workbench
Log of 1 hour and 26 min: a walk along a lake.
Log file sent by private mail |
I'm also getting numerous: I will look at the reset counter. |
Two more tests to go to work this night
For the return journey, 2 |
Another test in v1.6
Some corrupted sentences (with 2
|
Be careful not to mix issues. We are working on unexpected reboots on this issue, not necessarily corrupt sentences. That said, your feedback, pyrog, is excellent. Please keep it coming! I recommend you report your findings on this issue if it is similar enough, or start a new one. Please include your settings file or a note about 'same settings as xyz' so that I can try to replicate. Wrt unexpected reboots, I think I have a way forward. I've been testing v1.7 binaries for the last few days and have had good success. Two things changed: the way SPI was initiated and the ESP32 Arduino core was updated from v1.0.6 to v2.0.0. There were quite a few changes to the core that may help us but I suspect the real fix was at the IDF level. For this reboot issue, please begin testing with v1.7RC and let me know how you fare. |
The starting point is that some data corruptions correspond to reboots. So, it is easier to find reboots by looking at corrupted data. I will upgrade to v1.7RC. |
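As a rough sketch of how corrupted NMEA sentences could be picked out of a log, the standard NMEA checksum (XOR of every character between `$` and `*`, written as two hex digits) is enough. The helper names and the sample sentence body below are made up for illustration; only the checksum rule itself is standard.

```python
import string

def nmea_checksum(body):
    """XOR of all characters between '$' and '*', as defined by NMEA 0183."""
    value = 0
    for ch in body:
        value ^= ord(ch)
    return value

def nmea_sentence_ok(sentence):
    """True if the sentence is framed as $...*HH and HH matches the checksum."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        return False
    body, _, given = sentence[1:].rpartition("*")
    if len(given) != 2 or any(c not in string.hexdigits for c in given):
        return False
    return nmea_checksum(body) == int(given, 16)

def make_sentence(body):
    """Build a correctly framed sentence from a body (for testing only)."""
    return "${}*{:02X}".format(body, nmea_checksum(body))

# Hypothetical sentence body, not taken from any of the logs in this thread.
good = make_sentence("GNGGA,101500.00,,,,,0,00,99.99,,,,,,")
bad = good.replace("101500", "101501")  # one flipped digit, checksum now wrong

for line in (good, bad, "$GNGGA,1015$GNRMC,truncated"):
    print(nmea_sentence_ok(line), line)
```

Running a log through a filter like this and noting the times of the rejected lines gives candidate moments to compare against the reset counter. A failed checksum alone does not say whether the cause was a reboot or dropped bytes.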
After firmware upgrade, I made two tests with NMEA+RAWX logging:
|
I made several tests with v1.7RC and the default settings (NMEA at 4Hz).
|
Several small tests this afternoon with the same settings
|
Thank you for testing the latest RC. It looks like we've pinned down the reboot issue. There's still work to fix the malformed sentences but I'm putting this bug to bed. Please open a new issue if the issue persists. |
Express.
RTK_Surveyor_Firmware_v15RC-Aug13
I was doing some testing in static conditions with the Lefebure client on my smartphone. Then, I saw "No data" in Lefebure for several seconds, followed by GPS, DGPS, FloatRTK and RTK. This happened several times. Looking at the Express's screen, the Express was restarting each time: first the splash screen, then "Rover", and then back to recording.
The attached picture is the ubx file opened in RTKLib/RTKPlot as a solution. One can see missing data.
If we convert the ubx file to RINEX with RTKConv and look at it with RTKPlot as observation data, data are also missing from time to time.
In dynamic recordings, missing data were also observed (open field road).
UBX file zipped:
SFE_Express_210828_222211.zip
Express settings:
SFE_Express_Settings.txt
Investigations are under way to check if the same problem was present in earlier recordings.