
Ethernet instable, kind of deadlocking the controller; Nucleo F429ZI #3

Closed
pulli7 opened this issue Sep 12, 2017 · 10 comments · Fixed by #5

Comments

pulli7 commented Sep 12, 2017

I'm experiencing some issues while using the F429ZI as an HTTP server (a quite simple application in which a few controller pins are set according to HTTP requests from clients).

It generally does what it is meant to do, but at some point clients can no longer reach the server. If the client is a web browser, opening a new tab solves this temporarily, but sooner or later the controller locks up: it is completely unreachable, and even code in the main loop no longer seems to be executed (tested by attaching a switch to one pin, which turns another GPIO on and off).

There does not seem to be a specific number of requests or clients that leads to failure. Sometimes two or three requests from a single client already cause a complete lock-up, sometimes it takes a few thousand requests from different clients, but at some point it always fails.
One strange thing is that requests from a web browser seem to lead to failure much faster than requests from other applications like curl, especially when multiple clients are involved. Maybe some timing issue?

The behavior can be reproduced with the server example from the library. Just starting the server and then having, for example, five or six curl clients send requests to it every few seconds makes it fail quite fast most of the time, especially when also sending requests manually from a web browser.
I used this simple batch script for testing:

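REM Send 10000 requests to the board, one after another
for /L %%N IN (1, 1, 10000) DO (
    echo Number %%N
    t:\curl.exe 10.66.22.124
    REM Pinging localhost twice acts as a delay of about one second
    ping -n 2 127.0.0.1 > NUL
)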

I hope there is someone who can also reproduce this and is familiar with the ethernet implementation, because I absolutely have no clue how to track this down...

fpistm assigned ghost Sep 12, 2017
fpistm added the bug label Sep 12, 2017

ghost commented Sep 13, 2017

I can confirm the bug, but it is hard to reproduce.

If you have time to start tracking it down, you can enable the LwIP debug trace:

  • See LwIP/src/lwip/debug.h and LwIP/src/lwip/opt.h for the available debug flags
  • Add the debug flags to variants/NUCLEO_F429ZI/Ethernet/lwipopts.h (LWIP_DEBUG and TCP_DEBUG to begin with)
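
As a rough illustration of the second bullet, the debug section of lwipopts.h might look like the fragment below. This is only a sketch: the macro names come from LwIP's opt.h and debug.h, but enabling MEMP_DEBUG and PBUF_DEBUG in addition to TCP_DEBUG is an assumption, as are the level and type settings. Whether anything is actually printed depends on how the port defines LWIP_PLATFORM_DIAG.

```c
/* variants/NUCLEO_F429ZI/Ethernet/lwipopts.h, debug section (sketch only) */

#define LWIP_DEBUG          1                   /* master switch for debug output */
#define LWIP_DBG_MIN_LEVEL  LWIP_DBG_LEVEL_ALL  /* assumption: do not filter by severity */
#define LWIP_DBG_TYPES_ON   LWIP_DBG_ON         /* assumption: allow all message types */

#define TCP_DEBUG           LWIP_DBG_ON         /* TCP state machine messages */
#define MEMP_DEBUG          LWIP_DBG_ON         /* assumption: memory pool messages */
#define PBUF_DEBUG          LWIP_DBG_ON         /* assumption: packet buffer messages */
```

Enabling every module at once produces a lot of serial output, which can itself change timing, so it is probably better to start with one or two flags.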

My time is limited at the moment, but I will take a closer look as soon as possible.

pulli7 commented Sep 20, 2017

Meanwhile I took a look at the LwIP debug output. It looks like some kind of memory leak:
MEMP complains about memp_malloc: out of memory in pool PBUF_POOL as soon as the controller becomes unresponsive.
I have to admit that I have no knowledge of the internals of TCP communication at all, so I could be completely wrong here.
PBUF also prints a message at that point. I am not sure what it means, but it looks like a buffer overflow or a problem clearing a buffer:

pbuf_free(0x8edfe)
pbuf_free: 0x8edfe has ref 65534, ending here.
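
One possible reading of the value 65534, offered as a hedged guess rather than a diagnosis: if the pbuf reference counter is a 16-bit unsigned integer (an assumption about this LwIP version), then 65534 is what the counter holds after being decremented two times more often than it was incremented, which would point towards a buffer being freed too often. The helper names below are hypothetical and only demonstrate the wrap-around arithmetic.

```c
#include <stdint.h>

/* Hypothetical stand-in for dropping one reference on a 16-bit counter.
 * Unsigned arithmetic wraps modulo 65536, so releasing at 0 yields 65535. */
uint16_t release_ref(uint16_t ref)
{
    return (uint16_t)(ref - 1u);
}

/* How many releases below zero would produce the observed counter value. */
unsigned extra_releases(uint16_t observed)
{
    return (unsigned)(uint16_t)(0u - observed);
}
```

Here release_ref(release_ref(0)) evaluates to 65534, and extra_releases(65534) evaluates to 2.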

LogFiles of debug output:
TCP_Log.txt
TCP_MEMP_log.txt
PBUF_Log.txt

I also noticed that the bug occurs when using a single TCP connection that is connected once and then held open, so it does not seem to be related to the process of connecting and disconnecting clients.

ghost commented Sep 22, 2017

Thank you @pulli7 for your useful log files.

I reproduced this issue too, with the same message before the crash: memp_malloc: out of memory in pool PBUF_POOL
What I can suggest is to increase the memory allocated to LwIP in lwipopts.h, in the sections
/* ---------- Memory options ---------- */ and /* ---------- Pbuf options ---------- */. You can "play" with MEM_SIZE, MEMP_NUM_PBUF and PBUF_POOL_SIZE.
I can't help you more than that because the right values depend on your application.
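
For orientation, the three options could be raised along the lines of the fragment below. The numbers are purely illustrative assumptions, not the defaults shipped with this core and not recommended values. They have to fit into the RAM that is actually available on the board.

```c
/* lwipopts.h (sketch only, all values are hypothetical) */

/* ---------- Memory options ---------- */
#define MEM_SIZE        (16 * 1024)  /* size of the LwIP heap in bytes */
#define MEMP_NUM_PBUF   32           /* number of pbuf structs from the memp pool */

/* ---------- Pbuf options ---------- */
#define PBUF_POOL_SIZE  24           /* number of buffers in PBUF_POOL */
```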

Let me know if you have fixed this issue.

pulli7 commented Sep 22, 2017

Just to be clear: I first discovered the bug in my own application, but everything I describe here was done with the WebServer example that comes with the library! The only modifications were changing the IP address and, for the logs, removing the serial prints.
It is definitely a general issue, not limited to my own sketch.

I did some testing with the memory options. They have some impact on the behavior, but I was not able to solve the bug.
The average time until a crash occurs increases significantly when the values are raised (PBUF_POOL_SIZE seems to have the biggest impact).
But even with extreme settings, like making all three values four times larger than the defaults, I still often get cases where the crash occurs after fewer than 100 requests.

ghost commented Sep 26, 2017

I upgraded the LwIP library to version 2.0.3.
Can you test with the new version? here

Furthermore, just for information, a link about the memory configuration of the LwIP stack.

pulli7 commented Sep 26, 2017

Thank you for your effort to solve this issue.
I will not be able to get my hands on the board during the next two weeks, as I am in another place right now. But I will definitely do some more detailed testing with the new version, and report back here, as soon as I return.

pulli7 commented Oct 9, 2017

I tested LwIP 2.0.3 with the WebServer sketch from the library. Sadly, the behaviour is the same as before.

I then enabled stats_display(), as described in the information about LwIP's memory configuration, to take a closer look at the buffers. There seems to be a problem with MEM TCP_PCB: it fills up completely very quickly. Maybe a problem with closing connections? Just guessing here.

When I continue to send requests after MEM TCP_PCB is full, the heap also fills up completely. Its usage even exceeds the configured maximum, likely corrupting the RAM until something critical is hit and the controller crashes.
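
For anyone who wants to repeat this measurement, a minimal sketch of the statistics options is shown below. The macro names are LwIP's own, but which of them this port already sets is an assumption. With them enabled, stats_display() (declared in lwip/stats.h) can be called periodically from the sketch, and it prints through whatever LWIP_PLATFORM_DIAG is mapped to.

```c
/* lwipopts.h (sketch only) */

#define LWIP_STATS          1  /* collect statistics */
#define LWIP_STATS_DISPLAY  1  /* compile in the stats_display() output functions */
#define MEM_STATS           1  /* heap usage: used, max, err */
#define MEMP_STATS          1  /* per-pool usage, e.g. TCP_PCB and PBUF_POOL */
```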

Logfile: (only one single Google Chrome tab used as client)
LogStat_display.txt

I tried giving LwIP more heap and increased the MEMP_NUM_TCP_PCB define to see whether the growing memory usage stops at some point, but that does not seem to happen. MEM TCP_PCB always fills up, no matter how many simultaneous connections I allow. And even a fairly large heap, such as 50 kB, overflows after some time and leads to a crash:

Logfile: (again one single Google Chrome tab as a client)
Log_Heap=50kB_TCP_PCB=20.txt

One other thing I noticed, which I do not think is relevant to the issue but want to mention:
Setting really large values for MEM_SIZE does not work, as it leads to problems with memory allocation.
I get the error: mem_malloc: could not allocate _xx_ bytes
Logfile:
Log_Heap=140kB_TCp_PCP=50.txt

ghost mentioned this issue Oct 16, 2017

ghost commented Oct 16, 2017

Hi @pulli7

It took some time, but I think I have something that can resolve your issue.
Could you please try PR #5?

With this fix, MEM HEAP no longer keeps increasing until the crash.

One more clarification: MEM TCP_PCB increases until it reaches its maximum, but this is normal because the TCP stack needs about 2 minutes before it deletes a pcb. If the LwIP stack needs to allocate a new pcb, an old one will be removed sooner (if it is in the CLOSED state).
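
A back-of-the-envelope sketch of why a full pcb pool is expected under this test load. It assumes LwIP's stock TCP_MSL of 60000 ms (this port may override it), so that a closed connection lingers for 2 × MSL, which matches the "about 2 minutes" above. The function names are hypothetical and the model ignores early recycling of old pcbs.

```c
/* Assumed value: LwIP's stock maximum segment lifetime in milliseconds. */
#define TCP_MSL_MS 60000UL

/* Time a closed connection's pcb lingers before the stack deletes it. */
unsigned long linger_time_ms(void)
{
    return 2UL * TCP_MSL_MS;
}

/* Approximate number of lingering pcbs in steady state when one
 * connection is closed every interval_ms (interval_ms must be nonzero). */
unsigned long lingering_pcbs(unsigned long interval_ms)
{
    return linger_time_ms() / interval_ms;
}

/* Nonzero if the lingering pcbs alone would fill a pool of pool_size. */
int pool_saturated(unsigned long interval_ms, unsigned long pool_size)
{
    return lingering_pcbs(interval_ms) >= pool_size;
}
```

With one request per second, roughly the rate of the batch script in the first comment, lingering_pcbs(1000) is 120, so a pool of 20 entries stays full and the stack has to recycle old pcbs whenever a new connection arrives.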

Please keep me informed.

pulli7 commented Oct 18, 2017

Thank you very much @fprwi6labs !

I tested it this morning, and it is now perfectly stable!
With the WebServer example and the default settings in lwipopts.h, heap usage consistently stays below 2 kB for me.

The issue can be marked as resolved.

fpistm commented Oct 18, 2017

Thanks @pulli7 for your tests and @fprwi6labs for the fix.
I will merge it once the cb mechanism has been removed from the PR.

fpistm closed this as completed in #5 Oct 20, 2017
fpistm added a commit that referenced this issue Oct 20, 2017