Open
Description
Platform: Giga R1
Arduino Core Mbed 4.1.5
HttpClient: 0.6.1
Here's a sequence of events with a crudely instrumented MbedClient. We connect using ArduinoHttpClient every 1s. Every 5th connection fails with error -3005 from static_cast<TCPSocket *>(sock)->open(getNetwork());
in the MbedClient::connect() call
. When it fails it takes 30s to return. This is in version 4.1.5. In version 4.1.1 it would never recover and continue to fail until device reboot. There would be no 30s delay. This issue occurs on any http server I've tried. I can continue to connect to the server in the code with python on my Mac with no problems. If I extend the delay so I connect to the server once every 5s the problem goes away. At 4s it reappears.
22:10:12.229 -> Fetching content length from httpforever.com:80
22:10:12.393 -> Return code from connect0
22:10:13.213 -> took=993ms length: 5124
22:10:15.229 -> Fetching content length from httpforever.com:80
22:10:15.394 -> Return code from connect0
22:10:16.216 -> took=979ms length: 5124
22:10:18.198 -> Fetching content length from httpforever.com:80
22:10:18.365 -> Return code from connect0
22:10:19.225 -> took=1008ms length: 5124
22:10:21.203 -> Fetching content length from httpforever.com:80
22:10:21.366 -> Return code from connect0
22:10:22.222 -> took=1004ms length: 5124
22:10:24.202 -> Fetching content length from httpforever.com:80
22:10:24.202 -> Return code from TCPSocket::open-3005
22:10:54.295 -> took=30099ms length: -1
code
#include <Arduino.h>
#include <WiFi.h>
#include <HttpClient.h>
void setup() {
Serial.begin(115200);
delay(1000);
Serial.println("HI");
// put your setup code here, to run once:
auto wifi = WiFiInterface::get_default_instance();
wifi->set_dhcp(true);
wifi->set_credentials("ZZZZ", "ZZZZ", nsapi_security_t::NSAPI_SECURITY_WPA2);
Serial.println(wifi->connect());
}
void loop() {
// put your main code here, to run repeatedly:
WiFiClient client;
HttpClient httpClient(client, "httpforever.com", 80);
Serial.println("Fetching content length from httpforever.com:80");
auto then = millis();
httpClient.beginRequest();
httpClient.get("/");
httpClient.endRequest();
auto len = httpClient.contentLength(); // Doesn't read all content
Serial.print("took="); Serial.print((millis() - then)); Serial.print("ms "); Serial.print("length: "); Serial.println(len);
delay(5000);
}
Metadata
Metadata
Assignees
Labels
No labels
Activity
schnoberts1 commentedon Aug 15, 2024
... at 4.5s delay it takes 8 calls before it fails, an interesting multiple of 4. Any hints? I tried moving WifiClient globally but that swaps -3005 errors for -3003 errors on the first re-use:
31s timeout this time.
Since I've included the code this should replicate well.
schnoberts1 commentedon Aug 15, 2024
... I am also guessing this 30s timeout is in HttpClient since there's no delay on the return from the failed connect.
schnoberts1 commentedon Aug 16, 2024
Note: calling client.stop() does not resolve the issue
stopandchatchfire commentedon Sep 5, 2024
The maximum number of TCP connections is 4.
As defined in the file: "mbed_config.h"
#define MBED_CONF_LWIP_TCP_SOCKET_MAX 4
I don't understand why it's so low. Need to recompile locally to have more connections available.
schnoberts1 commentedon Sep 5, 2024
I never have 4 tcp sockets open but I am confident this limit is at the core of the issue. Some clean up isn’t being executed. Increasing the socket count isn’t going to help though as it just puts the issue off.
JAndrassy commentedon Sep 9, 2024
@schnoberts1 can you try these changes? https://github.com/arduino/ArduinoCore-mbed/pull/912/files
schnoberts1 commentedon Sep 9, 2024
Of course @JAndrassy. I've applied the patch and the issue still remains.
pennam commentedon Oct 1, 2024
@schnoberts1 the 30s timeout comes from the Http library, you can get rid of it adding
httpClient.setHttpResponseTimeout(1000);
to your sketch.I'm still investigating about the
NSAPI_ERROR_NO_SOCKET
return value but i agree it is related to theMBED_CONF_LWIP_TCP_SOCKET_MAX
define and in the way lwip TCP connections are deleted https://github.com/arduino/mbed-os/blob/e0cad5c4277b3c5ee9cb01a9529df9333612bbeb/connectivity/lwipstack/lwip/src/api/lwip_api_lib.c#L233It looks that before complete deletion they can be put in a waitstate and closed afterwards
https://github.com/arduino/mbed-os/blob/e0cad5c4277b3c5ee9cb01a9529df9333612bbeb/connectivity/lwipstack/lwip/src/api/lwip_api_msg.c#L913
JAndrassy commentedon Oct 1, 2024
does the HttpClient send the "Connection: close" header?
pennam commentedon Oct 1, 2024
@JAndrassy yes HttpClient sends the "Connection: close" header.
I think I've found the rootcause, and it is server side, in fact
httpforever.com
keeps sendingTCP Keep-Alive
messages, and thus connections are not closed. Using a different server likeexample.org
connections are closed correctly and theNSAPI_ERROR_NO_SOCKET
error is not generated..schnoberts1 commentedon Oct 1, 2024
pennam commentedon Oct 1, 2024
Here is what i get with wireshark, I'm now thinking if the TCP Window Full messages are somehow involved
wireshark.zip
pennam commentedon Oct 1, 2024
I get the same
NSAPI_ERROR_NO_SOCKET
using anothe http sitesneaindia.com
TCP Window Full
messages are also present.
snea.pcapng.zip
schnoberts1 commentedon Oct 1, 2024
5 remaining items
schnoberts1 commentedon Oct 1, 2024
megacct commentedon Oct 1, 2024
FYI, In addition to the mbed conf settings mentioned here #966 (comment), I've also set
compiler.c.extra_flags=-DSOCKET_BUFFER_SIZE=2048
compiler.cpp.extra_flags=-DSOCKET_BUFFER_SIZE=2048
in my platform local
megacct commentedon Oct 1, 2024
Hi @schnoberts1, I ran your sketch on my setup for 5mins with 5 sec delay, 5mins with 1 sec delay and for 2mins with 100ms delay - output attached.
output.zip
100ms
20:48:23.390 -> HI
20:48:29.958 -> 0
20:48:29.958 -> Fetching content length from httpforever.com:80
20:48:31.140 -> took=1178ms length: 5124
20:48:31.233 -> Fetching content length from httpforever.com:80
20:48:31.852 -> took=645ms length: 5124
20:48:31.947 -> Fetching content length from httpforever.com:80
20:48:32.608 -> took=641ms length: 5124
20:48:32.702 -> Fetching content length from httpforever.com:80
20:48:33.361 -> took=649ms length: 5124
20:48:33.455 -> Fetching content length from httpforever.com:80
20:48:34.117 -> took=646ms length: 5124
20:48:34.212 -> Fetching content length from httpforever.com:80
20:48:34.825 -> took=642ms length: 5124
20:48:34.966 -> Fetching content length from httpforever.com:80
20:48:35.579 -> took=645ms length: 5124
20:48:35.673 -> Fetching content length from httpforever.com:80
20:48:36.337 -> took=650ms length: 5124
20:48:36.432 -> Fetching content length from httpforever.com:80
20:48:37.094 -> took=647ms length: 5124
20:48:37.189 -> Fetching content length from httpforever.com:80
20:48:37.851 -> took=650ms length: 5124
20:48:37.946 -> Fetching content length from httpforever.com:80
20:48:38.608 -> took=652ms length: 5124
20:48:38.702 -> Fetching content length from httpforever.com:80
20:48:39.364 -> took=671ms length: 5124
20:48:39.460 -> Fetching content length from httpforever.com:80
20:48:40.120 -> took=654ms length: 5124
20:48:40.215 -> Fetching content length from httpforever.com:80
20:48:40.877 -> took=647ms length: 5124
20:48:40.972 -> Fetching content length from httpforever.com:80
20:48:41.628 -> took=645ms length: 5124
20:48:41.723 -> Fetching content length from httpforever.com:80
20:48:42.385 -> took=665ms length: 5124
20:48:42.481 -> Fetching content length from httpforever.com:80
20:48:43.144 -> took=651ms length: 5124
20:48:43.240 -> Fetching content length from httpforever.com:80
20:48:43.903 -> took=648ms length: 5124
20:48:43.998 -> Fetching content length from httpforever.com:80
20:48:44.659 -> took=654ms length: 5124
20:48:44.752 -> Fetching content length from httpforever.com:80
20:48:45.367 -> took=644ms length: 5124
20:48:45.460 -> Fetching content length from httpforever.com:80
20:48:46.123 -> took=647ms length: 5124
20:48:46.217 -> Fetching content length from httpforever.com:80
20:48:46.879 -> took=646ms length: 5124
20:48:46.975 -> Fetching content length from httpforever.com:80
20:48:47.636 -> took=649ms length: 5124
20:48:47.731 -> Fetching content length from httpforever.com:80
20:48:48.392 -> took=644ms length: 5124
20:48:48.485 -> Fetching content length from httpforever.com:80
20:48:49.147 -> took=653ms length: 5124
20:48:49.241 -> Fetching content length from httpforever.com:80
20:48:49.855 -> took=650ms length: 5124
20:48:49.997 -> Fetching content length from httpforever.com:80
20:48:50.610 -> took=654ms length: 5124
20:48:50.751 -> Fetching content length from httpforever.com:80
20:48:51.364 -> took=646ms length: 5124
20:48:51.459 -> Fetching content length from httpforever.com:80
20:48:52.119 -> took=645ms length: 5124
20:48:52.214 -> Fetching content length from httpforever.com:80
20:48:52.876 -> took=650ms length: 5124
20:48:52.971 -> Fetching content length from httpforever.com:80
20:48:53.633 -> took=650ms length: 5124
20:48:53.728 -> Fetching content length from httpforever.com:80
20:48:54.389 -> took=645ms length: 5124
20:48:54.483 -> Fetching content length from httpforever.com:80
20:48:55.141 -> took=654ms length: 5124
20:48:55.233 -> Fetching content length from httpforever.com:80
20:48:55.897 -> took=652ms length: 5124
20:48:55.992 -> Fetching content length from httpforever.com:80
20:48:56.652 -> took=649ms length: 5124
20:48:56.746 -> Fetching content length from httpforever.com:80
20:48:57.360 -> took=651ms length: 5124
20:48:57.501 -> Fetching content length from httpforever.com:80
20:48:58.119 -> took=646ms length: 5124
20:48:58.260 -> Fetching content length from httpforever.com:80
20:48:58.876 -> took=654ms length: 5124
20:48:59.018 -> Fetching content length from httpforever.com:80
20:48:59.674 -> took=653ms length: 5124
20:48:59.768 -> Fetching content length from httpforever.com:80
schnoberts1 commentedon Oct 1, 2024
This makes sense as I think the increased MSS results in the issue being masked.
megacct commentedon Oct 1, 2024
Just re-ran your sketch (even thoguh after bedtime) with stock mbedlib and just the SOCKET_BUFFER_SIZE override. It works. Please try
p.s. should probably be larger and declared static in arduino::MbedClient::readSocket for better perf but I've other things on the go
schnoberts1 commentedon Oct 1, 2024
pennam commentedon Oct 2, 2024
just increasing
SOCKET_BUFFER_SIZE
seems to not fix the issue in my casemegacct commentedon Oct 2, 2024
My apologies, you are correct. It must not have used the stock lib (it was late)
schnoberts1 commentedon Oct 2, 2024
I looked at the packet capture (increased MSS). The increased MSS size results in a larger window being advertised by the Giga. It advertises a 5840 byte window which is larger than the payload that's being sent to it. We get no ZeroWindow. It also means that the http server can send the last bit of data with the FIN state set so when the RST/ACK is sent the socket is already presumably half closed as the Giga got a FIN.
Without this increase the server can't send all the data, the FIN state is not sent, the RST/ACK is sent on an open connection. The Server doesn't abort its side of the connection and sends Keep Alives (which I don't think is right, but the TCP stack is supposed to behave well when things aren't quire right). The Giga's stack is processing these keep alives despite the fact it's aborted the connection. I don't think this is correct. The result is the socket being held open incorrectly. I propose this is the issue and what we see with increased MSS (etc) is a mask. I can't build a custom mbed build at the moment, but I imagine testing this hypothesis would involve an http server with keep-alive that responds like the one I chose, but sends a much larger http body content, triggering the zero window, no FIN etc.
To sort of demonstrate this, if I don't just call contentLength() but instead drain the connection by calling responseBody() then the problem goes away, because the FIN can be sent, RST is on a half closed socket and the Keep Alives aren't set.
The reason I keep coming back to this 'MSS is a mask' is because I think there's a defect here and the defect isn't "wrong MSS" but rather wrong behaviour under the specific situations the small MSS creates relative to this web server.
Does that make sense?
You might ask why I care? Well, it's because the stack is not freeing resources when it should (in my opinion) and the internet is full of servers that do weird things TCP wise and I want my system to be robust in the face of those and I'm not convinced that the MSS is really addressing the core issue.
That being said, a more standard MSS for a machine like the Giga which has plenty of memory does seem like a good change irrespective.
pennam commentedon Oct 2, 2024
it makes perfectly sense, I've done other tests with the increased MSS and trying to download bigger files... know what? the issue comes back again, so i agree with you increasing MSS is masking the issue and is not the fix.
schnoberts1 commentedon Oct 2, 2024
I think at this stage it might be worth confirming with the LWIP crowd what they expect to happen in the situation given the PCAP showing the failure mode. Maybe I am wrong and this is the expected behaviour and it's fine because there's a timeout that handles it eventually.
To some extent I think you could triage this issue away if you were inclined to on the basis of:
... but it'd be nice to get a response from the LWIP community (I assume this is where the issue might lie, or perhaps not.... skimming the LWIP code it does look like resource freeing happens on a RST and also shortly after in its background thread that cleans up dead connections - that's an every 0.5s/0.25s interval thread though, not a 5s interval).
I should probably change the title of this ticket to be clearer about the situation.
BTW: Obv when I fully drain the connection by calling responseBody() on HttpClient the problem doesn't occur.
[-]Every 5th TCP connection fails on Arduino Giga R1.[/-][+]Every 5th TCP connection fails on Arduino Giga R1 if we don't fully read the HTTP content sent from the server each time[/+]schnoberts1 commentedon Oct 2, 2024
... I guess the counter argument is that there are plenty of situations where you might want to abandon your connection while the server is still writing data, e.g. validation of content fails but we have a workaround, always consume everything up until you think it's safe to close the connection (i.e. you are likely caught up), even in that situation. Obv won't work for streaming data I imagine.
schnoberts1 commentedon Oct 2, 2024
Also, I do appreciate the effort put into looking at this. It's helped a lot.