Skip to content

Every 5th TCP connection fails on Arduino Giga R1 if we don't fully read the HTTP content sent from the server each time #937

Open
@schnoberts1

Description

@schnoberts1

Platform: Giga R1
Arduino Core Mbed 4.1.5
HttpClient: 0.6.1

Here's a sequence of events with a crudely instrumented MbedClient. We connect using ArduinoHttpClient every 1s. Every 5th connection fails with error -3005 from static_cast<TCPSocket *>(sock)->open(getNetwork()); in the MbedClient::connect() call. When it fails it takes 30s to return. This is in version 4.1.5. In version 4.1.1 it would never recover and continue to fail until device reboot. There would be no 30s delay. This issue occurs on any http server I've tried. I can continue to connect to the server in the code with python on my Mac with no problems. If I extend the delay so I connect to the server once every 5s the problem goes away. At 4s it reappears.

22:10:12.229 -> Fetching content length from httpforever.com:80
22:10:12.393 -> Return code from connect0
22:10:13.213 -> took=993ms length: 5124
22:10:15.229 -> Fetching content length from httpforever.com:80
22:10:15.394 -> Return code from connect0
22:10:16.216 -> took=979ms length: 5124
22:10:18.198 -> Fetching content length from httpforever.com:80
22:10:18.365 -> Return code from connect0
22:10:19.225 -> took=1008ms length: 5124
22:10:21.203 -> Fetching content length from httpforever.com:80
22:10:21.366 -> Return code from connect0
22:10:22.222 -> took=1004ms length: 5124
22:10:24.202 -> Fetching content length from httpforever.com:80
22:10:24.202 -> Return code from TCPSocket::open-3005
22:10:54.295 -> took=30099ms length: -1

code

#include <Arduino.h>
#include <WiFi.h>
#include <HttpClient.h>

void setup() {
  Serial.begin(115200);
  delay(1000);
  Serial.println("HI");
  // put your setup code here, to run once:
  auto wifi = WiFiInterface::get_default_instance();
  wifi->set_dhcp(true);
  wifi->set_credentials("ZZZZ", "ZZZZ", nsapi_security_t::NSAPI_SECURITY_WPA2);
  Serial.println(wifi->connect());
}

void loop() {
  // put your main code here, to run repeatedly:
  WiFiClient client;
 
   HttpClient httpClient(client, "httpforever.com", 80);
   Serial.println("Fetching content length from httpforever.com:80");
   auto then = millis();
   httpClient.beginRequest();
   httpClient.get("/");
   httpClient.endRequest();
   auto len = httpClient.contentLength(); // Doesn't read all content
   Serial.print("took="); Serial.print((millis() - then)); Serial.print("ms "); Serial.print("length: "); Serial.println(len);
   delay(5000);
}

Activity

schnoberts1

schnoberts1 commented on Aug 15, 2024

@schnoberts1
Author

... at 4.5s delay it takes 8 calls before it fails, an interesting multiple of 4. Any hints? I tried moving WifiClient globally but that swaps -3005 errors for -3003 errors on the first re-use:

22:25:36.246 -> Fetching content length from httpforever.com:80
22:25:36.448 -> Return code from connect0
22:25:37.243 -> took=1001ms length: 5124
22:25:38.260 -> Fetching content length from httpforever.com:80
22:25:38.260 -> Return code from TCPSocket::open-3003
22:26:09.251 -> took=31000ms length: -1

31s timeout this time.

Since I've included the code this should replicate well.

schnoberts1

schnoberts1 commented on Aug 15, 2024

@schnoberts1
Author

... I am also guessing this 30s timeout is in HttpClient since there's no delay on the return from the failed connect.

schnoberts1

schnoberts1 commented on Aug 16, 2024

@schnoberts1
Author

Note: calling client.stop() does not resolve the issue

stopandchatchfire

stopandchatchfire commented on Sep 5, 2024

@stopandchatchfire

The maximum number of TCP connections is 4.

As defined in the file: "mbed_config.h"

#define MBED_CONF_LWIP_TCP_SOCKET_MAX 4

I don't understand why it's so low. Need to recompile locally to have more connections available.

schnoberts1

schnoberts1 commented on Sep 5, 2024

@schnoberts1
Author

I never have 4 tcp sockets open but I am confident this limit is at the core of the issue. Some clean up isn’t being executed. Increasing the socket count isn’t going to help though as it just puts the issue off.

JAndrassy

JAndrassy commented on Sep 9, 2024

@JAndrassy
Contributor
schnoberts1

schnoberts1 commented on Sep 9, 2024

@schnoberts1
Author

Of course @JAndrassy. I've applied the patch and the issue still remains.

pennam

pennam commented on Oct 1, 2024

@pennam
Contributor

@schnoberts1 the 30s timeout comes from the Http library, you can get rid of it adding httpClient.setHttpResponseTimeout(1000); to your sketch.

I'm still investigating about the NSAPI_ERROR_NO_SOCKET return value but i agree it is related to the MBED_CONF_LWIP_TCP_SOCKET_MAX define and in the way lwip TCP connections are deleted https://github.com/arduino/mbed-os/blob/e0cad5c4277b3c5ee9cb01a9529df9333612bbeb/connectivity/lwipstack/lwip/src/api/lwip_api_lib.c#L233

It looks that before complete deletion they can be put in a waitstate and closed afterwards

https://github.com/arduino/mbed-os/blob/e0cad5c4277b3c5ee9cb01a9529df9333612bbeb/connectivity/lwipstack/lwip/src/api/lwip_api_msg.c#L913

JAndrassy

JAndrassy commented on Oct 1, 2024

@JAndrassy
Contributor

does the HttpClient send the "Connection: close" header?

pennam

pennam commented on Oct 1, 2024

@pennam
Contributor

@JAndrassy yes HttpClient sends the "Connection: close" header.

I think I've found the rootcause, and it is server side, in fact httpforever.com keeps sending TCP Keep-Alive messages, and thus connections are not closed. Using a different server like example.org connections are closed correctly and the NSAPI_ERROR_NO_SOCKET error is not generated..

schnoberts1

schnoberts1 commented on Oct 1, 2024

@schnoberts1
Author
pennam

pennam commented on Oct 1, 2024

@pennam
Contributor

Here is what i get with wireshark, I'm now thinking if the TCP Window Full messages are somehow involved
wireshark.zip

pennam

pennam commented on Oct 1, 2024

@pennam
Contributor

I get the same NSAPI_ERROR_NO_SOCKET using anothe http site sneaindia.com TCP Window Full
messages are also present.
snea.pcapng.zip

schnoberts1

schnoberts1 commented on Oct 1, 2024

@schnoberts1
Author

5 remaining items

schnoberts1

schnoberts1 commented on Oct 1, 2024

@schnoberts1
Author
megacct

megacct commented on Oct 1, 2024

@megacct
Contributor

FYI, In addition to the mbed conf settings mentioned here #966 (comment), I've also set
compiler.c.extra_flags=-DSOCKET_BUFFER_SIZE=2048
compiler.cpp.extra_flags=-DSOCKET_BUFFER_SIZE=2048
in my platform local

megacct

megacct commented on Oct 1, 2024

@megacct
Contributor

Hi @schnoberts1, I ran your sketch on my setup for 5mins with 5 sec delay, 5mins with 1 sec delay and for 2mins with 100ms delay - output attached.
output.zip

100ms

20:48:23.390 -> HI
20:48:29.958 -> 0
20:48:29.958 -> Fetching content length from httpforever.com:80
20:48:31.140 -> took=1178ms length: 5124
20:48:31.233 -> Fetching content length from httpforever.com:80
20:48:31.852 -> took=645ms length: 5124
20:48:31.947 -> Fetching content length from httpforever.com:80
20:48:32.608 -> took=641ms length: 5124
20:48:32.702 -> Fetching content length from httpforever.com:80
20:48:33.361 -> took=649ms length: 5124
20:48:33.455 -> Fetching content length from httpforever.com:80
20:48:34.117 -> took=646ms length: 5124
20:48:34.212 -> Fetching content length from httpforever.com:80
20:48:34.825 -> took=642ms length: 5124
20:48:34.966 -> Fetching content length from httpforever.com:80
20:48:35.579 -> took=645ms length: 5124
20:48:35.673 -> Fetching content length from httpforever.com:80
20:48:36.337 -> took=650ms length: 5124
20:48:36.432 -> Fetching content length from httpforever.com:80
20:48:37.094 -> took=647ms length: 5124
20:48:37.189 -> Fetching content length from httpforever.com:80
20:48:37.851 -> took=650ms length: 5124
20:48:37.946 -> Fetching content length from httpforever.com:80
20:48:38.608 -> took=652ms length: 5124
20:48:38.702 -> Fetching content length from httpforever.com:80
20:48:39.364 -> took=671ms length: 5124
20:48:39.460 -> Fetching content length from httpforever.com:80
20:48:40.120 -> took=654ms length: 5124
20:48:40.215 -> Fetching content length from httpforever.com:80
20:48:40.877 -> took=647ms length: 5124
20:48:40.972 -> Fetching content length from httpforever.com:80
20:48:41.628 -> took=645ms length: 5124
20:48:41.723 -> Fetching content length from httpforever.com:80
20:48:42.385 -> took=665ms length: 5124
20:48:42.481 -> Fetching content length from httpforever.com:80
20:48:43.144 -> took=651ms length: 5124
20:48:43.240 -> Fetching content length from httpforever.com:80
20:48:43.903 -> took=648ms length: 5124
20:48:43.998 -> Fetching content length from httpforever.com:80
20:48:44.659 -> took=654ms length: 5124
20:48:44.752 -> Fetching content length from httpforever.com:80
20:48:45.367 -> took=644ms length: 5124
20:48:45.460 -> Fetching content length from httpforever.com:80
20:48:46.123 -> took=647ms length: 5124
20:48:46.217 -> Fetching content length from httpforever.com:80
20:48:46.879 -> took=646ms length: 5124
20:48:46.975 -> Fetching content length from httpforever.com:80
20:48:47.636 -> took=649ms length: 5124
20:48:47.731 -> Fetching content length from httpforever.com:80
20:48:48.392 -> took=644ms length: 5124
20:48:48.485 -> Fetching content length from httpforever.com:80
20:48:49.147 -> took=653ms length: 5124
20:48:49.241 -> Fetching content length from httpforever.com:80
20:48:49.855 -> took=650ms length: 5124
20:48:49.997 -> Fetching content length from httpforever.com:80
20:48:50.610 -> took=654ms length: 5124
20:48:50.751 -> Fetching content length from httpforever.com:80
20:48:51.364 -> took=646ms length: 5124
20:48:51.459 -> Fetching content length from httpforever.com:80
20:48:52.119 -> took=645ms length: 5124
20:48:52.214 -> Fetching content length from httpforever.com:80
20:48:52.876 -> took=650ms length: 5124
20:48:52.971 -> Fetching content length from httpforever.com:80
20:48:53.633 -> took=650ms length: 5124
20:48:53.728 -> Fetching content length from httpforever.com:80
20:48:54.389 -> took=645ms length: 5124
20:48:54.483 -> Fetching content length from httpforever.com:80
20:48:55.141 -> took=654ms length: 5124
20:48:55.233 -> Fetching content length from httpforever.com:80
20:48:55.897 -> took=652ms length: 5124
20:48:55.992 -> Fetching content length from httpforever.com:80
20:48:56.652 -> took=649ms length: 5124
20:48:56.746 -> Fetching content length from httpforever.com:80
20:48:57.360 -> took=651ms length: 5124
20:48:57.501 -> Fetching content length from httpforever.com:80
20:48:58.119 -> took=646ms length: 5124
20:48:58.260 -> Fetching content length from httpforever.com:80
20:48:58.876 -> took=654ms length: 5124
20:48:59.018 -> Fetching content length from httpforever.com:80
20:48:59.674 -> took=653ms length: 5124
20:48:59.768 -> Fetching content length from httpforever.com:80

schnoberts1

schnoberts1 commented on Oct 1, 2024

@schnoberts1
Author

This makes sense as I think the increased MSS results in the issue being masked.

megacct

megacct commented on Oct 1, 2024

@megacct
Contributor

Just re-ran your sketch (even thoguh after bedtime) with stock mbedlib and just the SOCKET_BUFFER_SIZE override. It works. Please try
p.s. should probably be larger and declared static in arduino::MbedClient::readSocket for better perf but I've other things on the go

schnoberts1

schnoberts1 commented on Oct 1, 2024

@schnoberts1
Author
pennam

pennam commented on Oct 2, 2024

@pennam
Contributor

just increasing SOCKET_BUFFER_SIZE seems to not fix the issue in my case

megacct

megacct commented on Oct 2, 2024

@megacct
Contributor

My apologies, you are correct. It must not have used the stock lib (it was late)

schnoberts1

schnoberts1 commented on Oct 2, 2024

@schnoberts1
Author

I looked at the packet capture (increased MSS). The increased MSS size results in a larger window being advertised by the Giga. It advertises a 5840 byte window which is larger than the payload that's being sent to it. We get no ZeroWindow. It also means that the http server can send the last bit of data with the FIN state set so when the RST/ACK is sent the socket is already presumably half closed as the Giga got a FIN.

Without this increase the server can't send all the data, the FIN state is not sent, the RST/ACK is sent on an open connection. The Server doesn't abort its side of the connection and sends Keep Alives (which I don't think is right, but the TCP stack is supposed to behave well when things aren't quire right). The Giga's stack is processing these keep alives despite the fact it's aborted the connection. I don't think this is correct. The result is the socket being held open incorrectly. I propose this is the issue and what we see with increased MSS (etc) is a mask. I can't build a custom mbed build at the moment, but I imagine testing this hypothesis would involve an http server with keep-alive that responds like the one I chose, but sends a much larger http body content, triggering the zero window, no FIN etc.

To sort of demonstrate this, if I don't just call contentLength() but instead drain the connection by calling responseBody() then the problem goes away, because the FIN can be sent, RST is on a half closed socket and the Keep Alives aren't set.

The reason I keep coming back to this 'MSS is a mask' is because I think there's a defect here and the defect isn't "wrong MSS" but rather wrong behaviour under the specific situations the small MSS creates relative to this web server.

Does that make sense?

You might ask why I care? Well, it's because the stack is not freeing resources when it should (in my opinion) and the internet is full of servers that do weird things TCP wise and I want my system to be robust in the face of those and I'm not convinced that the MSS is really addressing the core issue.

That being said, a more standard MSS for a machine like the Giga which has plenty of memory does seem like a good change irrespective.

pennam

pennam commented on Oct 2, 2024

@pennam
Contributor

Does that make sense?

it makes perfectly sense, I've done other tests with the increased MSS and trying to download bigger files... know what? the issue comes back again, so i agree with you increasing MSS is masking the issue and is not the fix.

schnoberts1

schnoberts1 commented on Oct 2, 2024

@schnoberts1
Author

I think at this stage it might be worth confirming with the LWIP crowd what they expect to happen in the situation given the PCAP showing the failure mode. Maybe I am wrong and this is the expected behaviour and it's fine because there's a timeout that handles it eventually.

To some extent I think you could triage this issue away if you were inclined to on the basis of:

  1. The connection is torn down eventually.
  2. Always read the entire stream from the server and if you don't then don't expect the stack to free resources in a timely fashion.
  3. It is a somewhat contrived example.

... but it'd be nice to get a response from the LWIP community (I assume this is where the issue might lie, or perhaps not.... skimming the LWIP code it does look like resource freeing happens on a RST and also shortly after in its background thread that cleans up dead connections - that's an every 0.5s/0.25s interval thread though, not a 5s interval).

I should probably change the title of this ticket to be clearer about the situation.

BTW: Obv when I fully drain the connection by calling responseBody() on HttpClient the problem doesn't occur.

changed the title [-]Every 5th TCP connection fails on Arduino Giga R1.[/-] [+]Every 5th TCP connection fails on Arduino Giga R1 if we don't fully read the HTTP content sent from the server each time[/+] on Oct 2, 2024
schnoberts1

schnoberts1 commented on Oct 2, 2024

@schnoberts1
Author

... I guess the counter argument is that there are plenty of situations where you might want to abandon your connection while the server is still writing data, e.g. validation of content fails but we have a workaround, always consume everything up until you think it's safe to close the connection (i.e. you are likely caught up), even in that situation. Obv won't work for streaming data I imagine.

schnoberts1

schnoberts1 commented on Oct 2, 2024

@schnoberts1
Author

Also, I do appreciate the effort put into looking at this. It's helped a lot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      Participants

      @schnoberts1@JAndrassy@pennam@megacct@stopandchatchfire

      Issue actions

        Every 5th TCP connection fails on Arduino Giga R1 if we don't fully read the HTTP content sent from the server each time · Issue #937 · arduino/ArduinoCore-mbed