Skip to content

ESP8266WebServer: The problem with frequent simultaneous connections (early response closing, not timeout) #8904

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
3 of 6 tasks
jjsuwa-sys3175 opened this issue Apr 5, 2023 · 9 comments

Comments

@jjsuwa-sys3175
Copy link
Contributor

jjsuwa-sys3175 commented Apr 5, 2023

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

Settings in IDE

  • Module: [Generic ESP8266 Module]
  • Flash Mode: [qio]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Higher Bandwidth]
  • Reset Method: [nodemcu]
  • Flash Frequency: [80Mhz]
  • CPU Frequency: [160MHz]
  • Upload Using: [SERIAL]
  • Upload Speed: [512000]

Problem Description

A moderate rate of unexpected HTTP (or socket) disconnections/timeouts is observed when trying to connect to the ESP8266WebServer at a high frequency.
But other HTTP servers belonging to the same WiFi do not see such disconnections/timeouts.

MCVE Sketch

#define WIFI_SSID     "YOUR-WIFI-SSID"
#define WIFI_PASSWORD "YOUR-WIFI-PASSWORD"

#include <ESP8266WiFi.h>
#include <ESP8266WebServer.h>

ESP8266WebServer webServer(80);

void setup() {
  Serial.begin(74880);
  delay(1000);
  Serial.print(F("\n\nconnecting to WiFi"));
  WiFi.mode(WIFI_STA);
  WiFi.begin(F(WIFI_SSID), F(WIFI_PASSWORD));
  for (; ; Serial.print('.'), delay(333))
    if (WiFi.status() == WL_CONNECTED)
      break;
  Serial.println(F(", done."));
  String addr = WiFi.localIP().toString();
  webServer.onNotFound([]() {
    webServer.send(404);
  });
  webServer.on(F("/"), HTTP_GET, []() {
    StreamString s;
    s.reserve(40);
    s.print(F("http://"));
    WiFi.localIP().printTo(s);
    s.print(F("/index.html"));
    webServer.sendHeader(F("Location"), s);
    webServer.send(302);
  });
  Serial.printf_P(PSTR("webServer: http://%s/\n"), addr.c_str());
  webServer.on(F("/index.html"), HTTP_GET, []() {
    webServer.send(200, F("text/html"), F("<!DOCTYPE html><html>"
                                          "<head><title>WiFi HTTP server</title></head>"
                                          "<body><h1>WiFi HTTP server</h1><p>hello world</p></body>"
                                          "</html>"));
  });
  Serial.printf_P(PSTR("webServer: http://%s/index.html\n"), addr.c_str());
  webServer.on(F("/blob.bin"), HTTP_GET, []() {
    size_t length = (ESP.random() & 4095) + 4096;
    auto buffer = new uint8_t[length];
    ESP.random(buffer, length);
    webServer.send(200, PSTR("application/octet-stream"), buffer, length);
    delete[] buffer;
  });
  Serial.printf_P(PSTR("webServer: http://%s/blob.bin\n"), addr.c_str());
  webServer.begin();
}

void loop() {
  webServer.handleClient();
}

Detailed Procedure

  1. Build the above MVCE and upload it to the chip.
  2. After the chip reboots and connects to your WiFi, you should get a serial output similar to:
connecting to WiFi................, done.
webServer: http://192.168.2.83/
webServer: http://192.168.2.83/index.html
webServer: http://192.168.2.83/blob.bin
  1. Now, let's use the HTTP load test program "siege" from a linux PC belonging to the same WiFi network to send high-frequency HTTP requests to the chip. First, write something like this in a text file:
http://192.168.2.83/
http://192.168.2.83/index.html
http://192.168.2.83/blob.bin
  1. Next, run "siege" in benchmark mode for a some time (a one hour, for example). You'll probably see some alerts in the near future...
vboxuser@Bullseye64:/tmp$ siege -bt 1H -f siege.txt > /dev/null 
** SIEGE 4.0.7
** Preparing 25 concurrent users for battle.
The server is now under siege...
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
[alert] socket: select and discovered it's not ready sock.c:384: Connection timed out
[alert] socket: read check timed out(30) sock.c:273: Connection timed out
...
  1. Just to be clear, other HTTP servers (Apache, nginx, or even a simple server written in python) don't behave this way. Also, changing the "lwip Variant" setting does the same.
  2. After an hour the "siege" summary looks roughly like this:
Lifting the server siege...
Transactions:                 559940 hits
Availability:                  99.81 %
Elapsed time:                3599.58 secs
Data transferred:             855.06 MB
Response time:                  0.10 secs
Transaction rate:             155.56 trans/sec
Throughput:                     0.24 MB/sec
Concurrency:                   15.94
Successful transactions:      560229
Failed transactions:            1088
Longest transaction:           15.92
Shortest transaction:           0.00

Whether or not a failure rate of 0.19% is appropriate is left to the judgment of each person...

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 5, 2023

The good news is that the esp did not freeze :)

There is no appropriate api in webserver to call wificlient::begin(port, tcp-backlog) but you have tried with other lwIP variants, did this include the "(no features)" variant which has tcp-backlog disabled ?

Otherwise you can try with an increased value of this define :

#define MAX_PENDING_CLIENTS_PER_PORT 5

What I guess happens is that there are too many requests waiting. Starting from the 6th client requesting, lwIP does not anymore answer "please come in and wait for your turn" so the client sees as if there is no server behind the door. It is normally keeping trying for a little while (time telnet 192.168.42.42 1234 gives 2mn10s on ubuntu22.04) then fails because lwIP has not opened the door during that time. This should be the origin of the 0.19% of failure.

@jjsuwa-sys3175
Copy link
Contributor Author

The good news is that the esp did not freeze :)

I agree with that :)

There is no appropriate api in webserver to call wificlient::begin(port, tcp-backlog) but you have tried with other lwIP variants, did this include the "(no features)" variant which has tcp-backlog disabled ?

It won't timeout, but instead:

[error] socket: read error Connection reset by peer sock.c:635: Connection reset by peer
[error] socket: read error Connection reset by peer sock.c:635: Connection reset by peer
[error] socket: unable to connect sock.c:282: Connection refused
[error] socket: unable to connect sock.c:282: Connection refused

Considering that rejected without being put in the backlog queue, it seems to be a reasonable result.

OK, so let's define MAX_PENDING_CLIENTS_PER_PORT to a number (say 100) significantly larger than the number of concurrent "siege" connections (which is set to 25) for v2 Higher Bandwidth... But in conclusion, no change.

No, it's not right to force excessive simultaneous connections in the first place...
Try removing #define MAX_PENDING_CLIENTS_PER_PORT and limiting the number of simultaneous connections for "siege" to 2... Hmm, seems almost no problem in a hour:
v2 Higher Bandwidth:

  • ** SIEGE 4.0.7
    ** Preparing 2 concurrent users for battle.
    The server is now under siege...
    
    Lifting the server siege...
    Transactions:                 493629 hits
    Availability:                 100.00 %
    Elapsed time:                3599.71 secs
    Data transferred:             752.44 MB
    Response time:                  0.01 secs
    Transaction rate:             137.13 trans/sec
    Throughput:                     0.21 MB/sec
    Concurrency:                    1.99
    Successful transactions:      493631
    Failed transactions:              20
    Longest transaction:            2.64
    Shortest transaction:           0.00
    

v2 Lower Memory (no features):

  • ** SIEGE 4.0.7
    ** Preparing 2 concurrent users for battle.
    The server is now under siege...
    
    Lifting the server siege...
    Transactions:                 326131 hits
    Availability:                 100.00 %
    Elapsed time:                3599.60 secs
    Data transferred:             497.17 MB
    Response time:                  0.02 secs
    Transaction rate:              90.60 trans/sec
    Throughput:                     0.14 MB/sec
    Concurrency:                    1.99
    Successful transactions:      326133
    Failed transactions:               5
    Longest transaction:            1.90
    Shortest transaction:           0.00
    

@d-a-v
Copy link
Collaborator

d-a-v commented Apr 5, 2023

so let's define MAX_PENDING_CLIENTS_PER_PORT to a number (say 100)

100 is like in the "no feature" selection, like infinity, and will fail / freeze / reboot because of OOM on high load.

  • Connection reset means that the esp has indeed frozen, rebooted and is back to life, but the opened, accepted and waiting socket on webclient side has been discarded by the rebooted esp
  • Connection refused means that esp is rebooting, network is up but not yet the webserver

These cases show that

  • MAX_PENDING_CLIENTS_PER_PORT has to be lowered and/or
  • "lwIP-lower-memory" has to be selected if "higher-bandwidth" is the current selection

You can try to monitor with getFreeHeap() and alert when the max free chunk is too low (<10KB?) or fragmentation too high.

Try removing #define MAX_PENDING_CLIENTS_PER_PORT and limiting the number of simultaneous connections for "siege" to 2...

It should give the same result as when defining MAX_PENDING_CLIENTS_PER_PORT to 1

@jjsuwa-sys3175
Copy link
Contributor Author

so let's define MAX_PENDING_CLIENTS_PER_PORT to a number (say 100)

100 is like in the "no feature" selection, like infinity, and will fail / freeze / reboot because of OOM on high load.

Oh, aha, I see, the backlog obviously needs memory :)

You can try to monitor with getFreeHeap() and alert when the max free chunk is too low (<10KB?) or fragmentation too high.

I think it would be very helpful if there was something in the Debug Level setting that would reveal this.

@jjsuwa-sys3175
Copy link
Contributor Author

so let's define MAX_PENDING_CLIENTS_PER_PORT to a number (say 100)

100 is like in the "no feature" selection, like infinity, and will fail / freeze / reboot because of OOM on high load.

Oh, aha, I see, the backlog obviously needs memory :)

It doesn't look like memory is a constraint... Modify the above MVCE loop() as follows:

void loop() {
  webServer.handleClient();
  static size_t prev_l = 0;
  size_t l = (ESP.getFreeHeap() + 1024) / 2048;
  if (l != prev_l) {
    prev_l = l;
    String s;
    s.reserve(40 * 2);
    size_t i;
    for (i = 0; i < l; ++i)
      s += '#';
    for (; i < 40; ++i)
      s += ' ';
    for (i = 0; i < 40; ++i)
      s += '\b';
    Serial.print(s);
  }
}

This allows you to visually track the remaining memory (requires a terminal that can handle backspace). But it seems that there is still enough memory left even during timeouts or other connection failures...

@jjsuwa-sys3175
Copy link
Contributor Author

Well, I apologize for not being able to present clear evidence here, but I would like to assert the following:

  • It is quite rare (in my experience, about less than 0.1%) but certain that connections are abruptly closed before all responses are received from ESP8266WebServer, possibly caused by early-arriving FINs due to IP packet reordering or dropouts.
  • I came up with a simple workaround for the above, which I'm testing right now, but it looks fairly good.
    --- a/libraries/ESP8266WebServer/src/ESP8266WebServer-impl.h
    +++ b/libraries/ESP8266WebServer/src/ESP8266WebServer-impl.h
    @@ -769,6 +769,7 @@ void ESP8266WebServerTemplate<ServerType>::_finalizeResponse() {
       if (_chunked) {
         sendContent(emptyString);
       }
    +  _currentClient.flush();
     }
     
     template <typename ServerType>
    

@jjsuwa-sys3175
Copy link
Contributor Author

Finally, only one early-closing server response has ever been occurred with a quickly-made load testing about 1.6 million HTTP transactions by 5 concurrent connections.

I'm still not able to personally convinced, but I'm going to close the issue. Thanks all.

@jjsuwa-sys3175
Copy link
Contributor Author

jjsuwa-sys3175 commented Apr 12, 2023

There seems to be some kind of problem with the ESP8266WebServer, so I've reopened this issue.

In general, timeouts in opening connections can be safely resolved by retries, but early closed responses after connection establishment cannot.
With that in mind, let's go back to the first MVCE and prepare an HTTP connection load test using the famous Python Requests.

#
if __name__ == "__main__":
    import itertools
    import sys
    import requests
    url = sys.argv[1]
    headers = None  # {"Connection":"close"}
    for n in itertools.count():
        response = requests.get(url, timeout=None, headers=headers)
        print("#%d" % n, end="\r")

No special explanation needed, just a normal Python example of many HTTP connections - in fact, we can issue millions of requests on a single alone connection without any problems (except for timeouts).
However, if we do more than one of the above simultaneously, all but one will eventually abort due to early closing of the response - parhaps taking no more than an hour:

Traceback (most recent call last):
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\connectionpool.py", line 703, in urlopen
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\connectionpool.py", line 449, in _make_request
  File "<string>", line 3, in raise_from
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\connectionpool.py", line 444, in _make_request
  File "http\client.py", line 1344, in getresponse
  File "http\client.py", line 307, in begin
  File "http\client.py", line 276, in _read_status
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\requests-2.28.2.zip\requests\adapters.py", line 489, in send
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\connectionpool.py", line 787, in urlopen
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\util\retry.py", line 550, in increment
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\packages\six.py", line 769, in reraise
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\connectionpool.py", line 703, in urlopen
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\connectionpool.py", line 449, in _make_request
  File "<string>", line 3, in raise_from
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\urllib3-1.26.15.zip\urllib3\connectionpool.py", line 444, in _make_request
  File "http\client.py", line 1344, in getresponse
  File "http\client.py", line 307, in begin
  File "http\client.py", line 276, in _read_statusurllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Z:\torture_requests.py", line 9, in <module>
    response = requests.get(url, timeout=None, headers=headers)
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\requests-2.28.2.zip\requests\api.py", line 73, in get
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\requests-2.28.2.zip\requests\api.py", line 59, in request
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\requests-2.28.2.zip\requests\sessions.py", line 587, in request
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\requests-2.28.2.zip\requests\sessions.py", line 701, in send
  File "C:\Program Files\python-3.8.10-embed\lib\site-packages\requests-2.28.2.zip\requests\adapters.py", line 547, in send
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

What this exception trace suggests is that the connection was lost while trying to receive the status line (yes, the first line) of the HTTP response.

FYI, adding the HTTP/1.1 Connection: close header doesn't help anything.

@jjsuwa-sys3175 jjsuwa-sys3175 changed the title ESP8266WebServer: The problem with frequent connections ESP8266WebServer: The problem with frequent connections (early response closing, not timeout) Apr 12, 2023
@jjsuwa-sys3175 jjsuwa-sys3175 changed the title ESP8266WebServer: The problem with frequent connections (early response closing, not timeout) ESP8266WebServer: The problem with frequent simultaneous connections (early response closing, not timeout) Apr 12, 2023
@jjsuwa-sys3175
Copy link
Contributor Author

#8944 seems to fix this issue as well :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants