Skip to content

Possible bug ESP8266WebServer::handleClient method #8941

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
6 tasks done
supersjellie opened this issue Jun 11, 2023 · 7 comments
Closed
6 tasks done

Possible bug ESP8266WebServer::handleClient method #8941

supersjellie opened this issue Jun 11, 2023 · 7 comments

Comments

@supersjellie
Copy link
Contributor

supersjellie commented Jun 11, 2023

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-12|ESP-01|
  • Core Version: 3.1.2
  • Development Env: [Arduino IDE]
  • Operating System: [Windows]

Settings in IDE

  • Module: [Generic ESP8266 Module|Nodemcu]
  • Flash Mode: [qio|dio|other]???
  • Flash Size: [4MB/1MB]
  • lwip Variant: [v1.4|]
  • Reset Method: [ck|nodemcu]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [80Mh]
  • Upload Using: [OTA|SERIAL]
  • Upload Speed: [115200

Problem Description

I've got an openHAB system and use five devices constructed with either an ESP01 or a NodeMCU (switches/sensors). openHAB communicates by calling and url and the ESP device returns a JSON string with the response. Usually working fast and stable BUT the openHAB log shows regular timeouts of the ESP's. This can also be reproduced by using a browser in developer mode. Refreshing usually takes 20-60 ms but frequently 1000-3000 ms responses OR 60.000 ms fails occur. The timeouts don't cause a reboot of the ESP, they keep running and the next call is usually succesfull.

I think the source of the issue is using a variable "client" to copy an object to "_currentClient". I'm no expert but learned to avoid this construction, something to do with the copy being a crippled version of the original object. Replacing the code at the start of the ESP8266WebServer-impl.h handleClient method (put it in sketch) to avoid this construction makes it stable for me. Quite simple, there seems not to be a real reason to use the client variable. Adjusted to use _currentClient from the start. Before this adjustment 1 to 10-20 of the calls produced timeout errors in the openHAB log. After this, they've completely gone (testperiod 24 hrs).

So maybe worth investigating. I also have ESP32 devices and they use the same code and also suffer the timeouts. I've not tested it yet, but maybe the same modification may apply here.

MCVE Sketch

void ESP8266WebServerTemplate<ServerType>::handleClient() {

 if (_currentStatus == HC_NONE) {
    //ClientType client = _server.accept();
    _currentClient= _server.accept();
    if (!_currentClient) {
      return;
    }

    DBGWS("New client\n");

    //_currentClient = client;
    _currentStatus = HC_WAIT_READ;
    _statusChange = millis();
        
  }

Debug Messages

@supersjellie
Copy link
Contributor Author

Update. Did the same modification to the ESP32 handleClient method (in WebServer.cpp). And indeed, this solved all timeouts of the 3 ESP32C3 devices I have. It has run only 9 hours but usually each ESP32 logs 3 to 4 timeouts every hour.

@earlephilhower
Copy link
Collaborator

Would you like to submit a PR @supersjellie, since you have a failing test case and a code change which you've verified to make it work properly?

@supersjellie
Copy link
Contributor Author

Would you like to submit a PR

That's ok. Monday, so work started again. It will take some time, probably later this week. I'll do the same for the ESP32 library

@supersjellie
Copy link
Contributor Author

@earlephilhower it was a while ago I created one but I think I got it => #8944

Tomorrow I'll also put one in for the ESP32 (including an issue since it's another repo). Excellent news, it's working even better that I would have hoped. So far (three days running) not even 1 timeout. All calls from all ESP devices successful! Even a small reduction on the average response time since the 'peaks' are gone.

@mcspr
Copy link
Collaborator

mcspr commented Jun 16, 2023

I think the source of the issue is using a variable "client" to copy an object to "_currentClient". I'm no expert but learned to avoid this construction, something to do with the copy being a crippled version of the original object. Replacing the code at the start of the ESP8266WebServer-impl.h handleClient method (put it in sketch) to avoid this construction makes it stable for me. Quite simple, there seems not to be a real reason to use the client variable. Adjusted to use _currentClient from the start. Before this adjustment 1 to 10-20 of the calls produced timeout errors in the openHAB log. After this, they've completely gone (testperiod 24 hrs).

For clarification - WiFiClient simply holds 'ownership' of the underlying connection. Copying it would simply copy a pointer and increment the internal reference counter, so it is pretty cheap thing to do. Same for WiFiClientSecure, where extra data buffers are also shared between every copy of the object.

What seems to be the issue is never removing _currentClient, so it is kept as-is for as long as accept() never returns a new connection. With the updated code, it is replaced with an empty WiFiClient() and forces system to actually close the connection much sooner (see above, reference counter becomes 0).
Also related to #8904 and #8905, where flush() was to used to force network lib to process it instead of holding onto it

(60s second timeout sounds pretty close to TCP_MSL defined in lwip, where it waits for LAST-ACK. other ones seem to be related to our own data handling, where either 2s or 5s are used)

@supersjellie
Copy link
Contributor Author

Well, I'm no c++/Arduino expert. As stated I read it was bad practice. And this code change resolved the 10% timeouts I experienced. So it worked like a charm. Running for almost a week now, and still not a single timeout in the log.

@supersjellie
Copy link
Contributor Author

supersjellie commented Jun 19, 2023

Just to let you know, after ten days of running with the modified code one of the ESP devices produced the first (and so far only) timeout. Still a pretty good result.

2023-06-19 08:35:17.890 [WARN ] [p.internal.http.HttpResponseListener] - Requesting 'http://switch01/buttons' (method='GET', content='null') failed: java.util.concurrent.TimeoutException: Total timeout 3000 ms elapsed

hasenradball pushed a commit to hasenradball/Arduino that referenced this issue Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants