Skip to content

(properly) check whether ClientContext::connect()'s delay is interrupted by disconnected WiFi #4197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

d-a-v
Copy link
Collaborator

@d-a-v d-a-v commented Jan 19, 2018

No description provided.

@igrr
Copy link
Member

igrr commented Jan 22, 2018

May I suggest doing ref() before entering the delay and then unref() after? This way the context will not be deleted, but we will still be able to check the error after the delay. And there would be no need for this extra static flag.

@igrr
Copy link
Member

igrr commented Jan 22, 2018

Upon reading the (existing) code once again, i have a question regarding LwIP.

With lwip 1.4, when WiFi connection was terminated, error callbacks set on PCBs using tcp_err were dispatched before dispatching application level wifi event callbacks. So ClientContext::_error got called, noticed that _connect_pending == true, caused the Arduino task to wake up via esp_schedule, which checked status() and returned an error to the caller. By the time WiFiClient::stopAll was called, _pcb was already NULL.

Now, with lwip2, it seems that the order is different: error callbacks set on PCBs are not being called before the WiFi event callbacks. Could this be changed, instead of introducing this additional logic?

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 22, 2018

I remember having considered ref/unref in the first sleepless-PR I made, and I don't remember why I discarded the idea. You are right, this is the proper solution.
Regarding the callback order on WiFi , I will check this soon.

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 23, 2018

@igrr

I revisited again this bug without fix.
It does not always SIGV even if the bug is present so I played with os_printf.
In ClientContect::connect(), I print a '?' before the delay, a '!' after the delay, and a '#' in unref near delete.
To trigger easily, I added a second 5000ms delay after the first. This one is never interrupted and I switch the AP off during it. The #4078 sketch has only one connection at a time.
The sequence ?#! is forbidden because this is accessed after delete.
The correct sequence must always be '?!#'.

I get the bad sequence with both lwip (when AP disappear during the second delay).

Are you validating this ?

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 23, 2018

With code

    void unref()
    {
        if(this != 0) {
            DEBUGV(":ur %d\r\n", _refcnt);
            if(--_refcnt == 0) {
                discard_received();
                close();
                if(_discard_cb) {
                    _discard_cb(_discard_cb_arg, this);
                }
                DEBUGV(":del\r\n");
os_printf("#(%p)",this);
                delete this;
            }
        }
    }

    int connect(ip_addr_t* addr, uint16_t port)
    {
        err_t err = tcp_connect(_pcb, addr, port, &ClientContext::_s_connected);
        if (err != ERR_OK) {
            return 0;
        }
os_printf("?(%p)",this);
        _connect_pending = 1;
        _op_start_time = millis();
        // This delay will be interrupted by esp_schedule in the connect callback
        delay(_timeout_ms);
delay(5000);
os_printf("!(%p)\n",this);
        _connect_pending = 0;
        if (state() != ESTABLISHED) {
            abort();
            return 0;
        }
        return 1;
    }

@igrr
Copy link
Member

igrr commented Jan 23, 2018

I think that the fact we do unref (inside WiFiClient::stopAll -> WiFiClient::stop) from OS callback is a bad idea. It might cause issues not only on connect, but also on read/write. Relying on specific order in which callbacks are dispatched (lwip error cb first, or wifi event cb first) has proven to be unreliable.
I need to revisit this again to find a more robust design, but it will be a few of days until i have time to do this.

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 23, 2018

The problem is that we can be interrupted in yields (delay()/*yield()/esp_schedule()).
At the cost of two static variables, we can replace yields with if (!guarded_delay(X)) return which:

  • set static _inprogress to 1, _vanished to 0
  • call delay(X) (0 or timeout)
  • set _inprogress to 0, call delete this if _vanished is set, and return !_vanished

In unref(), instead of a simple delete, set _vanished if _inprogress otherwise call delete.

_inprogress must be initialized to 0, _vanished can be left unset.

...

Or afterall, in guarded_delay(X) we can

  • call ref()
  • call delay(X) (0 or timeout)
  • return !unref() (unref() would return 1 if it calls delete)

with no static variables

@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 24, 2018

I updated this PR with the "guarded_delay" proposal

@d-a-v d-a-v reopened this Jan 24, 2018
@d-a-v
Copy link
Collaborator Author

d-a-v commented Jan 27, 2018

I think all interrupted cases are handled.
@igrr Is this guarded delay satisfactory enough ?
Also the if (this != 0) could be avoided if _client is tested in WiFiClient (needed in only one place)

@igrr
Copy link
Member

igrr commented Jan 27, 2018

Usually a check that this != 0 is an indication that the logic in the code has some problems. Instead of trying to handle the case where the current objects gets deleted during a delay, I'd like to ensure that it does not get deleted. Essentially, the original code mixed together two things: signalling an error state and terminating object lifetime.
Working on a PR so that we can compare two approaches.

@devyte
Copy link
Collaborator

devyte commented Jan 29, 2018

@igrr I agree: use of delete this is generally a Bad Idea, and that is what is happening in this case. I think this is an example:

  • If the wifi connection goes down, the static method stopAll is called
  • stopAll iterates over all wificlients, calling stop()
  • stop() calls clientContext::unref()
  • unref() does delete this;

Therefore, if we want to do away with delete this, if(this), etc, I think we'd have to do away with the whole ref()/unref() model in clientcontext. The thing is, that would be a change that impacts pretty much all TCP-based apps, so I'd say it's a bit risky for a minor release.
OTOH, we do have @d-a-v 's stress testcase to test socket stability, which mitigates that.

@igrr
Copy link
Member

igrr commented Jan 29, 2018

we'd have to do away with the whole ref()/unref() model in clientcontext

I agree with that, using std::shared_ptr is not much more expensive but is a lot nicer. We still need to have the reference counting implementation so that sketches can copy WiFiClient object and pass it by value, as it is sometimes done in Arduino (thanks to the Java/Processing roots of the syntax).

Sorry, I got a bit overwhelmed by some other duties, but my solution, in a nutshell, to avoid deleting the ClientContext on stop. I call tcp_close and/or tcp_abort on the PCB, but keep the ClientContext itself around until the last WiFiClient which references it goes out of scope.

This seems to work but I want to do a stress test for this, with one ESP connected to other ESP's SoftAp, some TCP traffic between the two, and AP randomly shutting down and starting up.

@d-a-v
Copy link
Collaborator Author

d-a-v commented Feb 7, 2018

@igrr are you willing to do that for 2.4.1 ?
This PR can be updated to move the if(!this) from ClientContext::unref to if(_client)@WiFiClient::connectL142

@d-a-v
Copy link
Collaborator Author

d-a-v commented Feb 20, 2018

replaced by #4389

@d-a-v d-a-v closed this Feb 20, 2018
@d-a-v d-a-v deleted the 4078 branch February 20, 2018 00:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants