WiFi auto reconnect should occur for all disconnect reasons #7210

Closed
RefactorFactory opened this issue Sep 2, 2022 · 7 comments · Fixed by #7344
Labels
Status: Needs investigation We need to do some research before taking next steps on this issue Status: To be implemented Selected for Development Type: Feature request Feature request for Arduino ESP32

Comments

@RefactorFactory
Contributor

Board

n/a

Device Description

n/a

Hardware Configuration

n/a

Version

v2.0.4

IDE Name

n/a

Operating System

n/a

Flash frequency

n/a

PSRAM enabled

yes

Upload speed

n/a

Description

WiFiSTAClass has an _autoReconnect member, which defaults to true. When WiFiGeneric::_eventCallback() handles an ARDUINO_EVENT_WIFI_STA_DISCONNECTED event, it checks WiFiSTAClass::getAutoReconnect() and only reconnects if the following condition is true:

https://github.com/espressif/arduino-esp32/blob/2.0.4/libraries/WiFi/src/WiFiGeneric.cpp#L971

        else if(WiFi.getAutoReconnect()){
            if((reason == WIFI_REASON_AUTH_EXPIRE) ||
            (reason >= WIFI_REASON_BEACON_TIMEOUT && reason != WIFI_REASON_AUTH_FAIL))
            {
                log_d("WiFi AutoReconnect Running");
                WiFi.disconnect();
                WiFi.begin();
            }
        }

The code above excludes reasons such as WIFI_REASON_ASSOC_EXPIRE, WIFI_REASON_NOT_ASSOCED, WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT, WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT and perhaps other reasons that may randomly occur.
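To make the excluded set concrete, the condition can be evaluated outside the firmware. The following is a standalone sketch, not code from arduino-esp32: `reconnectsIn204` is a hypothetical helper name, and the numeric values are assumed to match the `wifi_err_reason_t` enum in ESP-IDF's esp_wifi_types.h.

```cpp
#include <cassert>
#include <cstdint>

// Subset of wifi_err_reason_t values (assumed to match esp_wifi_types.h).
enum WifiReason : uint8_t {
    WIFI_REASON_AUTH_EXPIRE = 2,
    WIFI_REASON_ASSOC_EXPIRE = 4,
    WIFI_REASON_NOT_ASSOCED = 7,
    WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT = 15,
    WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT = 16,
    WIFI_REASON_BEACON_TIMEOUT = 200,
    WIFI_REASON_AUTH_FAIL = 202,
    WIFI_REASON_ROAMING = 207,
};

// Hypothetical helper: the same boolean condition as the 2.0.4 branch above,
// extracted so it can be checked for individual reason codes.
constexpr bool reconnectsIn204(uint8_t reason) {
    return (reason == WIFI_REASON_AUTH_EXPIRE) ||
           (reason >= WIFI_REASON_BEACON_TIMEOUT && reason != WIFI_REASON_AUTH_FAIL);
}
```

With this helper, `reconnectsIn204(WIFI_REASON_BEACON_TIMEOUT)` is true, while `reconnectsIn204(WIFI_REASON_ASSOC_EXPIRE)` and `reconnectsIn204(WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT)` are false, which is the gap this issue describes.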

I propose that reconnection should occur for more than the reasons currently in the code above because of the following:

  1. The ESP-IDF Programming Guide recommends reconnecting for more reasons:

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/wifi.html#wifi-event-sta-disconnected

The most common event handle code for this event in application is to call esp_wifi_connect() to reconnect the Wi-Fi. However, if the event is raised because esp_wifi_disconnect() is called, the application should not call esp_wifi_connect() to reconnect. It is the application's responsibility to distinguish whether the event is caused by esp_wifi_disconnect() or other reasons. Sometimes a better reconnection strategy is required. Refer to Wi-Fi Reconnect and Scan When Wi-Fi Is Connecting.

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/wifi.html#wi-fi-disconnect-phase

s6.2: In the scenario described above, the application event callback function relays WIFI_EVENT_STA_DISCONNECTED to the application task. The recommended actions are: 1) call esp_wifi_connect() to reconnect the Wi-Fi, 2) close all sockets, and 3) re-create them if necessary. For details, please refer to WIFI_EVENT_STA_DISCONNECTED.

https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/wifi.html#wi-fi-reconnect

The station may disconnect due to many reasons, e.g., the connected AP is restarted. It is the application's responsibility to reconnect. The recommended reconnection strategy is to call esp_wifi_connect() on receiving event WIFI_EVENT_STA_DISCONNECTED.

Sometimes the application needs more complex reconnection strategy:

  • If the disconnect event is raised because the esp_wifi_disconnect() is called, the application may not want to do the reconnection.
  • If the esp_wifi_scan_start() may be called at anytime, a better reconnection strategy is necessary. Refer to Scan When Wi-Fi Is Connecting.

Another thing that need to be considered is that the reconnection may not connect the same AP if there are more than one APs with the same SSID. The reconnection always select current best APs to connect.

  2. The ESP-IDF examples reconnect for all reasons:

https://github.com/espressif/esp-idf/blob/v4.4.2/examples/wifi/getting_started/station/main/station_example_main.c#L68
https://github.com/espressif/esp-idf/blob/v4.4.2/examples/provisioning/wifi_prov_mgr/main/app_main.c#L100

  3. ESP-Jumpstart, a "ready reference, a known set of best steps, gathered from previous experience of others", reconnects for all reasons:

https://github.com/espressif/esp-jumpstart/blob/bf3e26f2295730c8f6e9e7c08c897d2155064c5c/7_mfg/main/app_wifi.c#L112

  4. Auto reconnect is already on by default to provide a sensible default to new developers. Why not make that default also sensibly reconnect for all reasons?
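The strategy quoted from the Programming Guide (reconnect on every disconnect event unless the application itself asked to disconnect) could be sketched as below. This is an illustrative, standalone model only: `ReconnectPolicy` and its members are hypothetical names, not part of arduino-esp32 or ESP-IDF, and the reason code is deliberately ignored.

```cpp
#include <cassert>

// Hypothetical state holder modelling "reconnect for all reasons".
struct ReconnectPolicy {
    bool autoReconnect = true;      // mirrors the default described above
    bool userDisconnected = false;  // set when the application itself disconnects

    // The application calls these around its own connect/disconnect requests.
    void onUserDisconnect() { userDisconnected = true; }
    void onUserConnect() { userDisconnected = false; }

    // Called for every disconnect event, whatever the reason code is.
    bool shouldReconnect() const { return autoReconnect && !userDisconnected; }
};
```

In a real event handler, the result of `shouldReconnect()` would decide whether to start a new connection attempt; tracking the flag is what lets the application distinguish its own disconnect from one caused by the AP or the link.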

Sketch

n/a

Debug Message

n/a

Other Steps to Reproduce

No response

I have checked existing issues, online documentation and the Troubleshooting Guide

  • I confirm I have checked existing issues, online documentation and Troubleshooting guide.
@RefactorFactory RefactorFactory added the Status: Awaiting triage Issue is waiting for triage label Sep 2, 2022
@SuGlider
Collaborator

SuGlider commented Sep 3, 2022

@RefactorFactory - thanks for reporting such a detailed and well-explained issue!

Let's check it out and find a good fix.

@SuGlider SuGlider self-assigned this Sep 3, 2022
@SuGlider SuGlider added Type: Feature request Feature request for Arduino ESP32 Status: To be implemented Selected for Development Status: Needs investigation We need to do some research before taking next steps on this issue and removed Status: Awaiting triage Issue is waiting for triage labels Sep 3, 2022
@mrengineer7777
Collaborator

I have noticed this on 2.0.3 (platform = https://github.com/tasmota/platform-espressif32/releases/download/v2.0.3/platform-espressif32-2.0.3.zip).

[ 16898][W][WiFiGeneric.cpp:873] _eventCallback(): Reason: 4 - ASSOC_EXPIRE WiFi disconnected 'WIFI_REASON_ASSOC_EXPIRE'

This event happens half the time after programming. Must reboot to get the device to connect.

@mrengineer7777
Collaborator

mrengineer7777 commented Oct 11, 2022

The WiFi failure on "ASSOC_EXPIRE" is a serious bug for us, so I plan to submit a PR to fix this issue. Extensive analysis follows. Feedback wanted!

When ARDUINO_EVENT_WIFI_STA_DISCONNECTED occurs,
esp_err_t WiFiGenericClass::_eventCallback(arduino_event_t *event) attempts to reconnect for the following reasons:

  1. If connect fails at turn-on, will retry ONCE (ever) for these reasons:
    WIFI_REASON_AUTH_EXPIRE              = 2,
    WIFI_REASON_BEACON_TIMEOUT           = 200,
    WIFI_REASON_NO_AP_FOUND              = 201,
    WIFI_REASON_AUTH_FAIL                = 202,
    WIFI_REASON_ASSOC_FAIL               = 203,
    WIFI_REASON_HANDSHAKE_TIMEOUT        = 204,
    WIFI_REASON_CONNECTION_FAIL          = 205,
    WIFI_REASON_AP_TSF_RESET             = 206,
    WIFI_REASON_ROAMING                  = 207
  2. If WiFi.getAutoReconnect() is enabled, will reconnect for these reasons:
    WIFI_REASON_AUTH_EXPIRE              = 2,
    WIFI_REASON_BEACON_TIMEOUT           = 200,
    WIFI_REASON_NO_AP_FOUND              = 201,

    WIFI_REASON_ASSOC_FAIL               = 203,
    WIFI_REASON_HANDSHAKE_TIMEOUT        = 204,
    WIFI_REASON_CONNECTION_FAIL          = 205,
    WIFI_REASON_AP_TSF_RESET             = 206,
    WIFI_REASON_ROAMING                  = 207

These are the current disconnect reasons from esp_wifi_types.h, wifi_err_reason_t:

    WIFI_REASON_UNSPECIFIED              = 1,
    WIFI_REASON_AUTH_EXPIRE              = 2,
    WIFI_REASON_AUTH_LEAVE               = 3,
    WIFI_REASON_ASSOC_EXPIRE             = 4,
    WIFI_REASON_ASSOC_TOOMANY            = 5,
    WIFI_REASON_NOT_AUTHED               = 6,
    WIFI_REASON_NOT_ASSOCED              = 7,
    WIFI_REASON_ASSOC_LEAVE              = 8,
    WIFI_REASON_ASSOC_NOT_AUTHED         = 9,
    WIFI_REASON_DISASSOC_PWRCAP_BAD      = 10,
    WIFI_REASON_DISASSOC_SUPCHAN_BAD     = 11,
    WIFI_REASON_BSS_TRANSITION_DISASSOC  = 12,
    WIFI_REASON_IE_INVALID               = 13,
    WIFI_REASON_MIC_FAILURE              = 14,
    WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT   = 15,
    WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT = 16,
    WIFI_REASON_IE_IN_4WAY_DIFFERS       = 17,
    WIFI_REASON_GROUP_CIPHER_INVALID     = 18,
    WIFI_REASON_PAIRWISE_CIPHER_INVALID  = 19,
    WIFI_REASON_AKMP_INVALID             = 20,
    WIFI_REASON_UNSUPP_RSN_IE_VERSION    = 21,
    WIFI_REASON_INVALID_RSN_IE_CAP       = 22,
    WIFI_REASON_802_1X_AUTH_FAILED       = 23,
    WIFI_REASON_CIPHER_SUITE_REJECTED    = 24,
    WIFI_REASON_INVALID_PMKID            = 53,
    WIFI_REASON_BEACON_TIMEOUT           = 200,
    WIFI_REASON_NO_AP_FOUND              = 201,
    WIFI_REASON_AUTH_FAIL                = 202,
    WIFI_REASON_ASSOC_FAIL               = 203,
    WIFI_REASON_HANDSHAKE_TIMEOUT        = 204,
    WIFI_REASON_CONNECTION_FAIL          = 205,
    WIFI_REASON_AP_TSF_RESET             = 206,
    WIFI_REASON_ROAMING                  = 207,

Based on my understanding of scan-when-wi-fi-is-connecting, I would break down the disconnect reasons as:

Disconnected
    WIFI_REASON_ASSOC_LEAVE              = 8,       //Client voluntarily disconnected from AP. Do not reconnect!

Fatal
    WIFI_REASON_UNSPECIFIED              = 1,       //Internal failure (e.g. out of memory) or msg from AP
    WIFI_REASON_DISASSOC_PWRCAP_BAD      = 10,      //Bad power setting
    WIFI_REASON_DISASSOC_SUPCHAN_BAD     = 11,      //Bad channel setting
    WIFI_REASON_IE_INVALID               = 13,      //Invalid element
    WIFI_REASON_UNSUPP_RSN_IE_VERSION    = 21,      //Unsupported RSNE version
    WIFI_REASON_CIPHER_SUITE_REJECTED    = 24,      //Cipher suite rejected due to security policies
    WIFI_REASON_AUTH_FAIL                = 202,     //Auth failed :(

Timeouts (retry)
    WIFI_REASON_AUTH_EXPIRE              = 2,       //Timed out during auth or AP sent reason
    WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT   = 15,      //Timed out during 4-way handshake (ESP uses WIFI_REASON_HANDSHAKE_TIMEOUT instead)
    WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT = 16,      //Group key handshake times out
    WIFI_REASON_802_1X_AUTH_FAILED       = 23,      //802.1X auth failed. Best guess: enterprise radius certificate error or client timeout waiting for server response.
    WIFI_REASON_HANDSHAKE_TIMEOUT        = 204,     //Same as WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT

Transient error (reconnect)
    WIFI_REASON_AUTH_LEAVE               = 3,       //AP is leaving (rebooting?)
    WIFI_REASON_ASSOC_EXPIRE             = 4,       //AP disconnected client due to inactivity
    WIFI_REASON_ASSOC_TOOMANY            = 5,       //AP cannot handle any more clients at this time
    WIFI_REASON_NOT_AUTHED               = 6,       //Client not authenticated
    WIFI_REASON_NOT_ASSOCED              = 7,       //Client not associated
    WIFI_REASON_ASSOC_NOT_AUTHED         = 9,       //Client sent data while associated but not authenticated
    WIFI_REASON_MIC_FAILURE              = 14,      //Message integrity code failure
    WIFI_REASON_IE_IN_4WAY_DIFFERS       = 17,      //The element in the four-way handshake is different from the (Re-)Association Request/Probe and Response/Beacon frame
    WIFI_REASON_INVALID_PMKID            = 53,      //?? Undocumented. PMK is reused to create session keys between the client and the roamed to AP.
    WIFI_REASON_BEACON_TIMEOUT           = 200,     //Client is no longer hearing beacons from AP and has failed 5 probe requests.  AP is likely offline.
    WIFI_REASON_NO_AP_FOUND              = 201,     //Unable to scan the specified AP (SSID or BSSID).  I believe the ESP32 must find the AP in scan list before it will connect. Incorrect SSID or AP offline.
    WIFI_REASON_ASSOC_FAIL               = 203,     //Association failed
    WIFI_REASON_CONNECTION_FAIL          = 205,     //Espressif-specific Wi-Fi reason code: the connection to the AP has failed.
    WIFI_REASON_AP_TSF_RESET             = 206,     //?? Undocumented.  TSF is a timestamp kept by AP and possibly clients.  I see a fix in ESP-IDF for disconnects here: https://github.com/espressif/esp32-wifi-lib/commit/435347a24cec805f81319d3ac6d2a2f17da57bd5
    WIFI_REASON_ROAMING                  = 207,     //?? Undocumented.  Per Google, roaming is triggered by the client when it gets too far from an AP and wants to connect to a stronger one.

Unknown
    WIFI_REASON_BSS_TRANSITION_DISASSOC  = 12,      //?? Undocumented
    WIFI_REASON_GROUP_CIPHER_INVALID     = 18,      //Group cipher invalid
    WIFI_REASON_PAIRWISE_CIPHER_INVALID  = 19,      //Pairwise cipher invalid
    WIFI_REASON_AKMP_INVALID             = 20,      //AKMP invalid
    WIFI_REASON_INVALID_RSN_IE_CAP       = 22,      //RSNE invalid

Since the initial retry fires on AUTH_FAIL, I would argue it should retry for ALL disconnect reasons except WIFI_REASON_ASSOC_LEAVE.

I believe the auto-reconnect should trigger for all Timeout and Transient errors. I don't know what to do with the Unknown reasons.
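As a rough illustration of the breakdown above, the grouping could be expressed as a lookup that the auto-reconnect branch consults. This is a standalone sketch under assumptions, not the code submitted in the PR: `ReasonClass`, `classify`, and `shouldAutoReconnect` are hypothetical names, and the enum simply repeats the values listed earlier in this comment.

```cpp
#include <cassert>

// Reason codes as listed above from esp_wifi_types.h.
enum WifiErrReason : int {
    WIFI_REASON_UNSPECIFIED = 1, WIFI_REASON_AUTH_EXPIRE = 2, WIFI_REASON_AUTH_LEAVE = 3,
    WIFI_REASON_ASSOC_EXPIRE = 4, WIFI_REASON_ASSOC_TOOMANY = 5, WIFI_REASON_NOT_AUTHED = 6,
    WIFI_REASON_NOT_ASSOCED = 7, WIFI_REASON_ASSOC_LEAVE = 8, WIFI_REASON_ASSOC_NOT_AUTHED = 9,
    WIFI_REASON_DISASSOC_PWRCAP_BAD = 10, WIFI_REASON_DISASSOC_SUPCHAN_BAD = 11,
    WIFI_REASON_BSS_TRANSITION_DISASSOC = 12, WIFI_REASON_IE_INVALID = 13,
    WIFI_REASON_MIC_FAILURE = 14, WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT = 15,
    WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT = 16, WIFI_REASON_IE_IN_4WAY_DIFFERS = 17,
    WIFI_REASON_GROUP_CIPHER_INVALID = 18, WIFI_REASON_PAIRWISE_CIPHER_INVALID = 19,
    WIFI_REASON_AKMP_INVALID = 20, WIFI_REASON_UNSUPP_RSN_IE_VERSION = 21,
    WIFI_REASON_INVALID_RSN_IE_CAP = 22, WIFI_REASON_802_1X_AUTH_FAILED = 23,
    WIFI_REASON_CIPHER_SUITE_REJECTED = 24, WIFI_REASON_INVALID_PMKID = 53,
    WIFI_REASON_BEACON_TIMEOUT = 200, WIFI_REASON_NO_AP_FOUND = 201,
    WIFI_REASON_AUTH_FAIL = 202, WIFI_REASON_ASSOC_FAIL = 203,
    WIFI_REASON_HANDSHAKE_TIMEOUT = 204, WIFI_REASON_CONNECTION_FAIL = 205,
    WIFI_REASON_AP_TSF_RESET = 206, WIFI_REASON_ROAMING = 207,
};

// Hypothetical categories, one per heading in the breakdown above.
enum class ReasonClass { Disconnected, Fatal, Timeout, Transient, Unknown };

inline ReasonClass classify(int reason) {
    switch (reason) {
        case WIFI_REASON_ASSOC_LEAVE:
            return ReasonClass::Disconnected;
        case WIFI_REASON_UNSPECIFIED: case WIFI_REASON_DISASSOC_PWRCAP_BAD:
        case WIFI_REASON_DISASSOC_SUPCHAN_BAD: case WIFI_REASON_IE_INVALID:
        case WIFI_REASON_UNSUPP_RSN_IE_VERSION: case WIFI_REASON_CIPHER_SUITE_REJECTED:
        case WIFI_REASON_AUTH_FAIL:
            return ReasonClass::Fatal;
        case WIFI_REASON_AUTH_EXPIRE: case WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT:
        case WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT: case WIFI_REASON_802_1X_AUTH_FAILED:
        case WIFI_REASON_HANDSHAKE_TIMEOUT:
            return ReasonClass::Timeout;
        case WIFI_REASON_AUTH_LEAVE: case WIFI_REASON_ASSOC_EXPIRE:
        case WIFI_REASON_ASSOC_TOOMANY: case WIFI_REASON_NOT_AUTHED:
        case WIFI_REASON_NOT_ASSOCED: case WIFI_REASON_ASSOC_NOT_AUTHED:
        case WIFI_REASON_MIC_FAILURE: case WIFI_REASON_IE_IN_4WAY_DIFFERS:
        case WIFI_REASON_INVALID_PMKID: case WIFI_REASON_BEACON_TIMEOUT:
        case WIFI_REASON_NO_AP_FOUND: case WIFI_REASON_ASSOC_FAIL:
        case WIFI_REASON_CONNECTION_FAIL: case WIFI_REASON_AP_TSF_RESET:
        case WIFI_REASON_ROAMING:
            return ReasonClass::Transient;
        default:
            // Codes 12, 18, 19, 20, 22 and anything not listed land here.
            return ReasonClass::Unknown;
    }
}

// Auto-reconnect only for the Timeout and Transient groups, as proposed above.
inline bool shouldAutoReconnect(int reason) {
    const ReasonClass c = classify(reason);
    return c == ReasonClass::Timeout || c == ReasonClass::Transient;
}
```

Treating unlisted codes as Unknown, and not reconnecting for them, is only one possible default; the question of what to do with the Unknown group is left open above.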

Note to self: will be submitting PR against WiFiGeneric.cpp

mrengineer7777 added a commit to mrengineer7777/arduino-esp32 that referenced this issue Oct 11, 2022
@RefactorFactory
Contributor Author

While #7344 is a great improvement over the existing code and it handles all the WiFi disconnect reasons that I've personally seen, why not just do as the official ESP-IDF samples do?

Consider if there is a problem with #7344 in the future. At that time, if we ask Espressif, they might say "hmm, we never encountered such a problem because we never do that in our ESP-IDF samples (and perhaps other tests)." This hypothetical problem could have been avoided by matching what Espressif does in their code.

Another way to look at this: is it a "bug" in the ESP-IDF samples that they don't do something as complicated as we're suggesting for Arduino-esp32? Perhaps the ESP-IDF samples are "simple" because they're just samples, so it's ok for them, but not real projects?

BTW, I think esphome retries on all disconnect reasons, but I'm not 100% sure because their code is more complicated.

@mrengineer7777
Collaborator

I think a permanent reconnect regardless of the error would be a bad thing. e.g. The WiFi would keep retrying on bad credentials or incorrect configuration. In my app I need the permanent failure before I can rollback to SoftAP for recovery.

Looking at this IDF sample: https://github.com/espressif/esp-idf/blob/v4.4.2/examples/wifi/getting_started/station/main/station_example_main.c#L68
Having a fixed number of retries before failing is a reasonable, simple approach. Note the retry count will have to be reset at the start of a connection attempt, or after a successful connection as the example demonstrates. Thoughts?
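A bounded-retry counter along those lines might look like the sketch below. It is a standalone illustration rather than the IDF example's code: `RetryLimiter` and its methods are hypothetical names, and the retry limit is whatever value the caller chooses.

```cpp
#include <cassert>

// Hypothetical bounded-retry tracker for disconnect events.
class RetryLimiter {
public:
    explicit RetryLimiter(int maxRetries) : maxRetries_(maxRetries) {}

    // Call at the start of a connection attempt or after a successful connection.
    void reset() { retries_ = 0; }

    // Call on each disconnect event.
    // Returns true if another reconnect attempt is allowed, false once the
    // limit is reached (the point where an app could fall back to SoftAP).
    bool onDisconnect() {
        if (retries_ >= maxRetries_) {
            return false;
        }
        ++retries_;
        return true;
    }

    int retries() const { return retries_; }

private:
    int maxRetries_;
    int retries_ = 0;
};
```

For example, with `RetryLimiter limiter(2)`, the first two calls to `onDisconnect()` allow a retry and the third reports a permanent failure until `reset()` is called.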

@RefactorFactory
Contributor Author

I think a permanent reconnect regardless of the error would be a bad thing. e.g. The WiFi would keep retrying on bad credentials or incorrect configuration. In my app I need the permanent failure before I can rollback to SoftAP for recovery.

OK, how about a permanent reconnect unless the disconnect reason is one of a small list (for bad credentials or incorrect configuration)? Then again, are you sure that a WiFi router will not momentarily return "bad credentials"?

Looking at this IDF sample: https://github.com/espressif/esp-idf/blob/v4.4.2/examples/wifi/getting_started/station/main/station_example_main.c#L68 Having a fixed number of retries before failing is a reasonable, simple approach. Note the retry count will have to be reset at the start of a connection attempt, or after a successful connection as the example demonstrates. Thoughts?

That sample uses a default of 5 retries (unless one runs idf.py menuconfig). I think that is inappropriate for a "safe default."

Imagine putting together a project, setting it up, taking a trip, and then the retries fail while you're many hours or days away. This can cause real headaches and annoyances because you're not onsite to deal with the issue. I'm hoping that we can have a more reasonable default behavior to help scenarios like this.

Thanks.

mrengineer7777 added a commit to mrengineer7777/arduino-esp32 that referenced this issue Oct 14, 2022
mrengineer7777 added a commit to mrengineer7777/arduino-esp32 that referenced this issue Nov 4, 2022
@podaen

podaen commented Dec 11, 2022

I have the same issue with ASSOC_LEAVE. I am wondering:

  1. Are you using an MQTT library in this sketch?
  2. Do you have WiFiClientSecure?
  3. Does it only happen when more than one client is doing something?

I ask because I think one of these could be the cause of the ASSOC_LEAVE. I am building a music player, and reconnecting is not interesting for me. Could you give just that input? Many thanks, Dave.
