Skip to content

Serious problem with millis() #2430

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Jeroen88 opened this issue Feb 6, 2019 · 16 comments
Closed

Serious problem with millis() #2430

Jeroen88 opened this issue Feb 6, 2019 · 16 comments

Comments

@Jeroen88
Copy link
Contributor

Jeroen88 commented Feb 6, 2019

Hardware:

Board: ESP32 Dev Module nodeMCU 32s
Core Installation version: latest git
IDE name: Arduino IDE
Flash Frequency: 40Mhz
PSRAM enabled: ?no? ?yes?
Upload Speed: default speed
Computer OS: Ubuntu

Description:

millis() is not working as expected, causing time out problems every 72 minutes in many functions of the core and user programs

Sketch: (leave the backquotes for code formatting)

void setup() {
  // put your setup code here, to run once:
  Serial.begin(115200);
  Serial.println("\nDemonstrate often used timeout loop error\nHappens every about 72 minutes");

  Serial.printf("0xFFFFFFFF + 100 is %lu\n", 0xFFFFFFFFUL + 100UL);
  Serial.printf("Negatif will be %lu\n", -4294967);
}

void loop() {
  // put your main code here, to run repeatedly:
  unsigned long start = millis();
  unsigned long duration;
  unsigned long startMicros = micros();
  unsigned long durationMicros;
  for(;;) {
    duration = millis() - start;
    durationMicros = micros() - startMicros;
    if(duration > 0x7FFFFFFFUL) {
      Serial.printf("\r\nMillis() rolled over: %u. In micros / 1000: %u", duration, durationMicros / 1000UL);
      for(;;) delay(1);
    }
    if(duration >= 3000) break;
    delay(1);
  }
  Serial.printf("\r\nTimeout took %lu ms. Wait another about %lu minutes for the error to occur", duration, (0xFFFFFFFF - micros()) / 60000000UL);
}

Debug Messages:

15:33:49.694 -> Demonstrate often used timeout loop error
15:33:49.694 -> Happens every about 72 minutes
15:33:49.694 -> 0xFFFFFFFF + 100 is 99
15:33:49.694 -> Negatif will be 4290672329

15:33:52.682 -> Timeout took 3000 ms. Wait another about 71 minutes for the error to occur
15:33:55.700 -> Timeout took 3000 ms. Wait another about 71 minutes for the error to occur
15:33:58.688 -> Timeout took 3000 ms. Wait another about 71 minutes for the error to occur
…
16:45:16.791 -> Timeout took 3000 ms. Wait another about 0 minutes for the error to occur
16:45:19.778 -> Timeout took 3000 ms. Wait another about 0 minutes for the error to occur
16:45:22.800 -> Timeout took 3000 ms. Wait another about 0 minutes for the error to occur
16:45:24.726 -> Millis() rolled over: 4290674261. In micros / 1000: 1932

Issue #2330 was closed, but the issue is not solved in the release mentioned in that PR. In that release the type of micros() and millis() is changed to unsigned long, but this is not solving the issue!

In my honest opinion, this is a very serious error!

The problem is that many timeout loops depend on 2’s complement arithmetic and millis() being defined as:

unsigned long IRAM_ATTR millis()
{
    return (unsigned long) (esp_timer_get_time() / 1000);
}

See

unsigned long IRAM_ATTR millis()

Take for instance in Stream.cpp timedRead

int Stream::timedRead()
:

int Stream::timedRead()
{
    int c;
    _startMillis = millis();
    do {
        c = read();
        if(c >= 0) {
            return c;
        }
    } while(millis() - _startMillis < _timeout);
    return -1;     // -1 indicates timeout
}

The default _timeout is 1000 ms.

Now assume that esp_timer_get_time() is about to roll over at the moment that _startMillis() = millis() is assigned. So assume esp_timer_get_time() is 0xFFFFFFFF, and millis() will be 0xFFFFFFFF / 1000 is 4294967295 / 1000 is 4294967. Thus _startMillis will be 4294967.

Next assume that one do { } while loop takes 100 microseconds (the exact duration is not important to demonstrate the effect.

At the time of the evaluation of the while condition, micros() will be 0xFFFFFFFF + 100 is 99, because of 2’s complement arithmetic. So millis() will be 99 / 1000 is 0.

The while condition at the loop will evaluate to 0 - 4294967 is -4294967. Also because of 2’s complement arithmetic this evaluates to 4290672329. The while condition evaluates to 4290672329 < 1000, which is obviously false. So the while loop stops after only 100 _micro_seconds of waiting (the assumed duration of one iteration of the while loop), instead of the intended 1000 _milli_seconds!!!!

In the above example the same thing happens in a real example. The last timeout, exactly after about 72 minutes of running lasted only 1932 ms instead of the wanted 3000 ms.

Stream is in the heart of many programs. And this is just one example. Many timeout loops use the same structure: millis() - start < timeout

@me-no-dev, could you take at look at this?

@bertmelis
Copy link
Contributor

Is it a solution to implement esp8266's logic into esp32? Only difference is esp8266 starts from a uint32_t system timer whereas esp32 has a int64_t timer; both microseconds though.

https://github.com/esp8266/Arduino/blob/master/cores/esp8266/core_esp8266_wiring.c#L64

@stickbreaker
Copy link
Contributor

@Jeroen88 try changing the equation of millis(), define 1000 as a long long.

unsigned long IRAM_ATTR millis()
{
    return (unsigned long) (esp_timer_get_time() / 1000LL);
}

Chuck.

@bertmelis
Copy link
Contributor

isn't that conversion implicit?

@stickbreaker
Copy link
Contributor

The compiler can choose long, because output type is long, one param is long, and processer natural is long. Converting to long first is more efficient. One conversion, natural division, no output conversion.

Don't allow it the choice.

Chuck.

@Jeroen88
Copy link
Contributor Author

Jeroen88 commented Feb 7, 2019

@bertmelis, @stickbreaker Bert, Chuck thnx for the reactions!!

I read the extensive comment at the ESP8266 version of millis() here This uses some "magic" algorithm to do a fast division. It also states that it uses the full 32 bit range for the millis() counter (see here).

The problem is that the ESP32 does not use the full 32 bit range, that is causing the issue. I tested the example code with an ESP8266 and there it works correctly.

@stickbreaker I don't think explicitly changing the type to LL will help. The source counter micros() uses the full 32 bits causing the 'trick' to substract the start_Micros from micros() to always work because of the 2's complement arithmetic. However, because of the division, millis() is not using the full 32 bits, causing trouble around the moment of overflowing of the micros()

@bertmelis
Copy link
Contributor

So the esp8266 implementation should work? (after masking 64-bit micros value with 0xFFFFFFFF to get lower 32-bit part)

Issue is indeed the mix of decimal math with binary rollover logic.

@Jeroen88
Copy link
Contributor Author

Jeroen88 commented Feb 7, 2019

@bertmelis I think it will work, and I think both the ESP32 and the ESP8266 use a unsigned long system timer, so masking should not be necessary

@asetyde
Copy link

asetyde commented Feb 7, 2019

Therefore how we can fix now on 1.0.1, I use a lot mills and I've strange issues link to this

@bertmelis
Copy link
Contributor

it is signed according to this: https://docs.espressif.com/projects/esp-idf/en/latest/api-reference/system/esp_timer.html#_CPPv218esp_timer_get_timev

Masking to avoid compiler warnings (storing 64 bit into 32).

@stickbreaker
Copy link
Contributor

@Jeroen88

@stickbreaker I don't think explicitly changing the type to LL will help. The source counter micros() uses the full 32 bits causing the 'trick' to substract the start_Micros from micros() to always work because of the 2's complement arithmetic. However, because of the division, millis() is not using the full 32 bits, causing trouble around the moment of overflowing of the micros()

Just try modifying the millis() code.

Now assume that esp_timer_get_time() is about to roll over at the moment that _startMillis() = millis() is assigned. So assume esp_timer_get_time() is 0xFFFFFFFF, and millis() will be 0xFFFFFFFF / 1000 is 4294967295 / 1000 is 4294967. Thus _startMillis will be 4294967.

int64_t esp_timer_get_time() is microseconds since the timer subsystem started,(approximately boot time). 2**63ms =292271..... YEARS. int64_t overflow is irrelevant. esp_timer_get_time() will always be POSITIVE.

The problem you are seeing is the early conversion from int64_t to uint32_t just force the compiler to do what you want. Use the full resolution of the timer as input to the division.

Chuck.

@Jeroen88
Copy link
Contributor Author

Jeroen88 commented Feb 7, 2019

@stickbreaker Chuck, I believe you are right! Thanks for your insights. I have no access to my systems now, but will try it this weekend and, if succesful, make a PR.

@bertmelis
Copy link
Contributor

bertmelis commented Feb 7, 2019

On Ideone.com:

#include <iostream>
#include <stdint.h>
using namespace std;

int main() {
	int64_t systemTimer = 0xFFFFFFFF;
	uint32_t lastMillis = (unsigned long) (systemTimer / 1000LL);
	systemTimer += 10; // 10 µsecs later...
	uint32_t currentMillis = (unsigned long) (systemTimer / 1000LL);
	
	if (currentMillis - lastMillis > 1000UL) {
		cout << "This shouldn't happen!\n";
	} else {
		cout << "All ok!\n";
	}
	
	systemTimer += 1000000; // 1000 msecs later...
	currentMillis = (unsigned long) (systemTimer / 1000LL);
	if (currentMillis - lastMillis > 1000UL) {
		cout << "1000 msecs barely  passed.\n";
	} else {
		cout << "corner case\n";
	}
	
	systemTimer += 1000; // 1 msecs later...
	currentMillis = (unsigned long) (systemTimer / 1000LL);
	if (currentMillis - lastMillis > 1000UL) {
		cout << "1001 msecs passed\n";
	} else {
		cout << "coups\n";
	}
	return 0;
}

returns:

All ok!
corner case
1001 msecs passed

@Jeroen88
Copy link
Contributor Author

Jeroen88 commented Feb 7, 2019

@bertmelis :), thnx! I think @stickbreaker suggestion will indeed work. I will make a PR this weekend

@devyte
Copy link

devyte commented Feb 7, 2019

I haven't looked at this core's code, and I haven't read this discussion in detail, so I don't know if this helps, but I thought to mention it anyways:

  • Please look at the millis() implementation of the ESP8266 core. There was a long discussion there, and the current code has been unit tested, proven to be correct without drift (the /1000 is an approximation) and optimized to avoid 64bit calculations (64bit types caused over a 4X execution time increase of millis() over the original code).
  • Peripherally related, please look at PolledTimeout.h in the ESP8266 core. It encapsulates the correct way to do millis() based timing calculations, including correctness for rollover. It is also unit tested. There are examples as well to showcase usage.

@Jeroen88
Copy link
Contributor Author

Jeroen88 commented Feb 8, 2019

@devyte

  • The ESP8266 millis() uses 'magic numbers' and stores the overflow of the 32 bit counter to overcome the problem of drift. The ESP32 timer seems to be a 64 bit natively. So I expect following @stickbreaker 's suggestion to just divide by 1000LL (or maybe even better 1000ULL) will be correct. As far as I understand it, gcc always uses magic numbers to do divisions, so I expect the code esp_timer_get_time() / 1000ULL to be optimized by the compiler already using magic numbers!!
  • PolledTimeout.h uses the 'normal' timeout check: return (t - _start) >= _timeout, see here, so this library is not relevant for the issue here

@Jeroen88
Copy link
Contributor Author

Jeroen88 commented Feb 8, 2019

Solved by PR #2438

@Jeroen88 Jeroen88 closed this as completed Feb 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants