Skip to content

bug in delayMicroseconds() when called with 0 #1121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
RobTillaart opened this issue Nov 18, 2012 · 12 comments
Closed

bug in delayMicroseconds() when called with 0 #1121

RobTillaart opened this issue Nov 18, 2012 · 12 comments
Labels
Component: Core Related to the code for the standard Arduino API

Comments

@RobTillaart
Copy link

see also - http://arduino.cc/forum/index.php/topic,132983.msg1000603.html#msg1000603

The delayMicroseconds function fails when called with a value of 0.
Although this is not done normally it can happen when called with a variable (result of some math). The bug can be traced back to the if (--us ==0) statement
BUG in 0.22 + 1.0.0 + 1.0.2

Find test program and patch below (the patch does not include the 1.0.2 20Mhz addition as I cannot test it)

Note: The patch is for up to 1.0.0 version.

Program to show bug

//
//    FILE: delayMicrosecondsBUG.pde
//  AUTHOR: Rob Tillaart
//    DATE: 2012-11-18
//
// PUPROSE: test delayMicroseconds()
//

void setup()
{
  Serial.begin(9600);
  Serial.println("start...");

  unsigned long m = micros();
  for (uint8_t i=0; i<100; i++)
  {
    delayMicroseconds(0);
  }
  Serial.println(micros()- m);

  m = micros();
  for (uint8_t i=0; i<100; i++)
  {
    delayMicroseconds(1);
  }
  Serial.println(micros()- m);

  m = micros();
  for (uint8_t i=0; i<100; i++)
  {
    delayMicroseconds(2);
  }
  Serial.println(micros()- m);

  m = micros();
  delayMicroseconds(20);
  Serial.println(micros()- m);
  m = micros();
  delayMicroseconds(200);
  Serial.println(micros()- m);
  m = micros();
  delayMicroseconds(2000);
  Serial.println(micros()- m);

}

void loop()
{}

Find below a patch for the function: (0.22 - 1.0.0)

void delayMicroseconds(unsigned int us)
{
    // calling avrlib's delay_us() function with low values (e.g. 1 or
    // 2 microseconds) gives delays longer than desired.
    //delay_us(us);

#if F_CPU >= 16000000L
    // for the 16 MHz clock on most Arduino boards

    // for a one-microsecond delay, simply return.  the overhead
    // of the function call yields a delay of approximately 1 1/8 us.
//  PATCH
//  if (--us == 0)
//      return;
    if (us < 2) return;
    us--;

    // the following loop takes a quarter of a microsecond (4 cycles)
    // per iteration, so execute it four times for each microsecond of
    // delay requested.
    us <<= 2;

    // account for the time taken in the preceeding commands.
    us -= 2;
#else
    // for the 8 MHz internal clock on the ATmega168

    // for a one- or two-microsecond delay, simply return.  the overhead of
    // the function calls takes more than two microseconds.  can't just
    // subtract two, since us is unsigned; we'd overflow.
// PATCHED LINES    
//  if (--us == 0)
//      return;
//  if (--us == 0)
//      return;
    if (us < 3) return;
    us -= 2;

    // the following loop takes half of a microsecond (4 cycles)
    // per iteration, so execute it twice for each microsecond of
    // delay requested.
    us <<= 1;

    // partially compensate for the time taken by the preceeding commands.
    // we can't subtract any more than this or we'd overflow w/ small delays.
    us--;
#endif

    // busy wait
    __asm__ __volatile__ (
        "1: sbiw %0,1" "\n\t" // 2 cycles
        "brne 1b" : "=w" (us) : "0" (us) // 2 cycles
    );
}
@ffissore
Copy link
Contributor

Hello @RobTillaart
with an Arduino UNO and your patch applied, the first println shows a consistent 120 delay, while the second prints 116
Could it be better to just avoid calling delayMicroseconds with something less than 1?

@RobTillaart
Copy link
Author

Hi,

First, 120 and 116 are in practice the same value as micros() has an accuracy of 4 us. As the same code path is executed for p=1 and p=0 this output is to be expected.

Second, of course it is better not to call delayMicroSeconds(0) but sometimes the parameter is the result of an calculation, an IO operation from a sensor or input from a user (etc). It would be better to handle zero value before the call is made, e.g. like:

if (p >= 1) delayMicroSeconds(p);

But for those people that still use param 0 the function should do its utter best to return asap. The old code did not do that.

I have considered to make delayMicroSeconds() a #define but macros are not trivial either (e.g. different frequencies) and it would possibly break compatibility (not investigated). A macro could prevent the call to the real delay function with an if like above. I did not explore it further but it should be something like

#if F_CPU >= 16000000L

#define delayMicroSeconds(p) (p)<1?0:delayUSec(p);

#else // assume 8MHz

#define delayMicroSeconds(p) (p)<2?0:delayUSec(p);

#end

Hope this helps,

Regards,
Rob

@nickgammon
Copy link

Any planned action on this? Surely delayMicroSeconds(0) should return as promptly as possible?

@RobTillaart
Copy link
Author

Current proposal returns as fast as possible. if (us < 2) return; is the first statement. If the delay is smaller than 2 micros the overhead of the function call itself is larger. The original code does a --us which causes a wrap around if us == 0 in the call. That is faulty by design.

The returned values of 120 and 116 are for 100 calls, so on average the call takes 1.18 micros,

The current implementation takes 1648292 micros for 100 calls delayMicros(0);

@nick,
I shall post a request on the developers list to have this fix added in the next release.

@cmaglie
Copy link
Member

cmaglie commented Nov 20, 2013

Have you already explored the possibility to inline the whole function?
something like this:

https://github.com/arduino/Arduino/blob/ide-1.5.x/hardware/arduino/sam/cores/arduino/wiring.h#L65

C

@RobTillaart
Copy link
Author

new patched function, with support for 20Mhz

/* Delay for the given number of microseconds.  Assumes a 8 or 16 MHz clock. */
void delayMicroseconds(unsigned int us)
{
    // calling avrlib's delay_us() function with low values (e.g. 1 or
    // 2 microseconds) gives delays longer than desired.
    //delay_us(us);
#if F_CPU >= 20000000L
    // for the 20 MHz clock on rare Arduino boards

    // for a one-microsecond delay, simply wait 2 cycle and return. The overhead
    // of the function call yields a delay of exactly a one microsecond.
    __asm__ __volatile__ (
        "nop" "\n\t"
        "nop"); //just waiting 2 cycle
//  PATCH
//  if (--us == 0)
//      return;
    if (us < 2) return;
    us--;

    // the following loop takes a 1/5 of a microsecond (4 cycles)
    // per iteration, so execute it five times for each microsecond of
    // delay requested.
    us = (us<<2) + us; // x5 us

    // account for the time taken in the preceeding commands.
    us -= 2;

#elif F_CPU >= 16000000L
    // for the 16 MHz clock on most Arduino boards

    // for a one-microsecond delay, simply return.  the overhead
    // of the function call yields a delay of approximately 1 1/8 us.
//  PATCH
//  if (--us == 0)
//      return;
    if (us < 2) return;
    us--;

    // the following loop takes a quarter of a microsecond (4 cycles)
    // per iteration, so execute it four times for each microsecond of
    // delay requested.
    us <<= 2;

    // account for the time taken in the preceeding commands.
    us -= 2;
#else
    // for the 8 MHz internal clock on the ATmega168

    // for a one- or two-microsecond delay, simply return.  the overhead of
    // the function calls takes more than two microseconds.  can't just
    // subtract two, since us is unsigned; we'd overflow.
// PATCH    
//  if (--us == 0)
//      return;
//  if (--us == 0)
//      return;
    if (us < 3) return;
    us -= 2;

    // the following loop takes half of a microsecond (4 cycles)
    // per iteration, so execute it twice for each microsecond of
    // delay requested.
    us <<= 1;

    // partially compensate for the time taken by the preceeding commands.
    // we can't subtract any more than this or we'd overflow w/ small delays.
    us--;
#endif

    // busy wait
    __asm__ __volatile__ (
        "1: sbiw %0,1" "\n\t" // 2 cycles
        "brne 1b" : "=w" (us) : "0" (us) // 2 cycles
    );
}

@RobTillaart
Copy link
Author

@cmaglie
The code that you refer to is this (right?)

static inline void delayMicroseconds(uint32_t usec){
    uint32_t n = usec * (VARIANT_MCK / 3000000);
    asm volatile(
        "L_%=_delayMicroseconds:"       "\n\t"
        "subs   %0, #1"                 "\n\t"
        "bge    L_%=_delayMicroseconds" "\n"
        : "+r" (n) :
    );

I'm not ARM assembler expert, but ...

  • assume the value of usec is zero (calculated value), then the variable n will be zero too.
    The first statement executed is subs %0, Add files used in Linux packages #1 ==> n= n-1 ==> n = 0-1 ==> 0xFFFFFFFF (unsigned math).
    Then the test bge is performed causing it to loop until the value n==zero is reached again.

I cannot verify this as I have my DUE not operational, but as far as I understand the assembler code, the above is what I expect to happen. It is a similar error to the if (--us == 0) return; We may not assume that the code is never called with usec == zero.

Can you measure the duration of a call to delayMicroseconds(0) on a DUE?

The code can be fixed, check for zero before anything else.

static inline void delayMicroseconds(uint32_t usec){
    if (usec == 0) return;  // or (usec < 2) ...
    uint32_t n = usec * (VARIANT_MCK / 3000000);
    asm volatile(
        "L_%=_delayMicroseconds:"       "\n\t"
        "subs   %0, #1"                 "\n\t"
        "bge    L_%=_delayMicroseconds" "\n"
        : "+r" (n) :
    );

opinion?

Thanks for your time,
Rob

@cmaglie
Copy link
Member

cmaglie commented Nov 21, 2013

I remember that the version I've taken from Teensy3 contained the "bne" opcode. I changed to "bge" to solve the delayMicrosecond(0) case:

http://forum.arduino.cc/index.php?topic=141030.msg1117412#msg1117412

As stated here:

https://sourceware.org/cgen/gen-doc/arm-thumb-insn.html#insn-bne
https://sourceware.org/cgen/gen-doc/arm-thumb-insn.html#insn-bge

the bge should exit if only ONE of Null-flag or Overflow-flag (V) is set. This should handle all cases except very long delays (>2^31uS) that are very unlike to happen.

I run this test on the Due:

#define DM1(x) delayMicroseconds(x)
#define DM5(x) DM1(x);DM1(x);DM1(x);DM1(x);DM1(x)
#define DM25(x) DM5(x);DM5(x);DM5(x);DM5(x);DM5(x)
#define DM125(x) DM25(x);DM25(x);DM25(x);DM25(x);DM25(x)
#define DM500(x) DM125(x);DM125(x);DM125(x);DM125(x)
#define DM1000(x) DM500(x);DM500(x)

void setup() {
  Serial.begin(115200);
  test(1);
  test(2);
  test(3);
  test(4);
  test(5);
  test(0);
  test(10);
}

void test(int d) {
  unsigned long t = micros();
  for (volatile int i = 0; i < 1000; i++) {
    DM1000(d);
  }
  t = micros() - t;
  Serial.print("delayMicroseconds(");
  Serial.print(d);
  Serial.print(") => ");
  Serial.print(String(((float)t)/1000000.0, 5));
  Serial.println("uS");
}

void loop() { }

The results with "bge" are:

delayMicroseconds(1) => 1.03707uS
delayMicroseconds(2) => 2.03821uS
delayMicroseconds(3) => 3.03956uS
delayMicroseconds(4) => 4.04048uS
delayMicroseconds(5) => 5.04201uS
delayMicroseconds(0) => 0.03745uS
delayMicroseconds(10) => 10.05086uS

The results with "bne" are:

delayMicroseconds(1) => 1.00131uS
delayMicroseconds(2) => 2.00251uS
delayMicroseconds(3) => 3.00364uS
delayMicroseconds(4) => 4.00485uS
delayMicroseconds(5) => 5.00595uS

bge is not as precise as bne, but I don't know why, maybe the 30nS drift are extra clock cycles eaten by bge at exit? On the other side we can see that bne fails with delays of 0.

C

@cmaglie cmaglie reopened this Nov 21, 2013
@RobTillaart
Copy link
Author

OK, the due has a correct delayMicroSeconds.

delayMicroseconds(1) => 1.03707uS
delayMicroseconds(2) => 2.03821uS
delayMicroseconds(3) => 3.03956uS
delayMicroseconds(4) => 4.04048uS
delayMicroseconds(5) => 5.04201uS
delayMicroseconds(0) => 0.03745uS
delayMicroseconds(10) => 10.05086uS

The errors are small and they seem relative linear
md(n) ==> n + 0.036... + 0.001*n

The "bne" version is a factor 10 more accurate and if a test for zero

if (usec == 0) return;

can be added?

Still the delayMicroseconds(0) fails on the AVR platform.

@cmaglie
Copy link
Member

cmaglie commented Dec 20, 2013

I discovered that the compiler is able to perform a very aggressive optimization (it can factor the formula to be executed only once per loop!). A more fair benchmark is the following:

static inline void delay_bne(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bne(uint32_t usec) {
  uint32_t n = usec * (VARIANT_MCK / 3000000);
  asm volatile(
    "L_%=_delayMicroseconds:         \n\t"
    "subs   %0, #1                   \n\t"
    "bne    L_%=_delayMicroseconds   \n"
    : "+r" (n) :
  );
}

static inline void delay_bne_with_check(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bne_with_check(uint32_t usec) {
  if (usec == 0) return;
  uint32_t n = usec * (VARIANT_MCK / 3000000);
  asm volatile(
    "L_%=_delayMicroseconds:         \n\t"
    "subs   %0, #1                   \n\t"
    "bne    L_%=_delayMicroseconds   \n"
    : "+r" (n) :
  );
}


static inline void delay_bge(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bge(uint32_t usec) {
  uint32_t n = usec * (VARIANT_MCK / 3000000);
  asm volatile(
    "L_%=_delayMicroseconds:         \n\t"
    "subs   %0, #1                   \n\t"
    "bge    L_%=_delayMicroseconds   \n"
    : "+r" (n) :
  );
}

static inline void delay_bge_with_check(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bge_with_check(uint32_t usec) {
  if (usec == 0) return;
  uint32_t n = usec * (VARIANT_MCK / 3000000);
  asm volatile(
    "L_%=_delayMicroseconds:         \n\t"
    "subs   %0, #1                   \n\t"
    "bge    L_%=_delayMicroseconds   \n"
    : "+r" (n) :
  );
}

#define DM1(x) delay_bge_with_check(x)
#define DM5(x) DM1(x);DM1(x);DM1(x);DM1(x);DM1(x)
#define DM25(x) DM5(x);DM5(x);DM5(x);DM5(x);DM5(x)
#define DM125(x) DM25(x);DM25(x);DM25(x);DM25(x);DM25(x)
#define DM500(x) DM125(x);DM125(x);DM125(x);DM125(x)
#define DM1000(x) DM500(x);DM500(x)

void setup() {
  Serial.begin(115200);
  test(1);
  test(2);
  test(3);
  test(4);
  test(5);
  test(10);
  test(0);
}

void test(int d) {
  volatile int _d = d; // No optimizations

  unsigned long elapsed = micros();
  for (int i = 0; i < 1000; i++) {
    DM1000(_d);
  }
  unsigned long elapsed_end = micros();
  float delta = (elapsed_end - elapsed);

  Serial.print("delayMicroseconds(");
  Serial.print(d);
  Serial.print(") => delta ");
  Serial.print(String(delta / 1000000.0, 5));
  Serial.print("uS, ");

  elapsed = micros();
  for (int i = 0; i < 1000; i++) {
    DM1000(d);
  }
  elapsed_end = micros();
  delta = (elapsed_end - elapsed);

  Serial.print("with optimization ");
  Serial.print(String(delta / 1000000.0, 5));
  Serial.println("uS");
}

void loop() { }

and here the results:

bne

delayMicroseconds(1) => delta 1.06090uS, with optimization 1.00134uS
delayMicroseconds(2) => delta 2.06250uS, with optimization 2.00256uS
delayMicroseconds(3) => delta 3.06342uS, with optimization 3.00373uS
delayMicroseconds(4) => delta 4.06504uS, with optimization 4.00497uS
delayMicroseconds(5) => delta 5.07067uS, with optimization 5.00622uS
delayMicroseconds(10) => delta 10.07956uS, with optimization 10.01182uS
fails with 0

bne with check

delayMicroseconds(1) => delta 1.06679uS, with optimization 1.00133uS
delayMicroseconds(2) => delta 2.06794uS, with optimization 2.00255uS
delayMicroseconds(3) => delta 3.06908uS, with optimization 3.00394uS
delayMicroseconds(4) => delta 4.07023uS, with optimization 4.00507uS
delayMicroseconds(5) => delta 5.07138uS, with optimization 5.00590uS
delayMicroseconds(10) => delta 10.07711uS, with optimization 10.01351uS
delayMicroseconds(0) => delta 0.08353uS, with optimization 0.00026uS

bge

delayMicroseconds(1) => delta 1.09665uS, with optimization 1.03707uS
delayMicroseconds(2) => delta 2.09842uS, with optimization 2.03826uS
delayMicroseconds(3) => delta 3.10063uS, with optimization 3.03955uS
delayMicroseconds(4) => delta 4.10021uS, with optimization 4.04083uS
delayMicroseconds(5) => delta 5.10138uS, with optimization 5.04180uS
delayMicroseconds(10) => delta 10.11929uS, with optimization 10.04818uS
delayMicroseconds(0) => delta 0.07165uS, with optimization 0.03738uS

bge with check

delayMicroseconds(1) => delta 1.10255uS, with optimization 1.03706uS
delayMicroseconds(2) => delta 2.10369uS, with optimization 2.03824uS
delayMicroseconds(3) => delta 3.10484uS, with optimization 3.03970uS
delayMicroseconds(4) => delta 4.10599uS, with optimization 4.04075uS
delayMicroseconds(5) => delta 5.10715uS, with optimization 5.04255uS
delayMicroseconds(10) => delta 10.11285uS, with optimization 10.05182uS
delayMicroseconds(0) => delta 0.08353uS, with optimization 0.00026uS

Looking at the "unoptimized" column (that I guess is the most common case) the precision of the bne version is not so striking against bge, but its still more precise.
It seems that your suggestion to use bne+check is the best compromise.

Now my original question: in the AVR core I see that delay is a non-inlined function, so we have to do all the tricks to bias the function call delays. Inlining the delay may help to simplify that?

C

@RobTillaart
Copy link
Author

BNE check   rel err BGE rel err

1,00000 1,06679 0,06679 1,09665 0,09665
2,00000 2,06794 0,03397 2,09842 0,04921
3,00000 3,06908 0,02303 3,10063 0,03354
4,00000 4,07023 0,01756 4,10021 0,02505
5,00000 5,07138 0,01428 5,10138 0,02028
10,00000 10,07711 0,00771 10,11929 0,01193

The relative error of BNE + CHECK versus BGE shows the difference more explicitely. BNEC has 25-30% less relative error than BGE

Conclusion go for BNEC.


Now my original question: in the AVR core I see that delay is a non-inlined function,
so we have to do all the tricks to bias the function call delays.
Inlining the delay may help to simplify that?`

No inlining will not help, the idea of the uDelay is that you calculate the number of iterations for a very tight loop to mimic a 'clock'. For very small values of the delay calculating the number of iterations already takes too much time. So even inlined code will need 'tricks' (maybe less). Furthermore inlining means larger footprint and although this might be little, it adds up (in my own code, libraries etc).

So for the AVR I propose not to inline, but add the check that tests for values of 0 and 1 before calculating the number of iterations.

My current (AVR) implementation differs a bit of the proposal above to get a better timing for the low values. Note: the 20 and 8 MHz versions are (manual) optimized on a 16Mhz board and are not confirmed yet on a real board.

/* Delay for the given number of microseconds.  Assumes a 8 or 16 MHz clock. */
void delayMicroseconds(unsigned int us)
{
    // calling avrlib's delay_us() function with low values (e.g. 1 or
    // 2 microseconds) gives delays longer than desired.
    // delay_us(us);

#if F_CPU == 20000000L
    // for the 20 MHz clock on rare Arduino boards

    // handle 0 and 1
    if (us == 0) return;
    __asm__ __volatile__ (
        "nop" "\n\t"
        "nop" "\n\t"
        "nop" "\n\t"
        "nop");
    if (us == 1) return;

    // calc # iterations
    us = (us<<2) + us; // x5 us

    // account for the time taken in the preceeding commands.
    us -= 7;

#elif F_CPU == 16000000L
    // for the 16 MHz clock on most Arduino boards

    // handle 0 and 1
    if (us == 0) return;
    if (us == 1) return;

    // calc # iterations
    us <<= 2;
    // account for the time taken in the preceeding commands.
    us -= 5;

#elif F_CPU == 8000000L
    // for the 8 MHz internal clock on the ATmega168

    // handle 0, 1;
    if (us < 2) return;
    // handle 2;
    if (us == 2) return;

    // calc # iterations
    us <<= 1;

    // account for the time taken in the preceeding commands.
    us -= 5;
    // quarter uSecond adjust
    __asm__ __volatile__ (
        "nop\n\t"
    );

#elif F_CPU == 1000000L

// TODO 1 MHz
#error "delayMicroseconds: 1 MHz not supported yet"

# else

#error "delayMicroseconds: not supported clockspeed"

#endif

    // busy wait
    __asm__ __volatile__ (
        "1: sbiw %0,1" "\n\t" // 2 cycles
        "brne 1b" : "=w" (us) : "0" (us) // 2 cycles
    );
}

16MHz gives the following output: (

delayMicroseconds(0) => 0.81920 ==> inf
delayMicroseconds(1) => 1.01880 ==> 0.01880
delayMicroseconds(2) => 2.02280 ==> 0.01140
delayMicroseconds(3) => 3.02880 ==> 0.00960
delayMicroseconds(4) => 4.03520 ==> 0.00880
delayMicroseconds(5) => 5.04080 ==> 0.00816
delayMicroseconds(10) => 10.07080 ==> 0.00708
delayMicroseconds(20) => 20.13200 ==> 0.00660
delayMicroseconds(50) => 50.31200 ==> 0.00624
delayMicroseconds(100) => 100.61360 ==> 0.00614

Last column is the relative error (inf due to divide by zero) and it is decreasing which is good. The value for DMS(0) is as good as it gets.

@cano64
Copy link

cano64 commented Feb 3, 2014

the issue has been fixed in this pull request #1678
I disassembled the elf file and counted clock cycles spent in the function, I also checked the results with an oscilloscope. The function is now very accurate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Core Related to the code for the standard Arduino API
Projects
None yet
Development

No branches or pull requests

6 participants