-
-
Notifications
You must be signed in to change notification settings - Fork 7k
bug in delayMicroseconds() when called with 0 #1121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hello @RobTillaart |
Hi, First, 120 and 116 are in practice the same value as micros() has an accuracy of 4 us. As the same code path is executed for p=1 and p=0 this output is to be expected. Second, of course it is better not to call delayMicroSeconds(0) but sometimes the parameter is the result of an calculation, an IO operation from a sensor or input from a user (etc). It would be better to handle zero value before the call is made, e.g. like: if (p >= 1) delayMicroSeconds(p); But for those people that still use param 0 the function should do its utter best to return asap. The old code did not do that. I have considered to make delayMicroSeconds() a #define but macros are not trivial either (e.g. different frequencies) and it would possibly break compatibility (not investigated). A macro could prevent the call to the real delay function with an if like above. I did not explore it further but it should be something like #if F_CPU >= 16000000L #define delayMicroSeconds(p) (p)<1?0:delayUSec(p); #else // assume 8MHz #define delayMicroSeconds(p) (p)<2?0:delayUSec(p); #end Hope this helps, Regards, |
Any planned action on this? Surely delayMicroSeconds(0) should return as promptly as possible? |
Current proposal returns as fast as possible. if (us < 2) return; is the first statement. If the delay is smaller than 2 micros the overhead of the function call itself is larger. The original code does a --us which causes a wrap around if us == 0 in the call. That is faulty by design. The returned values of 120 and 116 are for 100 calls, so on average the call takes 1.18 micros, The current implementation takes 1648292 micros for 100 calls delayMicros(0); @nick, |
Have you already explored the possibility to inline the whole function? C |
new patched function, with support for 20Mhz /* Delay for the given number of microseconds. Assumes a 8 or 16 MHz clock. */
void delayMicroseconds(unsigned int us)
{
// calling avrlib's delay_us() function with low values (e.g. 1 or
// 2 microseconds) gives delays longer than desired.
//delay_us(us);
#if F_CPU >= 20000000L
// for the 20 MHz clock on rare Arduino boards
// for a one-microsecond delay, simply wait 2 cycle and return. The overhead
// of the function call yields a delay of exactly a one microsecond.
__asm__ __volatile__ (
"nop" "\n\t"
"nop"); //just waiting 2 cycle
// PATCH
// if (--us == 0)
// return;
if (us < 2) return;
us--;
// the following loop takes a 1/5 of a microsecond (4 cycles)
// per iteration, so execute it five times for each microsecond of
// delay requested.
us = (us<<2) + us; // x5 us
// account for the time taken in the preceeding commands.
us -= 2;
#elif F_CPU >= 16000000L
// for the 16 MHz clock on most Arduino boards
// for a one-microsecond delay, simply return. the overhead
// of the function call yields a delay of approximately 1 1/8 us.
// PATCH
// if (--us == 0)
// return;
if (us < 2) return;
us--;
// the following loop takes a quarter of a microsecond (4 cycles)
// per iteration, so execute it four times for each microsecond of
// delay requested.
us <<= 2;
// account for the time taken in the preceeding commands.
us -= 2;
#else
// for the 8 MHz internal clock on the ATmega168
// for a one- or two-microsecond delay, simply return. the overhead of
// the function calls takes more than two microseconds. can't just
// subtract two, since us is unsigned; we'd overflow.
// PATCH
// if (--us == 0)
// return;
// if (--us == 0)
// return;
if (us < 3) return;
us -= 2;
// the following loop takes half of a microsecond (4 cycles)
// per iteration, so execute it twice for each microsecond of
// delay requested.
us <<= 1;
// partially compensate for the time taken by the preceeding commands.
// we can't subtract any more than this or we'd overflow w/ small delays.
us--;
#endif
// busy wait
__asm__ __volatile__ (
"1: sbiw %0,1" "\n\t" // 2 cycles
"brne 1b" : "=w" (us) : "0" (us) // 2 cycles
);
} |
@cmaglie static inline void delayMicroseconds(uint32_t usec){
uint32_t n = usec * (VARIANT_MCK / 3000000);
asm volatile(
"L_%=_delayMicroseconds:" "\n\t"
"subs %0, #1" "\n\t"
"bge L_%=_delayMicroseconds" "\n"
: "+r" (n) :
); I'm not ARM assembler expert, but ...
I cannot verify this as I have my DUE not operational, but as far as I understand the assembler code, the above is what I expect to happen. It is a similar error to the if (--us == 0) return; We may not assume that the code is never called with usec == zero. Can you measure the duration of a call to delayMicroseconds(0) on a DUE? The code can be fixed, check for zero before anything else. static inline void delayMicroseconds(uint32_t usec){
if (usec == 0) return; // or (usec < 2) ...
uint32_t n = usec * (VARIANT_MCK / 3000000);
asm volatile(
"L_%=_delayMicroseconds:" "\n\t"
"subs %0, #1" "\n\t"
"bge L_%=_delayMicroseconds" "\n"
: "+r" (n) :
); opinion? Thanks for your time, |
I remember that the version I've taken from Teensy3 contained the "bne" opcode. I changed to "bge" to solve the delayMicrosecond(0) case: http://forum.arduino.cc/index.php?topic=141030.msg1117412#msg1117412 As stated here: https://sourceware.org/cgen/gen-doc/arm-thumb-insn.html#insn-bne the bge should exit if only ONE of Null-flag or Overflow-flag (V) is set. This should handle all cases except very long delays (>2^31uS) that are very unlike to happen. I run this test on the Due: #define DM1(x) delayMicroseconds(x)
#define DM5(x) DM1(x);DM1(x);DM1(x);DM1(x);DM1(x)
#define DM25(x) DM5(x);DM5(x);DM5(x);DM5(x);DM5(x)
#define DM125(x) DM25(x);DM25(x);DM25(x);DM25(x);DM25(x)
#define DM500(x) DM125(x);DM125(x);DM125(x);DM125(x)
#define DM1000(x) DM500(x);DM500(x)
void setup() {
Serial.begin(115200);
test(1);
test(2);
test(3);
test(4);
test(5);
test(0);
test(10);
}
void test(int d) {
unsigned long t = micros();
for (volatile int i = 0; i < 1000; i++) {
DM1000(d);
}
t = micros() - t;
Serial.print("delayMicroseconds(");
Serial.print(d);
Serial.print(") => ");
Serial.print(String(((float)t)/1000000.0, 5));
Serial.println("uS");
}
void loop() { } The results with "bge" are:
The results with "bne" are:
bge is not as precise as bne, but I don't know why, maybe the 30nS drift are extra clock cycles eaten by bge at exit? On the other side we can see that bne fails with delays of 0. C |
OK, the due has a correct delayMicroSeconds. delayMicroseconds(1) => 1.03707uS The errors are small and they seem relative linear The "bne" version is a factor 10 more accurate and if a test for zero
can be added? Still the delayMicroseconds(0) fails on the AVR platform. |
I discovered that the compiler is able to perform a very aggressive optimization (it can factor the formula to be executed only once per loop!). A more fair benchmark is the following: static inline void delay_bne(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bne(uint32_t usec) {
uint32_t n = usec * (VARIANT_MCK / 3000000);
asm volatile(
"L_%=_delayMicroseconds: \n\t"
"subs %0, #1 \n\t"
"bne L_%=_delayMicroseconds \n"
: "+r" (n) :
);
}
static inline void delay_bne_with_check(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bne_with_check(uint32_t usec) {
if (usec == 0) return;
uint32_t n = usec * (VARIANT_MCK / 3000000);
asm volatile(
"L_%=_delayMicroseconds: \n\t"
"subs %0, #1 \n\t"
"bne L_%=_delayMicroseconds \n"
: "+r" (n) :
);
}
static inline void delay_bge(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bge(uint32_t usec) {
uint32_t n = usec * (VARIANT_MCK / 3000000);
asm volatile(
"L_%=_delayMicroseconds: \n\t"
"subs %0, #1 \n\t"
"bge L_%=_delayMicroseconds \n"
: "+r" (n) :
);
}
static inline void delay_bge_with_check(uint32_t) __attribute__((always_inline, unused));
static inline void delay_bge_with_check(uint32_t usec) {
if (usec == 0) return;
uint32_t n = usec * (VARIANT_MCK / 3000000);
asm volatile(
"L_%=_delayMicroseconds: \n\t"
"subs %0, #1 \n\t"
"bge L_%=_delayMicroseconds \n"
: "+r" (n) :
);
}
#define DM1(x) delay_bge_with_check(x)
#define DM5(x) DM1(x);DM1(x);DM1(x);DM1(x);DM1(x)
#define DM25(x) DM5(x);DM5(x);DM5(x);DM5(x);DM5(x)
#define DM125(x) DM25(x);DM25(x);DM25(x);DM25(x);DM25(x)
#define DM500(x) DM125(x);DM125(x);DM125(x);DM125(x)
#define DM1000(x) DM500(x);DM500(x)
void setup() {
Serial.begin(115200);
test(1);
test(2);
test(3);
test(4);
test(5);
test(10);
test(0);
}
void test(int d) {
volatile int _d = d; // No optimizations
unsigned long elapsed = micros();
for (int i = 0; i < 1000; i++) {
DM1000(_d);
}
unsigned long elapsed_end = micros();
float delta = (elapsed_end - elapsed);
Serial.print("delayMicroseconds(");
Serial.print(d);
Serial.print(") => delta ");
Serial.print(String(delta / 1000000.0, 5));
Serial.print("uS, ");
elapsed = micros();
for (int i = 0; i < 1000; i++) {
DM1000(d);
}
elapsed_end = micros();
delta = (elapsed_end - elapsed);
Serial.print("with optimization ");
Serial.print(String(delta / 1000000.0, 5));
Serial.println("uS");
}
void loop() { } and here the results:
Looking at the "unoptimized" column (that I guess is the most common case) the precision of the bne version is not so striking against bge, but its still more precise. Now my original question: in the AVR core I see that delay is a non-inlined function, so we have to do all the tricks to bias the function call delays. Inlining the delay may help to simplify that? C |
1,00000 1,06679 0,06679 1,09665 0,09665 The relative error of BNE + CHECK versus BGE shows the difference more explicitely. BNEC has 25-30% less relative error than BGE Conclusion go for BNEC.
No inlining will not help, the idea of the uDelay is that you calculate the number of iterations for a very tight loop to mimic a 'clock'. For very small values of the delay calculating the number of iterations already takes too much time. So even inlined code will need 'tricks' (maybe less). Furthermore inlining means larger footprint and although this might be little, it adds up (in my own code, libraries etc). So for the AVR I propose not to inline, but add the check that tests for values of 0 and 1 before calculating the number of iterations. My current (AVR) implementation differs a bit of the proposal above to get a better timing for the low values. Note: the 20 and 8 MHz versions are (manual) optimized on a 16Mhz board and are not confirmed yet on a real board. /* Delay for the given number of microseconds. Assumes a 8 or 16 MHz clock. */
void delayMicroseconds(unsigned int us)
{
// calling avrlib's delay_us() function with low values (e.g. 1 or
// 2 microseconds) gives delays longer than desired.
// delay_us(us);
#if F_CPU == 20000000L
// for the 20 MHz clock on rare Arduino boards
// handle 0 and 1
if (us == 0) return;
__asm__ __volatile__ (
"nop" "\n\t"
"nop" "\n\t"
"nop" "\n\t"
"nop");
if (us == 1) return;
// calc # iterations
us = (us<<2) + us; // x5 us
// account for the time taken in the preceeding commands.
us -= 7;
#elif F_CPU == 16000000L
// for the 16 MHz clock on most Arduino boards
// handle 0 and 1
if (us == 0) return;
if (us == 1) return;
// calc # iterations
us <<= 2;
// account for the time taken in the preceeding commands.
us -= 5;
#elif F_CPU == 8000000L
// for the 8 MHz internal clock on the ATmega168
// handle 0, 1;
if (us < 2) return;
// handle 2;
if (us == 2) return;
// calc # iterations
us <<= 1;
// account for the time taken in the preceeding commands.
us -= 5;
// quarter uSecond adjust
__asm__ __volatile__ (
"nop\n\t"
);
#elif F_CPU == 1000000L
// TODO 1 MHz
#error "delayMicroseconds: 1 MHz not supported yet"
# else
#error "delayMicroseconds: not supported clockspeed"
#endif
// busy wait
__asm__ __volatile__ (
"1: sbiw %0,1" "\n\t" // 2 cycles
"brne 1b" : "=w" (us) : "0" (us) // 2 cycles
);
} 16MHz gives the following output: ( delayMicroseconds(0) => 0.81920 ==> inf Last column is the relative error (inf due to divide by zero) and it is decreasing which is good. The value for DMS(0) is as good as it gets. |
the issue has been fixed in this pull request #1678 |
see also - http://arduino.cc/forum/index.php/topic,132983.msg1000603.html#msg1000603
The delayMicroseconds function fails when called with a value of 0.
Although this is not done normally it can happen when called with a variable (result of some math). The bug can be traced back to the if (--us ==0) statement
BUG in 0.22 + 1.0.0 + 1.0.2
Find test program and patch below (the patch does not include the 1.0.2 20Mhz addition as I cannot test it)
Note: The patch is for up to 1.0.0 version.
Program to show bug
Find below a patch for the function: (0.22 - 1.0.0)
The text was updated successfully, but these errors were encountered: