Skip to content

WifiClient.connect() crashes when connect fails and in the mean time an io interrupt occured. #2181

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pgScorpio opened this issue Jun 22, 2016 · 13 comments

Comments

@pgScorpio
Copy link

Basic Infos

Hardware

Hardware: ESP-12
Core Version: 2.2.0

Description

Problem description

1: An io interrupt handler is attached to an io pin, (in my case RcSwitch, but any io interrupt on any pin causes the same problem)

2: Trying to connect a WiFiClient to a non existing server (server is unreachable/off-line)

Problem:
If during the connect attemp any io interrupt(s) occur the esp crashes at the moment the connect attempt times-out.

Fatal exception 0(IllegalInstructionCause):
epc1=0x40216b74, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000

Exception (0):
epc1=0x40216b74 epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000

ctx: sys 
sp: 3ffffc10 end: 3fffffb0 offset: 01a0

epc1=0x40216b74 ==> Is digitalRead (called from io interrupt).

(I did not yet check if any other interrupts, like a timer, causes the same problem, but what i noticed is that it always crashes on a digitalRead called from the interrupt.)

And also when the connect is succesfull it doesn't crash (but in this case the time interval is much shorter and so the change of having an interrupt during the connect is also smaller..., but even with a remote pressed during successful connecting, interrupting almost ten thousand times a second, I couldn't get it to crash... )

   if (_client.connect(host.c_str(), hostPort)) //crashes when interrupted AND connect fails....
   {
   }

When during the connect attempt IO interrupts are disabled it won't crash...

ETS_GPIO_INTR_DISABLE();
    _client.connect(host.c_str(), hostPort);    //no crash.
ETS_GPIO_INTR_ENABLE();
   if (_client.connected())
   {
   }

@igrr
Copy link
Member

igrr commented Jun 23, 2016

ISR handlers have to be marked with ICACHE_RAM_ATTR.
There is a ticket open already to document this: #1388.

@pgScorpio
Copy link
Author

In my case they are !
Note that it is only while a connect fails. In all other cases the interrupts work flawlessly.

As far as i can see it crashes before either _s_connected or _s_error is called. (Interrupts re-enabled or leaving nmi ??)

@igrr
Copy link
Member

igrr commented Jun 23, 2016

Same goes for any functions you may be calling from ISR, make sure those are in RAM as well.

You can check where it crashes — do objdump of the output ELF file and look for 0x40216b74 (the address from your serial output).
Or just use stack dump decoding tool. https://github.com/esp8266/Arduino/blob/master/doc/Troubleshooting/stack_dump.md

@pgScorpio
Copy link
Author

I used the stack dump decoding tool.....
And as stated in my first post: epc1=0x40216b74 ==> Is digitalRead
and digitalRead is in ram....

@igrr
Copy link
Member

igrr commented Jun 25, 2016

Could you please upload the elf file from that sketch somewhere and share a link to it?
Also please turn on verbose compilation output in the IDE (in Preferences/Settings window) and paste full compilation log here.
If digitalRead ends up in 0x402xxxxx region then something is going terribly wrong with sketch compilation.

@pgScorpio
Copy link
Author

Hmmmm.... You are right @iggr, I should have noticed that...
Unfortunately I can't reproduce that exact version. I made a backup of the sketch, but forgot to backup my libraries as well... and they are WIP at the moment (adding async communication)
As soon as i have all working again i'll try to reproduce the error and provide you with elf and log files.

@pgScorpio
Copy link
Author

@igrr
I have a new version working and I can still reproduce the problem.

All files can be found here.

The problem is a bit different as it seemed.

The address should be digitalRead, as it is a inline inline call to digitalRead,

Check the code-snippets: It actually is an inline calling an inline and the first inline (cIoPin::input()) compiles really inline, but the second inline (cIoPin::getInput()) compiles as a function in flash...

And if we take a good look at the stackdump we can see another problem:
The pin interrupt is interrupted by sys_check_timeouts, this while interrupts are disabled during a pin interrupt. So either sys_check_timeouts must be a NMI or it is caused be the func-in-flash call (since at this point we have entered the interrupt handler and the first call would be cIoPin::Input()) Does reading code from flash enable interrupts ?
Nevertheless it seems that this causes the pininterrupt to fire again while already handling a pininterrupt...(since the pinInterrupt_4 return address is on stack twice)

Serial Dump:

Fatal exception 0(IllegalInstructionCause):
epc1=0x40215c4c, epc2=0x00000000, epc3=0x00000000, excvaddr=0x00000000, depc=0x00000000

Exception (0):
epc1=0x40215c4c epc2=0x00000000 epc3=0x00000000 excvaddr=0x00000000 depc=0x00000000

ctx: sys 
sp: 3ffffc10 end: 3fffffb0 offset: 01a0

>>>stack>>>
3ffffdb0:  3fff1618 3fff0554 3fff1618 00000020  
3ffffdc0:  00000000 00000001 00000010 40106f44  
3ffffdd0:  3fff07fc 80dd9999 00000010 4010765e  
3ffffde0:  c0037015 00000001 00000001 ffffff9a  
3ffffdf0:  00000006 00000006 42d80000 00000022  
3ffffe00:  3fffc200 401075fc 3fffc258 4000050c  
3ffffe10:  40004378 00000030 00000016 ffffffff  
3ffffe20:  60000200 00000008 077f077f 80000000  
3ffffe30:  20000000 3fff74a0 80000000 203fc0a0  
3ffffe40:  80000000 3fffc6fc 3fff1618 3fff74a4  
3ffffe50:  00000254 003fc0a0 60000600 00000030  
3ffffe60:  4024c898 00000001 00000010 40106f44  
3ffffe70:  4024c898 00000032 00000138 4010765e  
3ffffe80:  c0037005 00000001 00000001 00000000  
3ffffe90:  00000000 00000000 0000001f 401050f1  
3ffffea0:  4000050c 401075fc 3fffc258 4000050c  
3ffffeb0:  40000f68 00000030 0000001c ffffffff  
3ffffec0:  40000f58 00000000 00000020 00000000  
3ffffed0:  4024c654 4021f8a0 0000004b 0000012c  
3ffffee0:  000000fa 3ffec274 3fff1618 3fffdab0  
3ffffef0:  00000000 3fffdcb0 3fff1668 00000030  
3fffff00:  00000000 400042db 3ffed3ab 60000600  
3fffff10:  40004b31 3fff7404 000002f4 003fc000  
3fffff20:  401054da 3fff1640 3fff06e0 40107810  
3fffff30:  4021f371 3fff06e0 3fff1640 00d385da  
3fffff40:  3fff7404 00001000 4021f806 00000008  
3fffff50:  4024c8d8 3ffec274 4021f8b3 3fff0794  
3fffff60:  3fff1640 0041f2d0 3fff1618 3fff1640  
3fffff70:  40230f6d 3fff0794 3fff1640 00d350cd  
3fffff80:  40230fb2 3fffdab0 00000000 3fffdcb0  
3fffff90:  3fff1650 00000000 40000f65 3fffdab0  
3fffffa0:  40000f49 000181b9 3fffdab0 40000f49  
<<<stack<<<


Exception Decoder:

Decoding 32 results
0x40215c4c: cIoPin::getIput() const at H:\Users\Peter\Documents\Arduino\esp8266\libraries\pgoBitProtocol/pgoBitProtocol.cpp line 256
0x40215c4c: cIoPin::getIput() const at H:\Users\Peter\Documents\Arduino\esp8266\libraries\pgoBitProtocol/pgoBitProtocol.cpp line 256
0x40106f44: pinInterrupt_4 at H:\Users\Peter\Documents\Arduino\esp8266\libraries\pgoArduino/pgoIoPins.h line 695
0x4010765e: interrupt_handler at C:\Users\Peter\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.2.0\cores\esp8266/core_esp8266_wiring_digital.c line 175
0x401075fc: interrupt_handler at C:\Users\Peter\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.2.0\cores\esp8266/core_esp8266_wiring_digital.c line 153
0x4024c898: sys_check_timeouts at ?? line ?
0x40106f44: pinInterrupt_4 at H:\Users\Peter\Documents\Arduino\esp8266\libraries\pgoArduino/pgoIoPins.h line 695
0x4024c898: sys_check_timeouts at ?? line ?
0x4010765e: interrupt_handler at C:\Users\Peter\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.2.0\cores\esp8266/core_esp8266_wiring_digital.c line 175
0x401050f1: ets_timer_disarm at ?? line ?
0x401075fc: interrupt_handler at C:\Users\Peter\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.2.0\cores\esp8266/core_esp8266_wiring_digital.c line 153
0x4024c654: tcp_zero_window_probe at ?? line ?
0x4021f8a0: pm_get_sleep_type at ?? line ?
0x401054da: spi_flash_read at ?? line ?
0x40107810: pvPortZalloc at C:\Users\Peter\AppData\Local\Arduino15\packages\esp8266\hardware\esp8266\2.2.0\cores\esp8266/heap.c line 33
0x4021f371: pm_set_sleep_time at ?? line ?
0x4021f806: pm_get_sleep_type at ?? line ?
0x4024c8d8: sys_check_timeouts at ?? line ?
0x4021f8b3: pm_get_sleep_type at ?? line ?
0x40230f6d: ets_timer_handler_isr at ?? line ?
0x40230fb2: ets_timer_handler_isr at ?? line ?

@igrr
Copy link
Member

igrr commented Jun 26, 2016

You are reading too much into this "trace". Recursive interrupt calls are not possible in this case, what happens is that you happen to have return address of interrupt_handler on the stack in multiple places (from several subsequent calls). As you know, stack frame is not cleared or zeroed out or anything like that, so there are lots of junk values in there. ESPExceptionDecoder tries to interpret anything that looks like a function address, including values which bear no relation to the current stack frame. If you want to see a real stack trace, you need to use GDB (either with GDBStub library or openocd).

Now back to the real issue. Compiler is not obliged to inline a member function marked as inline — this keyword is merely a hint to the compiler. Usually compiling with -Os happens to go against inlining, so no big surprise that a separate function is generated. I would suggest to place function into IRAM explicitly, without relying on inlining. Alternatively, you may try __attribute__((always_inline));, no idea if that will work.

@pgScorpio
Copy link
Author

pgScorpio commented Jun 26, 2016

Sorry iggr, but that 'junk' in memory is below the current stackpointer and does not show in the stackdump ! What you see in the stackdump is what is currently pending. (return addresses and local function data). So, since the return address of the handler->fn() call in wiring_digital interrupt_handler (40106f44) is twice on stack, and I even recognise my interruptHandlers local data, recursive interrupt calls are happening in this case.

Also i don't agree on that a compiler may ignore inline, unless you tell it explicitly to ignore or optimize inlining and in the last case it should only change inline code to functions if 1: the inline code is substantial and 2: it is called multiple times. Both are not the case here....
The ignore is apparently also not the case since some inlines are actual inlined....
So it could be that optimization is enabled (-Os), but in this case the compiler does not do a great job since creating a function just to call a function makes no sense, it generates more code....
Moreover it generates and calls a function in FLASH while both caller and callee are in RAM !
And if I look at the sizes of the functions I even get the idea it generates a FLASH copy of digitalRead instead of just calling it. (getIput is 0x35 bytes, digitalRead is 0x34 bytes, just a call would be far less bytes)

Anyway I will try __attribute__((always_inline))....

Thanks for your advice,

@pgScorpio
Copy link
Author

__attribute__((always_inline)) works !! No more function calls for inlines, no more crashes....

But still I wonder why it should not be possible to call a flash function from an interrupt...
It is not wise of course, because it's time consuming, but I think the system should not crash.
I still have the idea that calling the flash function might cause the recursive interrupts and these cause the crash somehow.
My first guess is that the copy flash-to-ram routine messes with interrupt enables, thereby enabling interrupts that where disabled....
In combination with unintended calls to flash functions (because of compiler generated flash functions as in my case here) this might explain why we see so many crash problems on the esp8266.

For the same reason the use of noInterrupts(), interrupts() should be forbidden (and should, in my opinion, be removed from the library or changed to savedPS = xt_rsil(15); ... xt_wsr_ps(savedPS); functionality !)

Sure worth to look into these matters......

@igrr
Copy link
Member

igrr commented Jun 27, 2016

Regarding gcc behaviour w.r.t. inlining, i think you might be better of discussing this with someone who actually knows a thing or two about gcc. I'm not a gcc developer so I can not comment on optimization decisions which gcc does. This project simply uses a toolchain which is available.

You probably didn't understand my comment regarding junk on the stack frames. Let me try to explain again. Stack frames are not "dense", i.e. when the compiler allocates 8 words for a stack frame it doesn't mean that each and every word will have a value written into it. Some words may stay unused, and will have whatever values they had previously, essentially being "junk" values. This is even more obvious with interrupt handlers, which always have extra 0x100 bytes of unused stack space allocated in prologue.

Consider a program which has three functions, a, b, and c, suppose each requires 0x10 bytes of stack. There is also an ISR handler routine, h, and L1 handler which calls it.
Look at what happens with the following sequences of calls.

  1. function a runs, then ISR occurs.
ISR frame (0x10)
L1 handler frame (0x10)
L1 handler (0x10)
Uninitialized space(0x100)
`a` frame(0x10)
----------- stack bottom (address space top)
  1. Some time later, function b runs, called from a. At this moment, interrupt occurs.
ISR frame(0x10)
L1 handler frame (0x10)
Old ISR frame(0x10)                       \  
Old L1 handler(0x10)                       | -  0x100 bytes of unitialized space in front of L1 handler frame
More uninitialized space (0xe0)           / 
`c` frame (0x10)
`b` frame (0x10)
`a` frame (0x10)
----------- stack bottom (address space top)

If you were to dump the stack in case 2, and then decode every instruction-address-like value into code line, you would see PC of ISR appearing twice on the "stack trace". Of course, such list of decoded instruction-address-like values is not a stack trace. To get a real stack trace, you need to use GDB.

Next thing, on the flash caching. Execution of code from flash happens via a bit of hardware which can be enabled or disabled. There are no software routines involved (what you describe as "copy from flash to RAM"). Different parts of WiFi stack can disable flash caching for short periods of time, because of "Reasons" (TM). If the ISR happens at this time, there is no way to execute code from flash because the cache is disabled. If you really want your handler to be in flash, you can write a wrapper around your ISR handler to check if cache is enabled and bail out if it isn't, setting a flag somewhere to run the handler later.

@pgScorpio
Copy link
Author

Thanks for the explanation.

I know a lot, but I don't know that much about esp8266 and gcc internals.
Always learning, even after 35 years of experience...
Unfortunately it's very hard to find GOOD documentation on these items. Do you know where to find them ?

@igrr
Copy link
Member

igrr commented Jun 27, 2016

On GCC end, the usual set of online docs for gcc project.
As for the esp8266, there is no good documentation on internals. Just bits and pieces you find on esp8266.com, in SDK dissasembly, in the esp8266 reverse engineering wiki, in issue discussions of this project and esp-open-sdk.

I'm closing the issue since the ISR problem is resolved for you.

@igrr igrr closed this as completed Jun 27, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants