Skip to content

Occasional MemoryError during Adafruit IO loop() #101

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
calcut opened this issue Feb 18, 2022 · 26 comments
Closed

Occasional MemoryError during Adafruit IO loop() #101

calcut opened this issue Feb 18, 2022 · 26 comments
Assignees

Comments

@calcut
Copy link
Contributor

calcut commented Feb 18, 2022

Adafruit CircuitPython 7.1.1 on 2022-01-14; Adafruit Feather ESP32S2 with ESP32S2

Circup freeze:
Found device at /Volumes/CIRCUITPY, running CircuitPython 7.1.1.
adafruit_datetime==1.1.6
adafruit_htu31d==1.1.3
adafruit_requests==1.10.6
adafruit_ticks==1.0.1
neopixel==6.2.4
adafruit_bus_device==5.1.4
adafruit_io==5.6.1
adafruit_logging==3.7.5
adafruit_minimqtt==5.2.1
adafruit_register==1.9.8

ERROR Traceback (most recent call last):
  File "mcu.py", line 304, in aio_receive
  File "adafruit_io/adafruit_io.py", line 233, in loop
  File "/lib/adafruit_minimqtt/adafruit_minimqtt.py", line 805, in loop
  File "/lib/adafruit_minimqtt/adafruit_minimqtt.py", line 854, in _wait_for_msg
  File "/lib/adafruit_minimqtt/adafruit_minimqtt.py", line 906, in _sock_exact_recv
MemoryError: memory allocation failed, allocating 4294953983 bytes

I'm running io.loop(timeout=0.1) in mcy.py, line 304
Normally this works fine, but every few minutes I get the above exception.
If I ignore the exception, the code continues to run with no obvious issues, but I wanted to report it anyway.

digging into adafruit_minimqtt.py, and adding a print(bufsize) shows me that the error on line 906 occurs when bufsize is less than zero. e.g. -13313. So I can see how the code ended up trying to allocate ~4GB of memory (2^32-13313)

I also notice that the bufsize immediately before the negative one is large and positive e.g. 13365

I could try harder to make a minimum reproducable example if needed.

@calcut
Copy link
Contributor Author

calcut commented Feb 18, 2022

for the avoidance of doubt, this error also occurs with the default io.loop(timeout=1)

@ladyada
Copy link
Member

ladyada commented Feb 20, 2022

thats odd, how could you get a request for a negative amount? but yeah, a reproducable example would help a ton!

@calcut
Copy link
Contributor Author

calcut commented Feb 20, 2022

code_dot_py.txt

Attached is my minimal example, it seems to be related to:

io.subscribe_to_time("seconds")

which means I receive a MQTT message every second.
Not certain its the volume of messages that means it is more likely to show up, or something buggy with those particular messages.

NB, io.subscribe_to_time("ISO-8601") also triggers the MemoryError, and those messages have a different content.

@calcut
Copy link
Contributor Author

calcut commented Feb 20, 2022

A bit more digging...
When the error occurs, I can see that topic_len = bytearray(b'01'), in adafruit_minimqtt.py line 843
topic_len = self._sock_exact_recv(2)
Typically it is bytearray(b'\x00\r')

Not sure what line 844 achieves (some sort of decoding?)
topic_len = (topic_len[0] << 8) | topic_len[1]

But in the error case, I have topic_len[0] = 48, and topic_len[1] = 49
so end up with topic_len = 12337

Finally... after line 848:
sz -= topic_len + 2

I have a negative sz of -12285, which leads to the error.

I don't understand the mechanics enough to know why this has gone wrong or how to fix.

@calcut
Copy link
Contributor Author

calcut commented Feb 20, 2022

If you are trying to reproduce... the error usually occurs every few minutes, but I have seen it go for ~6 hours with no errors

@calcut
Copy link
Contributor Author

calcut commented Feb 21, 2022

I ran a test overnight, subscribed to 8 random feeds, but NOT to time
e.g. io.subscribe_to_randomizer("2278")

I didn't see the MemoryError. Not exactly conclusive, but my guess is the issue is with the AIO server sending the time messages.

@calcut
Copy link
Contributor Author

calcut commented Feb 21, 2022

Well, I've got stuck into this now...

Here is some output with a print() in the _sock_exact_recv() function

Normal message:

received bytearray(b'0')
received bytearray(b'\x18')
received bytearray(b'\x00\x0c')
received bytearray(b'time/seconds')
received bytearray(b'1645437393')
seconds = 1645437393

Error Case:

received bytearray(b'\x18')
received bytearray(b'\x00')  <-- For some reason we only got 1 byte this time, when we should have 2
received bytearray(b'\x0c')
received bytearray(b't')
received bytearray(b'i')
received bytearray(b'm')
received bytearray(b'e')
received bytearray(b'/')
received bytearray(b's')
received bytearray(b'e')
received bytearray(b'c')
received bytearray(b'o')
received bytearray(b'n')
received bytearray(b'd')
received bytearray(b's')
received bytearray(b'1')
received bytearray(b'6')
received bytearray(b'45')
received bytearray(b'437394\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
...
...
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')

@calcut
Copy link
Contributor Author

calcut commented Feb 21, 2022

OK, this time I'm printing the number of bytes actually requested with print(f"{bufsize}, {rc}")
It explains why I got only 1 byte (b'\x00', from my previous comment).

It's now looking like I'm missing the first b'0', which throws off the sequence for everything else.

Normal Message:

1, bytearray(b'0')
1, bytearray(b'\x18')
TL  <-- a debug print to show we are about to request the 2 bytes for topic length
2, bytearray(b'\x00\x0c')
12, bytearray(b'time/seconds')
10, bytearray(b'1645446561')
seconds = 1645446561

Error Case:

       <---- I've noticed there is no bytearray(b'0') here, but there should be.
1, bytearray(b'\x18')
1, bytearray(b'\x00')
1, bytearray(b'\x0c')
1, bytearray(b't')
1, bytearray(b'i')
1, bytearray(b'm')
1, bytearray(b'e')
1, bytearray(b'/')
1, bytearray(b's')
1, bytearray(b'e')
1, bytearray(b'c')
1, bytearray(b'o')
1, bytearray(b'n')
1, bytearray(b'd')
1, bytearray(b's')
1, bytearray(b'1')
1, bytearray(b'6')
TL   <-- The 2 bytes for topic_length are not requested until here.
2, bytearray(b'45')
13365, bytearray(b'446562\x00\x00\x00\x00\x00\x
...

@calcut
Copy link
Contributor Author

calcut commented Feb 21, 2022

New idea, I'm encountering some sort of race condition by hitting the loop() function every 100ms, which happens to be the same timeout as def _wait_for_msg(self, timeout=0.1):

Trying with 200ms instead...

@calcut
Copy link
Contributor Author

calcut commented Feb 21, 2022

well, just had the error again with 200ms timeouts. Although it took a while (1 hour maybe)
I reckon I'm just going to catch and ignore the exception.

    if (time.monotonic() - timer_200ms) >= 0.2:
        timer_200ms = time.monotonic()
        io.loop(timeout=0.2)

@calcut
Copy link
Contributor Author

calcut commented Feb 22, 2022

Possibly not very helpful... but I've just had the same error while running io.get(feed) as opposed to io.loop().
This was during initialisation where I'm first subscribing to feeds

@brentru
Copy link
Member

brentru commented Feb 22, 2022

Instead of using the IO library (io.), could you try subscribing to your time feed via MiniMQTT instead?

@calcut
Copy link
Contributor Author

calcut commented Feb 24, 2022

OK, I could still see the error with just MiniMQTT. Code attached.
memory_error_mqtt_only.txt
However, I think fiddling with the timeout and timer can make a difference. Will report back when I have something more solid RE: timings.

Also interesting: I haven't managed to recreate the error on a 2nd ESP32-S2 Feather, even running exactly the same code. So a hardware corner case perhaps, always fun. Ignore that, I have now seen it on both

@calcut
Copy link
Contributor Author

calcut commented Feb 24, 2022

Its looking like calling mqtt_client.loop(timeout=0.1) every 200ms or faster can lead to the error.
At 300ms interval it is looking OK.
There isn't a particularly good reason for me to call it this often, other than I wanted a quick response while doing a live demonstration.

(The timeout=0.1 part doesn't seem to matter other than potentially limiting the overall rate due to blocking)

@brentru
Copy link
Member

brentru commented Mar 10, 2022

@calcut Thank you for doing performance analysis, this is good info! Do you still get a MemoryError by running io.loop(timeout=0.3), since you saw the issue is only occurring <= 200ms?

@calcut
Copy link
Contributor Author

calcut commented Mar 11, 2022

Realised I wasn't clear about that. I'm running another test with mqtt_client.loop(timeout=0.3)
So far no errors, but I did comment earlier:

for the avoidance of doubt, this error also occurs with the default io.loop(timeout=1)

So I'm not sure why I saw it before and not now. Perhaps I was mistaken previously.

@calcut
Copy link
Contributor Author

calcut commented Mar 11, 2022

I've now seen the MemoryError on both of my boards with mqtt_client.loop(timeout=0.3).
It usually takes longer to show up though, because the 0.3s is blocking therefore the function is called roughly every 300ms rather than 100ms.

code
memory_error_mqtt_LEDs.txt

My neopixel usually gets to MAGENTA but I've never seen it go YELLOW.

@calcut
Copy link
Contributor Author

calcut commented Sep 2, 2022

Just ran this again with the latest version of the MQTT library (5.5.0). No change (as expected). I'll have one more stab at figuring out what is happening

@calcut
Copy link
Contributor Author

calcut commented Sep 2, 2022

partly for my own sanity, this is what I think is happening in more detail

when things work as expected:
in adafruit_minimqtt.py, in the _wait_for_msg() function, we first request 1 byte.

We expect the response to be bytearray(b'0'), which is actually 0x30
This indicates there are more bytes to come.

In this example we expect 0x18 (i.e 24) more bytes, including the topic

we then request 2 more bytes to find out the topic length
And get 0x000c (12) in my example, corresponding to “time/seconds”

from there we can figure out that there are still (24 - 12 - 2 ) = 10 more bytes to come, and read those in to get the posix time. All good.

When it breaks
We expect the first response to be bytearray(b'0'), but we actually get 0x18.

So I think the server has tried to respond with bytearray(b'0'), but it was too slow or got lost and we timed out.
The server hasn't realised this so happily continues along the sequence and serves up 0x18 next time we get to the _wait_for_msg() function.

Looking at Timeouts
The loop() function has a timeout parameter that is not passed to _wait_for_msg().
so wait_for_msg() always has a timeout of 0.1s, however this is not applied until after the first byte has been received.
So actually we do have a 1s timeout (due to loop() funciton default) when waiting for bytearray(b'0')
But this is effectively blocking, so can make code unresponsive compared to a shorter timeout.

Comparing to my previous debug efforts
I didn't see the issue when I increased the interval between calling loop() to 0.3s.
This is probably because the server times out after about this time and does not try to "continue" the previous failed transaction.

Fix Attempt
I understand this could be a niche error, but I'm trying out a fix of adding a delay when the condition is detected to allow the server to timeout and start again.
The condition is already detected by the line if res[0] & 0xF0 != 0x30:
which checks that the first byte is bytearray(b'0') and then immediately returns if so.
Unfortunately even a time.sleep(2) added there does not prevent the issue, so I must have an assumption wrong somewhere!

@calcut
Copy link
Contributor Author

calcut commented Sep 2, 2022

Still testing... but an alternative solution may be to use non-blocking mode i.e. loop(timeout=0)
Not sure of the history, perhaps this was the intended mode, going by the docstring for loop() ?

    """Non-blocking message loop. Use this method to
    check incoming subscription messages."""

Anyway, this is currently broken because it can produce an EAGAIN exception (saying there is no data to receive)
I'm going by this reference:
https://github.com/adafruit/Adafruit_CircuitPython_HTTPServer/blob/ea9a2e0cc3754c151fc4472dda8856b3c9015d8c/adafruit_httpserver.py#L361

It looks like we could catch and ignore that exception in a similar way, like

def _wait_for_msg(self, timeout=0.1):
        """Reads and processes network events."""
        # CPython socket module contains a timeout attribute
        if hasattr(self._socket_pool, "timeout"):
            try:
                res = self._sock_exact_recv(1)
            except self._socket_pool.timeout:
                return None
        else:  # socketpool, esp32spi
            try:
                res = self._sock_exact_recv(1)
            except OSError as error:
                if error.errno == errno.ETIMEDOUT:
                    # raised by a socket timeout if 0 bytes were present
                    return None
                if error.errno == errno.EAGAIN:
                    # there is no data available right now, try again later.
                    return None
                raise MMQTTException from error

I'll do a PR next week if all is looking well

@sgbaird
Copy link

sgbaird commented Oct 4, 2022

@calcut wondering if you could have a look at adafruit/circuitpython#6988

@framework-circuitpython

Not sure if I should open another issue or not. It felt like this should go here but it feels like the issue goes deeper than this:

Using loop() in 8.0.0-beta always resulted in a fatal error as error.errno returned -116 which is not caught by ETIMEDOUT or EAGAIN:

def _wait_for_msg(self, timeout=0.1):
        """Reads and processes network events."""
        # CPython socket module contains a timeout attribute
        if hasattr(self._socket_pool, "timeout"):
            try:
                res = self._sock_exact_recv(1)
            except self._socket_pool.timeout:
                return None
        else:  # socketpool, esp32spi
            try:
                res = self._sock_exact_recv(1)
            except OSError as error:
                ### error.errno returns -116, this does not match either ETIMEDOUT or EAGAIN###
                if error.errno == errno.ETIMEDOUT:
                    # raised by a socket timeout if 0 bytes were present
                    return None
                if error.errno == errno.EAGAIN:
                    # there is no data available right now, try again later.
                    return None
                raise MMQTTException from error

My short term solution was to abs(error.errno) but like I said, it feels like this goes deeper than this library.

@dhalbert
Copy link
Contributor

dhalbert commented Oct 18, 2022

This seems like it might be a circuitpython core bug. self._sock_exact_recv() calls self._sock.recv_into() in the underlying code. I will check the exceptions raise from that.

@dhalbert
Copy link
Contributor

@framework-circuitpython Please try 8.0.0-beta.2 or later. There was an errno error fixed for Pico W in beta.2: adafruit/circuitpython#7048.

@vladak
Copy link
Contributor

vladak commented Jan 12, 2023

@calcut dunno about the other problems, but I believe the negative allocation reported initially should be fixed in the next release (PR #133) - the one that comes after 7.1.0, that is.

@brentru
Copy link
Member

brentru commented Apr 24, 2023

I've tested a modified version of the code posted above by Calcut on Adafruit CircuitPython 8.0.5 on 2023-03-31; Adafruit FunHouse with ESP32S2. I am not getting the same error, a MemoryError is not raised.

Looks to be resolved by adafruit/circuitpython#6988 and #133

REPL Log:

time/seconds = 1682355661
....
time/seconds = 1682356603
time/seconds = 1682356604
time/seconds = 1682356605
time/seconds = 1682356606
time/seconds = 1682356607

Code:

# Code via https://github.com/adafruit/Adafruit_CircuitPython_MiniMQTT/issues/101#issuecomment-1046331006
import os
import time
import wifi
import ssl
import socketpool
import adafruit_minimqtt.adafruit_minimqtt as MQTT
from adafruit_io.adafruit_io import IO_MQTT


def message_callback(client, feed_id, payload):
    print(f"{feed_id} = {payload}")

wifi.radio.connect(os.getenv('CIRCUITPY_WIFI_SSID'), os.getenv('CIRCUITPY_WIFI_PASSWORD'))

pool = socketpool.SocketPool(wifi.radio)

# Initialize a new MQTT Client object
mqtt_client = MQTT.MQTT(
    broker="io.adafruit.com",
    username=os.getenv('aio_user'),
    password=os.getenv('aio_pass'),
    socket_pool=pool,
    ssl_context=ssl.create_default_context(),
)

mqtt_client.connect()
mqtt_client.on_message = message_callback
mqtt_client.subscribe("time/seconds")

timer = 0
interval = 0.1

while True:

    if (time.monotonic() - timer) >= interval:
        timer = time.monotonic()

        try:
            mqtt_client.loop(timeout=0.3)
        except MemoryError as e:
            interval += 0.1
            print(f"interval = {interval}")
            if interval == 0.2:
                print("RED")
            elif interval == 0.3:
                print("MAGENTA")
            elif interval == 0.4:
                print("YELLOW")
            elif interval == 0.5:
                print("CYAN")
        except Exception as e:
            print(e)

Closing this out as it seems to be resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants