Skip to content

is free() async in ESP8266 Arduino core? #7089

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mamama1 opened this issue Feb 17, 2020 · 10 comments
Closed

is free() async in ESP8266 Arduino core? #7089

mamama1 opened this issue Feb 17, 2020 · 10 comments

Comments

@mamama1
Copy link

mamama1 commented Feb 17, 2020

Hi,

i'm wondering whether free() is async in the Arduino core of the ESP8266 and if it may encounter issues in my special case. As my code consists of tons of classes, cpp and h files, I'll try to first describe the issue here, maybe the root cause will seem completely obvious to you core developers.
If we need to, we can always dig further down.

So I'm having a globally declared vector called "tasks" of a custom struct called "task" of which one of the items is a char *ptr.

The vector is basically a list of tasks which are queued in a callback function of the AsyncWebserver made by @me-no-dev and which then are processed by a processTasks() function which is called in the main loop().

To create a new task, I call my createTask function where I create a task newTask (in the stack) struct element and the I do a malloc for the len of the incoming data buffer from the AsyncWebServer (about 200 bytes, depends on incoming data). The malloc pointer is saved to the char *ptr of the above mentioned, new struct element called "newTask" (which lives in the stack).

When a task is finished during running processTasks(), I do free(tasks.back().ptr) and then tasks.pop_back();

During the free() operation, I sometimes (quite often actually, but absolutely not always) various exceptions. These include exception 3, 9, 28 and/or 29.

The fact that the exception does not occur always and that it depends (often) on how fast I spam the ESP8266 with new tasks, makes me think that the free() might happen later, after the task has been popped from the vector.

So I tried the following, which didn't help either:

char *savePtr;
savePtr = tasks.back().ptr;
tasks.pop_back();
free(savePtr);

When I stop using free, and therefore create a memory leak, the exceptions/crashes stop immediately. I'm guessing that there is happening a free() operation to a place in memory where free() is not possible because either the pointer gets somehow invalid or maybe because there is a bug/design flaw in the malloc/free routines of ESP8266?

I noticed that @me-no-dev implemented a custom LinkedList for a similar task like I have but my brain was about to explode when I tried to wrap it around that sick stuff :-D I suppose there might have been a very good reason for @me-no-dev to implement that LinkedList to accomplish this very similar to mine task (add tasks to the back, pop them from the front). I must say I was using a deque before the vector but behaviour was exactly the same.

Arduino Core Version is the most current (2.6.3), IDE is PlatformIO. I'm using LWIP2_LOW_MEMORY, no other build parameters.

The trace using the exception decoder always pointed around the free() operation. And as mentioned, the ESP stops crashing as soon as I stop freeing the buffer.

Some snippets - this is the struct, I'm talking about:

typedef struct WebSocketServerTask
	{
		eWebSocketCommand_t CommandID;
		uint32_t ClientID;
		char *Data;
	} WebSocketServerTask_t;

This is the addTask function:

void WebSocketServer::AddTask(eWebSocketCommand_t command, uint32_t clientID, uint8_t *data, size_t len)
{
	char *pDataBuffer = NULL;

	if (WebSocketServerTasks.size() > WEBSOCKET_TASKS_COUNT_MAX)
	{
		Serial.printf_P("[ERROR] Too many active tasks. Max allowed: %d\r\n", WEBSOCKET_TASKS_COUNT_MAX);
		return;
	}

	if (len > JSON_STRING_LENGTH_MAX - 1)
	{
		Serial.printf_P("[ERROR] JSON string too long! Length is: %d max: %d\r\n", len, JSON_STRING_LENGTH_MAX);
		return;
	}

	if (len > 0)
	{
		// Allocate memory for WebSocket data
		pDataBuffer = (char *)malloc(len + 1);	// + 1 makes space for the \0 character
		
		// Great success?
		if (pDataBuffer == NULL)
		{
			Serial.printf("[ERROR] Could not allocate memory for WebSocket data in function: %s\r\n", __FUNCTION__);
			return;
		}

		// Copy WebSocket data to the new buffer
		memcpy(pDataBuffer, data, len);
		pDataBuffer[len] = '\0';
	}

	// Create the Task
	WebSocketServerTask_t newTask;
	newTask.CommandID = command;
	newTask.ClientID = clientID;
	newTask.Data = pDataBuffer;
	
	// Add the Task to the queue	
	WebSocketServerTasks.push_back(newTask);

	#ifdef WS_DEBUG
		WebSocketServer::PrintTasks();
	#endif
}

This ist the processTasks where the malloced buffer gets freed again:

void WebSocketServer::ProcessTasks(void)
{
	bool status = true;
	size_t n = 0;
	size_t count = 0;

	char buffer[WEBSOCKET_RESPONSE_BUFFER_LENGTH] = {0, };
	char WebSocketResponseBuffer[WEBSOCKET_RESPONSE_BUFFER_LENGTH] = {0, };
	
	// Is there a Task to process?
	if (WebSocketServerTasks.empty())
		return;

	switch (WebSocketServerTasks.front().CommandID)
	{
		case eWebSocketCommand::GET_VERSION:
			break;
			
		case and so on...

		default:
			Serial.printf("[ERROR] Unsupported CommandID: %d\r\n", WebSocketServerTasks.front().CommandID);
			break;
	}

	Serial.printf("WebSocketResponseBuffer: %s\r\n", WebSocketResponseBuffer);

	// Send response to the client
	WebSocket.text(WebSocketServerTasks.front().ClientID, WebSocketResponseBuffer, strlen(WebSocketResponseBuffer));
	
	// Free the memory that was occupied by the data of this task
	free(WebSocketServerTasks.front().Data);

	// Remove this task from the queue
	WebSocketServerTasks.pop_front();

	#ifdef WS_DEBUG
		if (WebSocketServerTasks.empty()) Serial.printf("[INFO] All WebSocket tasks processed.\r\n");
	#endif
}

Any hints?

Thanks!

@earlephilhower
Copy link
Collaborator

No, free() is not asynchronous. You very likely have a use-after-free error, or buffer overflow, in your code, so that the memory structures that UMM uses to track blocks gets corrupted, and it causes a crash later on when UMM is being called to free() something..

This is not the forum to debug this, instead please use https://esp8266.com or https://gitter.im/esp8266/Arduino.

@mamama1
Copy link
Author

mamama1 commented Feb 17, 2020

Thanks for your response. I'm afraid this is related to LWIP2 as the issue does not occur when I switch to LWIP1 (just tested that, because in one of the stack traces which I decoded there was only LWIP stuff going on, no sign of any of my code anywhere).

I'm continuously monitoring free heap and the number of queued tasks, around 30k of heap was free all the time.

Currently, with LWIP1, im flooding the ESP with 8 WebSocket requests every 25ms(!) without any crashes. Heap decreases to about 26k and recovers to 30k as soon as I close the WebSocket client (which is a JavaScript client in the browser). Max. 1-2 tasks backlog builds up so the ESP can keep up processing the queued tasks without any issues.

With LWIP2, the ESP crashes within seconds. The stack trace differs from one crash to the other, as mentioned most of the time it is exception 3 but sometimes it is exception 28 but only when the thing locked up and the watchdog had to kick in and reset it.

For me the solution for now is to use "PIO_FRAMEWORK_ARDUINO_LWIP_HIGHER_BANDWIDTH" instead of "PIO_FRAMEWORK_ARDUINO_LWIP2_LOW_MEMORY" but I'd be glad to help tracking this issue down, if you are interested.

Let me know if you think that this could still be related with my code, even if LWIP1 fixes it for me. In that case I'd agree that it wouldn't make any sense to discuss it further at this place.

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 18, 2020

@mamama1
Can you try to use scheduled functions from inside async web server callbacks ?
I'd try with both a regular "scheduled function" and a "recurrent schedule function" returning false, like this:

void xx::mycallback ()
{
 this->domystuff();
}

to:

#include <Schedule.h>
void xx::mycallback ()
{
   schedule_function([&]() { this->domystuff(); });
   // or: schedule_recurrent_function_us([&]() { this->domystuff(); return false; }, 0);
}

That way, free() wouldn't be called in ISR context nor lwIP callback.

@mamama1
Copy link
Author

mamama1 commented Feb 18, 2020

@d-a-v
my free() runs in a function processTasks() which is called solely from the main loop(), am I mistaken, that this isn't neither ISR context nor lwIP callback?

in the asyncwebserver callback (which is isr context, I believe?) I queue the task which is then processed in the processTasks() function like mentioned before...

Am I missing something?

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 18, 2020

I think you got the interest of a scheduled function: defer the execution of a particular function so it is executed in the main context.
Then why not trying, in the asyncwebserver callback, to use a scheduled function to ("domystuff()"=) create/queue the new task?

@mamama1
Copy link
Author

mamama1 commented Feb 18, 2020

I will have to look up what Schedule.h and a scheduled function actually is, never heard about that. Also, I'm not sure I 100% understand what you mean (I'm not english native). My callback looks like this (sorry, forgot to post it earlier). so you saying using a scheduled function for WebSocketServer::AddTask(eWebSocketCommand::GET_WIFI, client->id(), TempData, len); would make a difference? Within AddTask only the malloc happens...

void WebSocketServer::EventHandler(AsyncWebSocket *socket, AsyncWebSocketClient *client, AwsEventType type, void *arg, uint8_t *data, size_t len)
{
	StaticJsonBuffer<500> jsonBufferRequest;
	JsonObject& root = jsonBufferRequest.parseObject((char *) data);

	if (!root.success()) 
	{
		return;
	}
	else
	{
		const char *cmd = root["cmd"];

		if (cmd) 
		{
			if (strcmp(cmd, "getwifi") == 0)
			{
				WebSocketServer::AddTask(eWebSocketCommand::GET_WIFI, client->id(), TempData, len);
				return;
			}
			...
		}
	}
}

Thanks for helping!

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 18, 2020

like free(), malloc() is not async. That may not be the issue anyway.
As you know we are not in an RTOS and context from ISR and lwIP callback should not execute complex/system or long(json parsing) functions. That's why we are using scheduled functions on nonos-SDK.
I reckon there's not much documentation about them (or am I mistaken?)

Also, I don't think lwip2 is the real issue (I am not the most credible one on that).
lwip2 is more complex than lwip1 (does more malloc/free because it is an adaptation layer). It may not cope well with frequent unprotected requests. The fact it is working with lwip1 might just be luck. Schedule functions might keep things more fluent for both of them.

void WebSocketServer::EventHandler(AsyncWebSocket *socket, AsyncWebSocketClient *client, AwsEventType type, void *arg, uint8_t *data, size_t len)
{
        schedule_function([socket, client, type, arg, data, len]() // <- not sure about the syntax
{
	StaticJsonBuffer<500> jsonBufferRequest;
	JsonObject& root = jsonBufferRequest.parseObject((char *) data);

	if (!root.success()) 
	{
		return;
	}
	else
	{
		const char *cmd = root["cmd"];

		if (cmd) 
		{
			if (strcmp(cmd, "getwifi") == 0)
			{
				WebSocketServer::AddTask(eWebSocketCommand::GET_WIFI, client->id(), TempData, len);
				return;
			}
			...
		}
	}
});
}

@mamama1
Copy link
Author

mamama1 commented Feb 18, 2020

I'll try that, thanks! In general I'm aware of the fact that one shouldn't do any lengthy tasks in the system context. I was just completely stunned because I had the code like it is for months, while working on other parts of the firmware and I had never ever issues with all the JSON parse stuff (I'm aware that the JSON stuff takes quite some time), even when completely flooding the device with requests.

The issues began as soon as I started to not complete the whole task directly in the callback (oddly that worked perfectly fine but was ugly and kind of spaghetti code) but when I started to queue a task using AddTask where a malloc() happens and later during processTasks() where the free() happens.
Oddly, the crashes stopped immediately when I stopped to free() the allocated buffer. What I absolutely do not understand is, that free() does NOT happen in the system context but during execution of the main loop(). At least that is my understanding of things.

So no manipulation within AddTask() (which runs in system context, to my understanding) changed anything regarding the crashes. Only manipulation within processTasks() changed whether the ESP would crash or not.

I'll definitely shorten down the stuff which runs during the callback, I'll also try to schedule it like you showed me, that's for sure. Already just because of the fact that it is good practice anyway and should have be done from the beginning. Still I can't wrap my head around why the thing stopped crashing when I stopped freeing the buffer in the user context... (also, as I said, it didn't crash always but only sometimes, so it was not a "hard" error).

edit: I've heard for the first time about scheduled functions, I guess documentation is really sparse on that..

@d-a-v d-a-v mentioned this issue Feb 19, 2020
6 tasks
@TD-er
Copy link
Contributor

TD-er commented Feb 19, 2020

Oddly, the crashes stopped immediately when I stopped to free() the allocated buffer. What I absolutely do not understand is, that free() does NOT happen in the system context but during execution of the main loop(). At least that is my understanding of things.

Sounds a bit like something that's still processing your data did not make a copy of it, but just stored the pointer to it.
If data is freed, its allocated memory is marked free. But that doesn't mean it is wiped and re-allocated for different purposes.
So if you still have a pointer to that data, it might be the data itself is still there until that memory is being used for something else.
It will be somewhat random how long it takes for that data to be corrupted or occupied by some other kind of object which will throw an exception when you try to write to it.

@philbowles
Copy link

@mamama1 You might want to look at my H4 library which does pretty much what your code does (i.e. its a main loop scheduler / timer) , but for all async tasks, from anywhere, not just Asyncwebserver. It's also stdlib-based and has proven pointer- and crash-free for well over a year now: https://github.com/philbowles/h4plugins

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants