
"Batch" of SQS messages above 10 count causes problems during cleanup since boto3.client('sqs').delete_message_batch is limited to 10. #824


Closed
whardier opened this issue Nov 15, 2021 · 8 comments
Labels
bug Something isn't working p1

Comments

@whardier
Contributor

"Batch" of SQS messages above 10 count causes problems during cleanup since boto3.client('sqs').delete_message_batch is limited to 10. There is a hard limit of 10 when utilizing delete_message_batch that is not properly documented in boto3 (whereas other batch methods do show 10). There is also no documentation reflecting a limit of 10 when implementing the sqs batch processor provided in this project. Infact the SQS documentation itself does not mention a limit - https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_DeleteMessageBatch.html but the return from this function when attempting to delete more than 10 messages is inline with errors for other messages when the records are greater than 10 (and those are properly documented).

That said, I don't believe this is a documentation issue - it was simply an oversight, where > 10 was expected to "just work".

What were you trying to accomplish?

I typically limit batches to 10-20 messages for longer-running record processing functions and set my Lambda timeout accordingly. I've toyed with adding a toggle to flush deletions one by one (by processing a single record at a time as a one-item list) for these sorts of tasks, and I think that may be appropriate some day.

I also have (had) a limit of 1000 messages or 30 seconds for tracking data (route changes in a frontend app) that is aggregated and spooled to AWS Timestream. Processing time for 1000 records is typically less than 4 seconds... however, I wasn't properly deleting these messages and had accidentally squashed the exception that would have told me about the limit.

Expected Behavior

delete_message_batch isn't documented as having a 10-entry limit. Ideally it would have deleted more than 10 at a time.

Current Behavior

delete_message_batch correctly raises an exception when more than 10 entries are supplied.

Possible Solution

I've implemented a patch to allow an arbitrary number of records to be spooled to delete_message_batch.

Currently, as a pattern, I limit processing to 10 records at a time (while still allowing for more than 10) when batches can be processed this way without causing an upstream issue. It IS nice that later batches can fail while earlier, successful ones are properly cleaned up.

The documentation and examples may need to be rethought to properly show support for more than 10 records per event.
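
For reference, a minimal sketch of that chunking approach (not the actual patch; the function name and queue_url are placeholders, and the record fields assume the standard SQS Lambda event shape):

import boto3

sqs = boto3.client("sqs")


def delete_processed_messages(queue_url, records, chunk_size=10):
    # DeleteMessageBatch accepts at most 10 entries per call, so spool the
    # processed records through it in chunks of up to 10.
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        entries = [
            {"Id": record["messageId"], "ReceiptHandle": record["receiptHandle"]}
            for record in chunk
        ]
        response = sqs.delete_message_batch(QueueUrl=queue_url, Entries=entries)
        if response.get("Failed"):
            # Don't squash partial failures silently
            raise RuntimeError(f"Failed to delete entries: {response['Failed']}")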

Steps to Reproduce (for bugs)

N/A

Environment

  • aws-lambda-powertools==1.20.2
  • python==3.8
@whardier whardier added bug Something isn't working triage Pending triage from maintainers labels Nov 15, 2021
@whardier
Contributor Author

#818

@heitorlessa
Contributor

hey @whardier - with this new launch we won't need this anymore, and it'll be less error-prone, since deleting a large number of messages could catch customers off guard if their timeouts aren't adjusted accordingly.

https://aws.amazon.com/about-aws/whats-new/2021/11/aws-lambda-partial-batch-response-sqs-event-source/

@whardier
Contributor Author

Actually, there is still an issue at hand: when the batch deletion fails, the entire event breaks.

I am not sure whether AWS Lambda Powertools will instead insist on a different return value for the event, one that includes the failed messages. If so, cool - but that is a breaking change, and the code that deletes processed messages needs to be removed, otherwise 11 processed messages will still error the entire event.

From the docs for reference:

Reporting batch item failures

When your Lambda function encounters an error while processing a batch, all messages in that batch become visible in the queue again by default, including messages that Lambda processed successfully. As a result, your function can end up processing the same message several times.

To avoid reprocessing all messages in a failed batch, you can configure your event source mapping to make only the failed messages visible again. To do this, when configuring your event source mapping, include the value ReportBatchItemFailures in the FunctionResponseTypes list. This lets your function return a partial success, which can help reduce the number of unnecessary retries on records.

Report syntax

After you include ReportBatchItemFailures in your event source mapping configuration, you can return a list of the failed message IDs in your function response. For example, suppose you have a batch of five messages, with message IDs id1, id2, id3, id4, and id5. Your function successfully processes id1, id3, and id5. To make messages id2 and id4 visible again in your queue, your response syntax should look like the following:

{
    "BatchItemFailures": [
        {
            "ItemIdentifier": "id2"
        },
        {
            "ItemIdentifier": "id4"
        }
    ]
}
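
For context, a hand-rolled handler following that response contract might look like the sketch below (assuming ReportBatchItemFailures is enabled on the event source mapping; process_record is a placeholder for the actual business logic):

def lambda_handler(event, context):
    batch_item_failures = []
    for record in event["Records"]:
        try:
            process_record(record)  # placeholder for the real record handling
        except Exception:
            # Report only the failed messages so the successfully processed
            # ones are not made visible (and retried) again
            batch_item_failures.append({"ItemIdentifier": record["messageId"]})
    return {"BatchItemFailures": batch_item_failures}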

@heitorlessa
Contributor

heitorlessa commented Nov 24, 2021 via email

@whardier
Contributor Author

From the docs: "Similar to Resource objects, Session objects are not thread safe and should not be shared across threads and processes. You should create a new Session object for each thread or process"

I've not had a chance to compare performance between concurrent.futures and simple threading via the threading module; I only have experience using the threading module for this. I'm also not sure what approach aws_lambda_powertools_python would like to take as threading works its way into the codebase, or how session setup should be passed around.
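
To illustrate the constraint from the docs, one way to give each worker thread its own Session is sketched below (an illustration only, not a proposal for the library's internals; chunking of entries and error handling are omitted):

import threading
from concurrent.futures import ThreadPoolExecutor

import boto3

# boto3 Sessions are not thread safe, so keep one Session-derived client
# per thread in thread-local storage.
_local = threading.local()


def _sqs_client():
    if not hasattr(_local, "sqs"):
        _local.sqs = boto3.session.Session().client("sqs")
    return _local.sqs


def _delete_chunk(queue_url, entries):
    # entries is a list of up to 10 {"Id": ..., "ReceiptHandle": ...} dicts
    return _sqs_client().delete_message_batch(QueueUrl=queue_url, Entries=entries)


def delete_chunks_in_parallel(queue_url, chunks, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_delete_chunk, queue_url, chunk) for chunk in chunks]
        return [future.result() for future in futures]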

@heitorlessa
Contributor

heitorlessa commented Nov 25, 2021 via email

@heitorlessa heitorlessa added p1 and removed triage Pending triage from maintainers labels Nov 25, 2021
@whardier
Contributor Author

whardier commented Nov 25, 2021 via email

@heitorlessa
Contributor

Forgot to add the correct label to auto-close this two releases ago ;) Thanks again @whardier for reporting it and for kicking off the initial fix that @mploski completed.
