Skip to content

Feature request: Optimize Circuit-Breaking for SQS FIFO Partial Processor on records with different group IDs #2981

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 tasks done
rubenfonseca opened this issue Aug 21, 2023 Discussed in #2936 · 10 comments · Fixed by #3954
Closed
2 tasks done
Assignees
Labels
batch Batch processing utility feature New feature or functionality

Comments

@rubenfonseca
Copy link
Contributor

Use case

Discussed in #2936

We confirmed that SQS FIFO queues can deliver messages from different Group IDs to the same Lambda invocation. Right now, when a record fails processing, we short-circuit an fail the rest of the items on the invocation, regardless if they are from the same message group ID or not.

This un-optimal experience can lead to unintended messages landing on DLQs or failing completely.

We need to explore the idea of continuing to process other group IDs on the same invocation.

Solution/User Experience

When a message fails processing on the SQS FIFO resolver, collect the remaining messages from the same group ID. When a new group ID is found, start processing again.

At the end, return all the failed messages from each message group ID.

Alternative solutions

Depending how the implementation and behaviour turns out, we might consider not making the new behavior the default one, and keep it behind a feature flag.

Acknowledgement

From the discussion

Originally posted by duc00 August 8, 2023
Hello,

According to Implementing partial batch responses AWS doc:

If you're using this feature with a FIFO queue, your function should stop processing messages after the first failure and return all failed and unprocessed messages in batchItemFailures. This helps preserve the ordering of messages in your queue.

I see that Powertools implementation of the doc, with SqsFifoPartialProcessor, is strictly following this recommendation. This question is thus more specific to AWS implementation of partial batch responses with FIFO. Posting the question on this repo seems to be a good entry point nonetheless, since both AWS developers and community interact on those subjects.

My problem is the following:
I am processing a SQS FIFO queue with SqsFifoPartialProcessor. My batch size is 10. Since the queue is not high-scale, the batch often contains messages with different message group IDs. When a failure occurs, the rest of the processing is stopped and all records left are returned as failures to the queue. So I often end-up with valid records in my dead-letter queue just because they were processed in a batch containing an unrelated invalid one.

My question:
Would it be valid, after a failure, to return all remaining records in the batch but only the ones with the same group ID? According to the doc, current implementation is recommended to preserve the ordering of messages in your queue. I am not seeing why processing other records with a different group ID would go against that. Thus my question.

Many thanks!

@rubenfonseca
Copy link
Contributor Author

Here's our plan:

  • we'll write some sample code to try to emulate the behavior, to make sure we can test it
  • we'll try and test to continue to process other message group ids, to see if there isn't any bad consequences
  • we'll assess how this may or may not be a breaking change, and evaluate if this should be the default behavior or behind a feature flag

@rubenfonseca rubenfonseca moved this from Triage to Backlog in Powertools for AWS Lambda (Python) Aug 21, 2023
@rubenfonseca rubenfonseca added feature New feature or functionality batch Batch processing utility labels Aug 21, 2023
@leandrodamascena
Copy link
Contributor

Hello @duc00! I'm adding this issue to our next iteration (09/09 - 22/09) so we have time to validate scenarios and decide the way forward.

Thank you for your patience, but this is a complex case and may involve a breaking change in the current behavior.

@leandrodamascena
Copy link
Contributor

Hello everybody! Updating this issue after some time. We continued working on this and we opened an internal ticket to the service team in order to fully understand this issue, as we found some more data that leaves me confused.

As soon as we receive a response to this ticket, I will update this issue and we will work on it.

Thank you, everyone, for your patience.

@leandrodamascena leandrodamascena self-assigned this Dec 27, 2023
@leandrodamascena leandrodamascena moved this from Backlog to On hold in Powertools for AWS Lambda (Python) Dec 27, 2023
@leandrodamascena
Copy link
Contributor

Hello everybody! After a long time without an update, we finally have a direction for this PR! We have received confirmation from the SQS team that a batch can be delivered to Lambda with different group ids.

When we implemented this specialized batch for FIFO, we thought that each MessageGroupId would be delivered in a single Lambda execution, as the documentation leads us to understand this:

https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-scaling
For FIFO queues, Lambda sends messages to your function in the order that it receives them. When you send a message to a FIFO queue, you specify a message group ID. Amazon SQS ensures that messages in the same group are delivered to Lambda in order. Lambda sorts the messages into groups and sends only one batch at a time for a group. If your function returns an error, the function attempts all retries on the affected messages before Lambda receives additional messages from the same group.

Given all the information we have, our next steps will be:

1 - Add a flag to SqsFifoPartialProcessor so that the customer can intentionally choose whether to stop processing at the first error or whether to stop processing just that MessageGroupId and process the others.
2 - The default behavior will be the current one, that is, we will return the entire batch as soon as we find a failure.
3 - Update our documentation to make Batch behavior in FIFO queue executions clearer.
4 - Talk to the Lambda/SQS documentation team so they can modify the documentation to make this clearer to customers.

I'll work to send a PR until the next week.

Thank you for everyone's patience until we could get the correct information.

@duc00
Copy link
Contributor

duc00 commented Mar 14, 2024

Hey @leandrodamascena, great news, thanks. 😄 Let me know if I can help review the code once it's out.

1 - Add a flag to SqsFifoPartialProcessor so that the customer can intentionally choose whether to stop processing at the first error or whether to stop processing just that MessageGroupId and process the others.

Do you think the flag set to MessageGroupId only could be made default in a future major release? This seems like a sensible default behaviour, once the AWS SQS doc is updated.

@leandrodamascena
Copy link
Contributor

Do you think the flag set to MessageGroupId only could be made default in a future major release? This seems like a sensible default behaviour, once the AWS SQS doc is updated.

Hello @duc00! I think we can wait for customer feedback before deciding whether to make this behavior the default in Powertools v3. However, I'll keep this idea in mind to ensure we don't lose track of it.

The PR is all set for review. If you have any comments or spot any issues, feel free to share them and make comments. Your feedback is always appreciated!

@leandrodamascena leandrodamascena moved this from Working on it to Pending review in Powertools for AWS Lambda (Python) Mar 14, 2024
@github-project-automation github-project-automation bot moved this from Pending review to Coming soon in Powertools for AWS Lambda (Python) Mar 20, 2024
Copy link
Contributor

⚠️COMMENT VISIBILITY WARNING⚠️

This issue is now closed. Please be mindful that future comments are hard for our team to see.

If you need more assistance, please either tag a team member or open a new issue that references this one.

If you wish to keep having a conversation with other community members under this issue feel free to do so.

@github-actions github-actions bot added the pending-release Fix or implementation already in dev waiting to be released label Mar 20, 2024
Copy link
Contributor

This is now released under 2.36.0 version!

@github-actions github-actions bot removed the pending-release Fix or implementation already in dev waiting to be released label Mar 27, 2024
@duc00
Copy link
Contributor

duc00 commented Mar 28, 2024

Awesome! Thank you all 😊

@leandrodamascena
Copy link
Contributor

And the Lambda team has updated the documentation to make it clear that a batch can have more than one MessageGroupID. Thank you so much for bringing this topic up and helping us improve the Powertools experience and AWS documentation, your contribution has been invaluable here @duc00.

https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-scaling

Before

"For FIFO queues, Lambda sends messages to your function in the order that it receives them. When you send a message to a FIFO queue, you specify a message group ID. Amazon SQS ensures that messages in the same group are delivered to Lambda in order. Lambda sorts the messages into groups and sends only one batch at a time for a group. If your function returns an error, the function attempts all retries on the affected messages before Lambda receives additional messages from the same group."

Now

For FIFO queues, Lambda sends messages to your function in the order that it receives them. When you send a message to a FIFO queue, you specify a message group ID. Amazon SQS ensures that messages in the same group are delivered to Lambda in order. When Lambda reads your messages into batches, each batch may contain messages from more than one message group, but the order of the messages is maintained. If your function returns an error, the function attempts all retries on the affected messages before Lambda receives additional messages from the same group.

@leandrodamascena leandrodamascena moved this from Coming soon to Shipped in Powertools for AWS Lambda (Python) Mar 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
batch Batch processing utility feature New feature or functionality
Projects
Status: Shipped
3 participants