Improve handling of messages with same MessageGroupId in SQS Batch processor #1140

MartinMitro · 2023-04-25T07:17:40Z

Key information

RFC PR: (leave this empty)
Related issue(s), if known:
Area: (i.e. Tracer, Metrics, Logger, etc.)
Meet tenets: (Yes/no)

Summary

Once the first message of a batch fails and we know it will be returned to the queue, the remaining messages with the same message group ID should not be processed.

Motivation

We want to be compliant with the SQS FIFO principle and process messages within the same group in order.

Proposal

If processing a message in the batch fails with a retryable exception and the message attributes contain a message group ID, that ID should be stored and each subsequent message in the batch should be checked against it. If there is a match, processing of that message should be skipped and the message should also be returned to the queue.
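A minimal sketch of this proposal, for illustration only. The class `GroupAwareBatchSketch`, its `Message` type, and the `process` method are hypothetical names and not part of Powertools; the sketch also assumes that every `RuntimeException` thrown by the handler counts as retryable.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from Powertools.
class GroupAwareBatchSketch {

    // Minimal stand-in for an SQS record: only the fields the proposal needs.
    static final class Message {
        final String messageId;
        final String messageGroupId; // null for messages without a group ID
        final String body;

        Message(String messageId, String messageGroupId, String body) {
            this.messageId = messageId;
            this.messageGroupId = messageGroupId;
            this.body = body;
        }
    }

    // Processes the batch in order and returns the IDs of the messages
    // that should be returned to the queue.
    static List<String> process(List<Message> batch, Consumer<Message> handler) {
        Set<String> failedGroups = new HashSet<>();
        List<String> returnToQueue = new ArrayList<>();
        for (Message m : batch) {
            // Skip messages whose group already has a failed message in this batch.
            if (m.messageGroupId != null && failedGroups.contains(m.messageGroupId)) {
                returnToQueue.add(m.messageId);
                continue;
            }
            try {
                handler.accept(m);
            } catch (RuntimeException e) {
                // Assumption: every RuntimeException is treated as retryable.
                returnToQueue.add(m.messageId);
                if (m.messageGroupId != null) {
                    failedGroups.add(m.messageGroupId);
                }
            }
        }
        return returnToQueue;
    }
}
```

With a batch of three messages in groups g1, g2, g1, where the first one fails, the handler would only complete for the g2 message; both g1 messages would be returned to the queue.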

Drawbacks

As in FIFO queues generally, subsequent messages of the same message group will be processed only once the failing message has been removed from the queue.

Rationale and alternatives

What other designs have been considered? Why not them?
What is the impact of not doing this?

Messages are not processed in order. Example: Let's say we have an entity with ID 1, and that ID is also the message group ID. Now we have a stack of updates in a FIFO queue: the first one updates the value of the entity to A, the second one to B. The first one fails, say due to a connection problem, and will be retried later; the second one succeeds and the value is updated to B. Once the visibility timeout of the first message expires, it is processed again and the value of the entity with ID 1 is updated to A. The entity ends up in an invalid state.
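The scenario above can be replayed in a few lines. This is only a toy simulation: the class `OutOfOrderSketch`, the in-memory map standing in for the entity store, and the hard-coded failure of the first update are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of the example; not Powertools code.
class OutOfOrderSketch {

    // Replays the two updates for entity 1 and returns the value it ends up with.
    static String replay() {
        Map<Integer, String> entities = new HashMap<>();
        List<String> batch = Arrays.asList("A", "B"); // updates in FIFO order
        List<String> retryLater = new ArrayList<>();

        // First delivery: the update to A fails, the update to B still runs.
        for (String value : batch) {
            if (value.equals("A")) {
                retryLater.add(value); // assumed connection problem on first attempt
                continue;
            }
            entities.put(1, value);
        }

        // After the visibility timeout, the failed update is delivered again.
        for (String value : retryLater) {
            entities.put(1, value);
        }
        return entities.get(1);
    }
}
```

Here `replay()` ends with the older value A overwriting B, which is the invalid state described above.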

Unresolved questions

scottgerring · 2023-05-23T01:16:14Z

Hey @MartinMitro thanks for raising this!

Looking to the Python Powertools for inspiration, I see:

When using SQS FIFO queues, we will stop processing messages after the first failure, and return all failed and unprocessed messages in batchItemFailures. This helps preserve the ordering of messages in your queue.

The user then picks either the BatchProcessor or the SqsFifoPartialProcessor to choose the behavior. As far as I can see, this logic is not yet implemented in Lambda Powertools for Java.

I believe this would solve your problem and maintain consistency with the other Powertools implementations. In your example, the failed update to value A would cause the rest of the batch to be returned as unprocessed, and the update to value B would not be passed to the handler until the update to A had succeeded.

What do you think?

MartinMitro · 2023-05-23T12:10:47Z

Hey @scottgerring,

thanks for the response. If I understand correctly, Lambda can receive messages with different MessageGroupIds. If the implementation failed on the first message regardless of message group ID, that would just postpone processing of unrelated messages. That could also repeat multiple times, possibly causing messages to be sent to the DLQ or discarded. I would therefore suggest failing only messages with the same MessageGroupId and processing the rest.

I look forward to hearing from you.

scottgerring · 2023-05-24T03:07:18Z

Hey @MartinMitro, I've dug into this a bit deeper to try and understand how Python Powertools arrived at its implementation. The initial PR adding support for the feature links to the SQS documentation page "Implementing partial batch responses", which advises the following:

If you're using this feature with a FIFO queue, your function should stop processing messages after the first failure and return all failed and unprocessed messages in batchItemFailures. This helps preserve the ordering of messages in your queue.

Do you think this will be a workable solution in your case?

MartinMitro · 2023-05-31T08:46:30Z

Hey @scottgerring, let me verify that statement with the Lambda / SQS team. Once I have a response, I will come back to you. Thank you.

scottgerring · 2023-06-08T05:23:54Z

Hey @MartinMitro, I've started sketching out a fix for this in #1183. I have also tried to reach out to the SQS team to determine the correct way of handling this situation.

MartinMitro · 2023-06-13T10:45:14Z

Hey @scottgerring, I've received feedback on my support case, and it seems Lambda can receive only one message group ID per batch. It references this documentation: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-scaling

So the Python implementation is in fact correct, and we can short-circuit the rest of the processing if one message fails.
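A rough sketch of that short-circuit behaviour, assuming a simplified message type. `FifoShortCircuitSketch`, `Message`, and `failedItemIds` are hypothetical names and not the Powertools API; in a real handler each returned ID would be reported as an `itemIdentifier` entry in `batchItemFailures`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch; none of these names come from Powertools.
class FifoShortCircuitSketch {

    // Minimal stand-in for an SQS record.
    static final class Message {
        final String messageId;
        final String body;

        Message(String messageId, String body) {
            this.messageId = messageId;
            this.body = body;
        }
    }

    // Processes messages in order. After the first failure, the failed message
    // and every remaining message are reported as failures without being
    // passed to the handler.
    static List<String> failedItemIds(List<Message> batch, Consumer<Message> handler) {
        List<String> failures = new ArrayList<>();
        boolean failed = false;
        for (Message m : batch) {
            if (failed) {
                failures.add(m.messageId); // unprocessed, returned to the queue
                continue;
            }
            try {
                handler.accept(m);
            } catch (RuntimeException e) {
                // Assumption: any RuntimeException marks the message as failed.
                failed = true;
                failures.add(m.messageId);
            }
        }
        return failures;
    }
}
```

For a batch m1, m2, m3 where m2 fails, only m1 is handled; m2 and m3 are both reported back, so their relative order in the queue is preserved.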

scottgerring · 2023-06-14T03:32:35Z

Hey @MartinMitro - great - and reassuring to see they've come to the same documentation. I can change the implementation in the PR to do this easily enough.

I've been hunting internally for some more canonical wording here but have come up empty-handed - if you don't mind forwarding me the messaging from support ( gerrings at amazon ), it might help scratch that itch!

github-actions · 2023-07-19T11:51:06Z

This is now released under 1.16.1 version!

MartinMitro added RFC triage labels Apr 25, 2023

github-project-automation bot added this to AWS Lambda Powertools for Java Apr 25, 2023

scottgerring self-assigned this May 24, 2023

scottgerring mentioned this issue Jun 8, 2023

fix: Handle batch failures in FIFO queues correctly #1183

Merged

6 tasks

scottgerring moved this to Working on it in AWS Lambda Powertools for Java Jun 9, 2023

sthulb added this to Powertools for AWS Lambda (Java) Jun 19, 2023

sthulb moved this to Working on it in Powertools for AWS Lambda (Java) Jun 19, 2023

scottgerring added status/staged-next-release feature-request New feature or request and removed triage RFC labels Jul 12, 2023

scottgerring changed the title ~~RFC: Improve handling of messages with same MessageGroupId in SQS Batch processor~~ Improve handling of messages with same MessageGroupId in SQS Batch processor Jul 13, 2023

jeromevdl moved this from Working on it to Coming soon in Powertools for AWS Lambda (Java) Jul 17, 2023

github-actions bot closed this as completed Jul 19, 2023

github-actions bot removed the status/staged-next-release label Jul 19, 2023

scottgerring moved this from Coming soon to Shipped in Powertools for AWS Lambda (Java) Aug 17, 2023
