-
Notifications
You must be signed in to change notification settings - Fork 90
Improve handling of messages with same MessageGroupId in SQS Batch processor #1140
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hey @MartinMitro thanks for raising this! Looking to the python powertools for inspiration, I see:
The user then picks either the I believe this would solve your problem and maintain consistency with the other powertools implementations. In your example, the failed update to entity A would cause the batch to be returned as unprocessed, and the update to entity B would not be passed to the handler until A had succeeded. What do you think? |
Hey @scottgerring, thanks for the response. If I understand correctly, Lambda can receive messages with different MessageGroupIds. If the implementation would fail on the first message, regardless of message group id, that would just postpone processing of not related message. That could repeat also multiple times - maybe causing messages to be sent to DLQ or be discarded. I would therefore suggest to fail only messages with same MessageGroupId and process the rest. I look forward to hearing from you. |
Hey @MartinMitro, I've dug into this a bit deeper to try and understand how python powertools arrived at its implementation. The initial PR adding support for the feature links to the SQS documentation page Implementing partial batch responses which advises the following:
Do you think this will be a workable solution in your case? |
Hey @scottgerring, let me verify that statement with Lambda / SQS team. Once I have response I will come back to you. Thank you |
Hey @MartinMitro , i've started sketching out a fix for this in #1183 . I have also tried to reach out to the SQS team to determine the correct way of handling this situation. |
Hey @scottgerring, I've received feedback to my support case and it seems lambda can receive only one message group id per batch. It references this documentation: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#events-sqs-scaling So in fact python implementation is correct and we can short circuit rest of the processing if one message fails. |
Hey @MartinMitro - great - and reassuring to see they've come to the same documentation. I can change the implementation in the PR to do this easily enough. I've been hunting internally for some more canonical wording here but have come up empty handed - if you don't mind forwarding me the messaging from support ( gerrings at amazon ) it might help scratch that itch! |
This is now released under 1.16.1 version! |
Key information
Summary
Once the first message of Batch fails and we know it will be returned back to queue, rest of the messages with same message group id should no be processed
Motivation
We want to be complaint with the FIFO SQS principle and process messages within same group in order.
Proposal
If the processing of the batch message fails with retryable exception and message attributes contain message group id, then the id should be stored and subsequent message of batch should be checked against it. If there is a match, message processing should be skipped and message also returned to queue.
Drawbacks
As in FIFO queues, next messages of the same message group will be processed only when the failing was in removed from queue.
Rationale and alternatives
What other designs have been considered? Why not them?
What is the impact of not doing this?
Messages are not processed in order. Example: Lets say we have entity, which with id 1, and that ID is also message group ID. Now we have stack of updates in FIFO queue, first one updates value of entity to A, second one to value B. First one fails lets say on connection problem and will be retried later, second one succeeds and value is updated to B. Once the visibility timeout of first message expires, it will be processed again and value of entity with id 1 will be updated to A. Entity ends with invalid state.
Unresolved questions
The text was updated successfully, but these errors were encountered: