Sending SQS message using async client sometimes never completes the Future #1207


Closed
ochrons opened this issue Apr 12, 2019 · 6 comments · Fixed by #1217
Labels
bug This issue is a bug. investigating This issue is being investigated and/or work is in progress to resolve the issue.

Comments

@ochrons

ochrons commented Apr 12, 2019

Expected Behavior

Sending messages with SqsAsyncClient.send(req) should return a Future that eventually completes.

Current Behavior

Sending thousands of SQS messages in quick succession triggers a very rare case where a single send(req) never completes the Future. There is no result and no error; the request just disappears.

Possible Solution

There have been Netty-related issues in this space before, so I would look there first. With a previous version of the SDK (2.4.14) this was more common than in the 2.5 version.

Steps to Reproduce (for bugs)

Send a few hundred thousand SQS messages using the SqsAsyncClient.send method, race each returned Future against a Future that times out (say, after 10 seconds), and see which one completes first.

A Scala example for retrying the send after a timeout:

  // Race the SDK future against an application-level timeout and retry on timeout.
  private def sendRetry(request: SendMessageRequest, retryCount: Int = 3): Future[SendMessageResponse] = {
    val res     = sqs.sendMessage(request).toScala
    // APIErrorJVM.delayFuture is an application helper that completes with the given
    // Failure after the given delay (here: a TimeoutException after 10 seconds).
    val timeout = APIErrorJVM.delayFuture[SendMessageResponse](Failure(new TimeoutException()), 10.seconds)
    Future.firstCompletedOf(List(res, timeout)) recoverWith {
      case _: TimeoutException if retryCount > 0 =>
        log.error(s"Timeout while sending message $request, retry count = $retryCount")
        sendRetry(request, retryCount - 1)
    }
  }

Context

SQS is used as the ground truth in our application, so if sending SQS messages fails invisibly, the whole application logic is in jeopardy. We had to add an application-level timeout around the SDK call to work around this.

Your Environment

  • AWS Java SDK version used: 2.5.25
  • JDK version used: 1.8.0_172
  • Operating System and version: Linux in AWS
@zoewangg
Contributor

Thank you for reporting!

I think commit 066e65d (released in 2.5.0) might reduce the occurrences of the issue, but it looks like there are more cases that could leave the future uncompletable. We will investigate.

As a side note, the SDK supports timeout features out of the box; see https://github.com/aws/aws-sdk-java-v2/blob/master/docs/BestPractices.md#utilize-timeout-configurations
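
For example, those client-level timeouts can be set when building the async client. A minimal sketch (the durations below are arbitrary placeholders, not recommended values):

import java.time.Duration;
import software.amazon.awssdk.core.client.config.ClientOverrideConfiguration;
import software.amazon.awssdk.services.sqs.SqsAsyncClient;

// Bound every call so the returned future eventually completes exceptionally
// instead of hanging indefinitely.
SqsAsyncClient sqs = SqsAsyncClient.builder()
        .overrideConfiguration(ClientOverrideConfiguration.builder()
                // Upper bound for the entire API call, including retries.
                .apiCallTimeout(Duration.ofSeconds(30))
                // Upper bound for each individual HTTP attempt.
                .apiCallAttemptTimeout(Duration.ofSeconds(10))
                .build())
        .build();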

@ochrons
Author

ochrons commented Apr 14, 2019

Actually, my comment about 2.4.14 was incorrect; there was another issue that got fixed in 2.5 (received messages were rarely left "in flight" but never delivered to the application for processing). So I cannot say for sure whether this non-completing call behavior changed from 2.4 to 2.5.

@spfink spfink added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Apr 15, 2019
@millems millems added the bug This issue is a bug. label Apr 17, 2019
@millems
Contributor

millems commented Apr 17, 2019

I've been able to reproduce this issue...

@millems
Contributor

millems commented Apr 17, 2019

Reproduction code:

/*
 * Copyright 2010-2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
 *
 * Licensed under the Apache License, Version 2.0 (the "License").
 * You may not use this file except in compliance with the License.
 * A copy of the License is located at
 *
 *  http://aws.amazon.com/apache2.0
 *
 * or in the "license" file accompanying this file. This file is distributed
 * on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
 * express or implied. See the License for the specific language governing
 * permissions and limitations under the License.
 */

package software.amazon.awssdk.services.sqs;

import java.time.Duration;
import java.time.Instant;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import org.junit.Test;
import software.amazon.awssdk.services.sqs.model.CreateQueueResponse;

public class Issue1207 {
    @Test
    public void test() throws InterruptedException {
        try (SqsAsyncClient client = SqsAsyncClient.create()) {
            String queueName = UUID.randomUUID().toString();
            CreateQueueResponse queue = client.createQueue(r -> r.queueName(queueName)).join();
            try {
                loadTest(client, queue.queueUrl());
            } finally {
                client.deleteQueue(r -> r.queueUrl(queue.queueUrl())).join();
            }
        }
    }

    private void loadTest(SqsAsyncClient client, String queueUrl) throws InterruptedException {
        int concurrentRequests = 100;
        Semaphore concurrencySemaphore = new Semaphore(concurrentRequests);
        Instant endTime = Instant.now().plusSeconds(60);

        System.out.println("Starting...");

        Executors.newSingleThreadExecutor().submit(() -> {
            try {
                while (true) {
                    long timeLeft = Duration.between(Instant.now(), endTime).getSeconds();
                    System.out.println("Seconds left in test: " + timeLeft + ", Open permits: " + concurrencySemaphore.availablePermits());
                    Thread.sleep(5_000);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        while (endTime.isAfter(Instant.now())) {
            concurrencySemaphore.acquire(1);

            client.sendMessage(r -> r.queueUrl(queueUrl).messageBody("{}"))
                  .whenComplete((r, t) -> {
                      if (t != null) {
                          t.printStackTrace();
                      }
                      concurrencySemaphore.release(1);
                  });
        }

        System.out.println("Spinning down...");

        if (!concurrencySemaphore.tryAcquire(concurrentRequests, 30, TimeUnit.SECONDS)) {
            int missingResponses = concurrentRequests - concurrencySemaphore.availablePermits();
            throw new IllegalStateException(missingResponses + " requests didn't complete.");
        }
    }
}

@millems
Contributor

millems commented Apr 19, 2019

This was a tricky one.

It looks like in some rare edge cases, the connection we acquire from the connection pool isn't active, and the health checks at the Netty level don't catch it for us. Fixing the Netty-level health check (it looks like it's broken?) improves things slightly, but the problem still happened occasionally when the connection was closed between us acquiring it from the pool and attaching the handlers that monitor for the close.

I've moved the health check fully up the stack, to after we've added our connection-close monitors, and that seems to have fixed the problem.

I'll be running some longer-term tests to make sure it's definitely licked before putting out a PR.
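
For anyone following along, the general shape of the fix, expressed with plain Netty types, looks roughly like the sketch below. This is illustrative only, not the SDK's actual code; the helper name and retry count are made up.

import io.netty.channel.Channel;
import io.netty.channel.pool.ChannelPool;
import io.netty.util.concurrent.Future;

// Illustrative sketch only -- not the SDK's real implementation.
// Acquire a channel, attach the close monitor first, then verify the channel is
// still active; if it was closed in between, release it and try another one.
Channel acquireHealthyChannel(ChannelPool pool, int attemptsLeft) throws Exception {
    Future<Channel> acquireFuture = pool.acquire();
    Channel channel = acquireFuture.sync().getNow();

    // Attach the close listener before trusting the channel, so any close from
    // this point on is observed instead of silently dropping the write.
    channel.closeFuture().addListener(f -> {
        // In the real client this is where the in-flight response future
        // would be completed exceptionally.
    });

    if (channel.isActive()) {
        return channel;
    }

    // The service closed the connection between pool acquisition and handler
    // attachment; hand the dead channel back and retry with a fresh one.
    pool.release(channel);
    if (attemptsLeft > 0) {
        return acquireHealthyChannel(pool, attemptsLeft - 1);
    }
    throw new IllegalStateException("Could not acquire an active channel");
}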

millems added a commit that referenced this issue Apr 19, 2019
…pleted.

If a service closes a connection between when a channel is acquired and handlers are attached, channel writes could disappear and the response future would never be completed. This change introduces health checks and retries for channel acquisition to fix the majority of cases without failing requests, as well as one last check after handlers are added to ensure the channel hasn't been closed since the channel pool health check. Fixes #1207.
@millems
Contributor

millems commented Apr 19, 2019

A fix will go out for this on Monday's release. Please reopen this issue if you're still seeing the problem at that time. Our tests are no longer able to reproduce it after this change.

aws-sdk-java-automation pushed a commit that referenced this issue Feb 26, 2021
…97697f093

Pull request: release <- staging/5efdb921-642c-46b1-bada-f7b97697f093