Attempt timeouts causing non-retryable AbortedExceptions #1684

Closed
isaac-jordan opened this issue Mar 4, 2020 · 11 comments
Labels
bug This issue is a bug.

Comments

@isaac-jordan

Expected Behavior

I would expect an ApiCallAttemptTimeoutException to be thrown.

Current Behavior

After setting a custom value for ClientOverrideConfiguration.builder().apiCallAttemptTimeout on a DynamoDB client, I began to see occasional SdkClientException errors with the message Unable to unmarshall response (null). Response Code: 200, Response Text: OK and a cause of AbortedException.

These exceptions are not retried by the SDK, causing my calls to fail completely without following my retry policy (as they would if an ApiCallAttemptTimeoutException were thrown). This is causing failures for my service.

Logs showing the stack trace internal to the SDK are available here

Possible Solutions

  • Catch AbortedExceptions that are caused by the thread being interrupted when an API call attempt times out and re-throw an ApiCallAttemptTimeoutException.
  • Or, add AbortedException as a retryable exception in SdkDefaultRetrySetting

Steps to Reproduce (for bugs)

This Kotlin snippet reproduces it for me most of the time: it keeps retrying until the SDK decides to stop, often because of an AbortedException. If this doesn't work for you, try tweaking the apiCallAttemptTimeout until you see an AbortedException.

Note that I can also reproduce this more reliably by setting a very low timeout (e.g. 2ms). That's definitely a more artificial test, but it should also be fixed.

    val retryPolicy = RetryPolicy.defaultRetryPolicy().toBuilder()
        .numRetries(30)
        .retryCondition(OrRetryCondition.create(
            RetryCondition.defaultRetryCondition(),
            RetryCondition { c: RetryPolicyContext ->
                c.exception().printStackTrace() // Print the error for each failure to make retries obvious
                println("Retries attempted: ${c.retriesAttempted()}")
                false
            }
        ))
        .build()

    val client = DynamoDbClient.builder()
        .overrideConfiguration {
            it.retryPolicy(retryPolicy)

            // Should retry up to 30 times, but often exits early due to an AbortedException
            // (depends on your luck and internet connection). Observed failure:
            // software.amazon.awssdk.core.exception.SdkClientException:
            // Unable to unmarshall response (null). Response Code: 200, Response Text: OK
            it.apiCallAttemptTimeout(Duration.ofMillis(100))
        }.build()

    client.listTables()

Context

This is causing regular 500s in my service because the SdkClientExceptions are not retried, despite being a scenario that should be retried (an attempt timeout).

I'm considering adding a custom RetryCondition to check for SdkClientException caused by AbortedException, but let me know if the real fix is on the way (or if this suggested workaround isn't appropriate for some reason).
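The core of that workaround is a cause-chain check, which can be sketched without the SDK at all. The following is a minimal, hypothetical sketch: the AbortedException class below is a plain stand-in for software.amazon.awssdk.core.exception.AbortedException, and hasCauseNamed is an illustrative helper, not an SDK API. In a real client, the same check would run inside a custom RetryCondition combined with the default condition via OrRetryCondition, as in the Kotlin snippet above.

```java
// Hypothetical sketch of the workaround's core logic: walk an exception's
// cause chain and treat the failure as retryable if an AbortedException
// appears anywhere in the chain. AbortedException here is a stand-in class,
// not the real software.amazon.awssdk.core.exception.AbortedException.
public final class AbortedCauseCheck {

    // Stand-in for the SDK's AbortedException, so this sketch is self-contained.
    static class AbortedException extends RuntimeException {
        AbortedException(String message) { super(message); }
    }

    // Returns true if t, or any of its causes, has the given simple class name.
    static boolean hasCauseNamed(Throwable t, String simpleName) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c.getClass().getSimpleName().equals(simpleName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Simulate the failure shape described above: a client-level exception
        // ("Unable to unmarshall response") caused by an AbortedException.
        RuntimeException failure = new RuntimeException(
                "Unable to unmarshall response (null). Response Code: 200, Response Text: OK",
                new AbortedException("thread interrupted by attempt timeout"));

        System.out.println("retryable: " + hasCauseNamed(failure, "AbortedException"));
        // prints "retryable: true"
        System.out.println("retryable: " + hasCauseNamed(new RuntimeException("other"), "AbortedException"));
        // prints "retryable: false"
    }
}
```

Matching on the class name (rather than the class itself) is just to keep the sketch dependency-free; with the SDK on the classpath you would use an instanceof check against the real AbortedException instead.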

Your Environment

  • AWS Java SDK version used: 2.10.77
  • JDK version used: 8
  • Operating System and version: Occurring in a Java Lambda in prod, reproducible locally on OSX 10.13.6
@debora-ito debora-ito self-assigned this Mar 10, 2020
@debora-ito debora-ito added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Mar 10, 2020
@isaac-jordan
Author

Bump! Is there any update on this @debora-ito?

I added a workaround via a custom RetryCondition, but it would be great if this could be fixed at the root cause.

@debora-ito
Member

Hi @isaac-jordan sorry for the long pause, I'll take a look at this today or tomorrow.

@debora-ito
Member

I understand the issue, but I'm having trouble reproducing it. Tweaking the timeout to a very low value isn't working in my tests; maybe I need to build a bulkier test case. Still investigating.

gimki pushed a commit to aws-cloudformation/aws-cloudformation-resource-providers-codeguru-profiler that referenced this issue Mar 31, 2020
An issue has been raised on aws/aws-sdk-java-v2#1684
reporting that the AWS SDK client returns a non-retryable AbortedException
when an API call attempt times out.

This change adds a new retry condition so that the client will now retry
when an AbortedException is returned.
@debora-ito
Member

debora-ito commented Apr 2, 2020

Hi @isaac-jordan, can you provide the full stack trace of the Unable to unmarshall response error you are seeing? I'm wondering whether it is correlated to the ApiCallAttemptTimeout or not.

I was not able to reproduce.
I used apiCallAttemptTimeout(Duration.ofMillis(100)) and when I called listTables I see the retries and the ApiCallAttemptTimeoutException:

2020-04-02 16:10:19 [main] DEBUG software.amazon.awssdk.request:84 - Sending Request: DefaultSdkHttpFullRequest(httpMethod=POST, protocol=https, host=dynamodb.us-west-2.amazonaws.com, encodedPath=/, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, User-Agent, X-Amz-Target], queryParameters=[])
2020-04-02 16:10:19 [main] DEBUG software.amazon.awssdk.request:84 - Retryable error detected. Will retry in 6ms. Request attempt number 2
2020-04-02 16:10:19 [main] DEBUG software.amazon.awssdk.request:84 - Retrying Request: DefaultSdkHttpFullRequest(httpMethod=POST, protocol=https, host=dynamodb.us-west-2.amazonaws.com, encodedPath=/, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, User-Agent, X-Amz-Target], queryParameters=[])
2020-04-02 16:10:19 [main] DEBUG software.amazon.awssdk.request:84 - Retryable error detected. Will retry in 46ms. Request attempt number 3
2020-04-02 16:10:19 [main] DEBUG software.amazon.awssdk.request:84 - Retrying Request: DefaultSdkHttpFullRequest(httpMethod=POST, protocol=https, host=dynamodb.us-west-2.amazonaws.com, encodedPath=/, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, User-Agent, X-Amz-Target], queryParameters=[])
2020-04-02 16:10:19 [main] DEBUG software.amazon.awssdk.request:84 - Retryable error detected. Will retry in 144ms. Request attempt number 4
2020-04-02 16:10:20 [main] DEBUG software.amazon.awssdk.request:84 - Retrying Request: DefaultSdkHttpFullRequest(httpMethod=POST, protocol=https, host=dynamodb.us-west-2.amazonaws.com, encodedPath=/, headers=[amz-sdk-invocation-id, Content-Length, Content-Type, User-Agent, X-Amz-Target], queryParameters=[])

Exception in thread "main" software.amazon.awssdk.core.exception.ApiCallAttemptTimeoutException: HTTP request execution did not complete before the specified timeout configuration: 100 millis
	at software.amazon.awssdk.core.exception.ApiCallAttemptTimeoutException$BuilderImpl.build(ApiCallAttemptTimeoutException.java:88)
	at software.amazon.awssdk.core.exception.ApiCallAttemptTimeoutException.create(ApiCallAttemptTimeoutException.java:38)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.generateApiCallAttemptTimeoutException(TimeoutExceptionHandlingStage.java:143)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.handleTimeoutCausedException(TimeoutExceptionHandlingStage.java:111)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.translatePipelineException(TimeoutExceptionHandlingStage.java:94)
	at software.amazon.awssdk.core.internal.http.pipeline.stages.TimeoutExceptionHandlingStage.execute(TimeoutExceptionHandlingStage.java:79)
...

Edit: nevermind, I see you already provided the logs.

@debora-ito debora-ito added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Apr 2, 2020
@zoewangg
Contributor

Hi @isaac-jordan, thank you for reporting the issue! The SDK should check for AbortedException at earlier stages and throw an ApiCallAttemptTimeoutException if the request is aborted due to the API call attempt timing out. Marking the issue as a bug.

@zoewangg zoewangg added bug This issue is a bug. and removed investigating This issue is being investigated and/or work is in progress to resolve the issue. response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. labels Apr 15, 2020
@isaac-jordan
Author

Thanks @zoewangg! We're still using our workaround with a custom retry condition, and haven't had issues since. It would obviously be nice to have it fixed for everyone, though :)

@debora-ito debora-ito removed their assignment Apr 16, 2020
@joesustaric

Hey @isaac-jordan - We've also encountered this exception using this library.
What was the workaround that you implemented?

You've said you're using a custom retry condition. Are you specifically catching the SdkClientException error and checking the message to perform a retry?

@nickdk

nickdk commented Nov 20, 2020

We're seeing the same issue after changing apiCallAttemptTimeout to 1s. It's a low-occurrence error given our volume, but it does happen regularly.

It would be nice to have an update, @zoewangg; we're also interested in @isaac-jordan's workaround.

@cenedhryn
Contributor

@nickdk We haven't gotten to this issue yet but thanks for reporting that you're also affected. Number of occurrences is a factor in prioritizing our bug fixes.

@debora-ito
Member

This issue was fixed via #2220 and was released in 2.15.58. Please try it out and let us know if you are still experiencing this.

@github-actions

github-actions bot commented Aug 9, 2021

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
