Skip to content

Introduce API to avoid cancel/fail races with CancellableContinuation #830

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
elizarov opened this issue Nov 14, 2018 · 11 comments
Closed

Comments

@elizarov
Copy link
Contributor

elizarov commented Nov 14, 2018

Consider this attempt to integrate coroutines with cancellable callback-based API using suspendCancellableCoroutine. This example is for Retrofit, but the same issue could appear with any cancellable API:

suspend fun <T> Call<T>.await(): T = 
    suspendCancellableCoroutine { cont ->
        enqueue(object : Callback<T> { // install callback
            override fun onResponse(call: Call<T>, response: Response<T>) {
                 // resume normally -- omitted for clarity
            }
    
            override fun onFailure(call: Call<T>, t: Throwable) {
                if (cont.isCancelled) return     // LINE (1) 
                cont.resumeWithException(t)      // LINE (2)
            }
        })
        // cancel Call when continuation is cancelled
        cont.invokeOnCancellation {
            cancel()
        }
    }

Resuming a CancellableContinuation with exception is documented to behave in the following way (see https://kotlin.github.io/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/-cancellable-continuation/index.html):

Invocation of resume or resumeWithException in resumed state produces IllegalStateException. Invocation of resume in cancelled state is ignored (it is a trivial race between resume from the continuation owner and outer job cancellation and cancellation wins). Invocation of resumeWithException in cancelled state triggers exception handling of passed exception.

So, this code attempts to avoid calling resumingWithException on an already cancelled continuation by doing a check in LINE (1). This creates a "check & act" race in between LINE (1) and LINE (2). In particular, if continuation gets cancelled from a different thread when this code executes between LINE (1) and LINE (2), then resumeWithException is invoked on cancelled continuation and triggers handling of "uncaught" exception.

There is no public (non-experimental, non-internal) API at the moment to work around this race.

Workarounds

W1. Use internal API on CancellableContinuation:

Replace LINE (1) and LINE(2) with the following code:

            @UseExperimental(InternalCoroutinesApi::class)
            cont.tryResumeWithException(t)?.let { cont.completeResume(it) }

This workaround ignores exceptions when continuation is cancelled and this is the disadvantage of this workaround at the same time. If a true exception (failure) happens concurrently with cancellation, then it is going to be ignored instead of being handled. Arguably this problem is much less severe, but still.

W2. Use internal API on Job:

  1. Replace suspendCancellableCoroutine with suspendCoroutine.
  2. Replace cont.invokeOnCancellation { ... } with the following code:
        @UseExperimental(InternalCoroutinesApi::class)
        cont.context[Job]?.invokeOnCompletion(onCancelling = true) {
            cancel()
        }
  1. Replace LINE (1) with detection of "Cancelled" state on a Call (to see if the call had failed or was cancelled) and resume continuation with CancellationException in case of cancellation:
            if (call.isCanceled) {
                cont.resumeWithException(CancellationException("Cancelled"))
                return
            }

This workaround correctly handles exceptions that occur concurrently with cancellation (the call is either going to be cancelled or fails and we learn what had happened).

Proposed solutions

There are several possible solutions:

S1. Ignore exceptions when resuming a cancelled continuation using resumeWithException. Technically, this is a breaking change, but that is not a major problem here. The problem is that there could be a genuine exception that needs to be handled and ignoring it in resumeWithException is as bad as any other instance of ignoring an exception.

S2. Introduce new resumeWithExceptionOrIgnoreOnCancel method (name is purely provisional). It is not a breaking change, but it suffers from the same problem of ignoring a genuine exception when it happens concurrently with cancellation. This is not a severe problem, though, and this could still be one of the solutions.

S3. New API for cancellable callbacks with an intermediate cancelling state. The idea is that invocation of cancel() should not immediately result in resumption of the suspended coroutine, but shall wait until resumeWith (value or exception) is invoked. It gives a chance to see whether the operation was indeed cancelled successfully or failed while we trying to cancel it. It requires writing somewhat more code, though, similar to the code described in the second workaround.

  • This can be a new mode operation for suspendCancellableCoroutine.
  • This can be a new top-level function.
@elizarov elizarov changed the title Introduce CancellableContinuation API to avoid cancel/fail races Introduce API to avoid cancel/fail races with CancellableContinuation Nov 14, 2018
@LouisCAD
Copy link
Contributor

I'm not sure I understand the solution *S3resumeWith is not called if the continuation is cancelled and the underlying callback is unregistered and its methods are never called? Or are you talking about resumeWith in another place?

@elizarov
Copy link
Contributor Author

elizarov commented Nov 14, 2018

Let me clarify S3. Right now, when CancellableContinuation is cancelled it immediately resumes the coroutine with CancellationException, so now if the operation we were waiting for had already crashed and resumeWithException is about to be called, it has no choice but to either ignore this exception or invoke uncaught exception handler.

With S3 the idea the we don't immediately resume a cancelled coroutine. Instead, we mark it as "cancelling" and wait until resumeXxx is invoked, so that if resumeWithException is invoked, we can resume the coroutine with this exception, so that we don't have this ugly choice between losing exception or handling it as uncaught one.

@LouisCAD
Copy link
Contributor

LouisCAD commented Nov 14, 2018 via email

@elizarov
Copy link
Contributor Author

@LouisCAD Yes, S3 works only for callback-based API that would still invoke a callback when you cancel. Retrofit and many other APIs are like that, but not all of them, so it cannot become the only solution out there.

@LouisCAD
Copy link
Contributor

LouisCAD commented Nov 14, 2018

@elizarov Intriguing, I didn't know some callback based APIs did that.

I'm wondering how the callback is invoked after call to cancel() on Call<T> in Retrofit2, since I see no mention of the behavior you're talking about in the javadoc. EDIT: there's a test for that.

@Tolriq
Copy link

Tolriq commented Nov 21, 2018

@elizarov Since all my tests and beta works perfectly with W1 I think I'll go to prod with that solution.

Since it's internal, is there any risk that this workaround is no more working at some point before this new API is created, or is the risk low enough to be OK for prod?

@IRus
Copy link
Contributor

IRus commented Nov 23, 2018

@LouisCAD for example in apache http client FutureCallback has cancelled listener.

@Tolriq
Copy link

Tolriq commented Nov 26, 2018

@elizarov So went to prod with https://gist.github.com/Tolriq/dcbdc62c0e29d3b321034e990c3c85ce containing Workaround 1.

While it greatly reduce the problem (Catched exception re thrown) (no more seen in 7K user beta group) it still happens with a way larger active user base.

I then tested to add a CoroutineExceptionHandler in case the issue was somewhere else it did not fix anything.

I then tested Workaround 2 and it did not change anything.

I then stopped relying on Exceptions and encapsulated them in a sealed class

    private suspend fun Call.await(): WorkerResult {
        return suspendCancellableCoroutine { continuation ->
            continuation.invokeOnCancellation {
                cancel()
            }
            enqueue(object : Callback {
                override fun onResponse(call: Call, response: Response) {
                    continuation.resume(WorkerResult.Success(response))
                }

                override fun onFailure(call: Call, e: IOException) {
                    if (call.isCanceled || !isActive) {
                        continuation.resume(WorkerResult.Error(CancellationException("Cancelled")))
                    } else {
                        continuation.resume(WorkerResult.Error(e))
                    }
                }
            })
        }
    }

And problem solved.

So I wonder where else in the code there's a race, maybe somewhere I the worker pool I use, or maybe inside couroutine or channels, but no idea where to search.

@pmhsfelix
Copy link

pmhsfelix commented Nov 28, 2018

My expectation is that most APIs will still call the completion callback on a cancellation scenario, so S3 seems adequate for most cases (@LouisCAD regarding Retrofit2, the documentation is not clear, however there seems to be a test just for that case - https://github.com/square/retrofit/blob/master/retrofit/src/test/java/retrofit2/CallTest.java#L647)

IMO, there should also be a non-internal/experimental way of achieving the behaviour of W2 also for the success case, i.e., ignoring a cancellation if the callback is called with success.

@cuichanghao

This comment has been minimized.

elizarov added a commit that referenced this issue Dec 15, 2018
There seems to be no other way to satisfactory fix the problem of
a race between cancellation on CancellationContinuation and
concurrent failure of the corresponding background process in a
way the distinguishes "failure from cancellation attempt" from
a "true failure" without placing undue burden on anyone who is
implementing `await()` extension function for various future types.

- Added testTimeoutCancellationFailRace that works as a perfect litmus
  test; being allowed to run for a while it realiable detects various
  problems in alternative approaches.
- Optimized CancellableContinuationImpl; merged its code with
  AbstractContinuation class that is not needed as a separate
  class and it removed.
- Deprecated and hidden CancellableContinuation.initCancellability();
  it is now always invoked after the end of the block that was
  passed to suspendCancellableCoroutine.
- Deprecated `holdCancellability` parameter to an internal
  suspendAtomicCancellableCoroutine function.

Fixes #892
Fixes #830
elizarov added a commit that referenced this issue Dec 15, 2018
There seems to be no other way to satisfactory fix the problem of
a race between cancellation on CancellationContinuation and
concurrent failure of the corresponding background process in a
way the distinguishes "failure from cancellation attempt" from
a "true failure" without placing undue burden on anyone who is
implementing `await()` extension function for various future types.

- Added testTimeoutCancellationFailRace that works as a perfect litmus
  test; being allowed to run for a while it realiable detects various
  problems in alternative approaches.
- Optimized CancellableContinuationImpl; merged its code with
  AbstractContinuation class that is not needed as a separate
  class and it removed.
- Deprecated and hidden CancellableContinuation.initCancellability();
  it is now always invoked after the end of the block that was
  passed to suspendCancellableCoroutine.
- Deprecated `holdCancellability` parameter to an internal
  suspendAtomicCancellableCoroutine function.

Fixes #892
Fixes #830
@elizarov
Copy link
Contributor Author

elizarov commented Dec 17, 2018

I've played with various alternative APIs to avoid cancel/fail races in CancellableContinuation and while it is possible to design and implement such API, it is invariable too fragile and hard to use correctly. So, the decision here is to simply ignore a racy exception that comes after cancel for a standard suspendCancellableCoroutine { ... } which automatically makes all existing await() implementations correct and race-free, without needing to apply any workaround.

We might provide a special fine-grained API in the future that would provide ability to detect and process racy exceptions if and only if there are compelling use-cases for such an API. We don't have any at the moment. So, this issue will be closed by PR #896.

elizarov added a commit that referenced this issue Dec 17, 2018
There seems to be no other way to satisfactory fix the problem of
a race between cancellation on CancellationContinuation and
concurrent failure of the corresponding background process in a
way the distinguishes "failure from cancellation attempt" from
a "true failure" without placing undue burden on anyone who is
implementing `await()` extension function for various future types.

- Added testTimeoutCancellationFailRace that works as a perfect litmus
  test; being allowed to run for a while it reliably detects various
  problems in alternative approaches.
- Optimized CancellableContinuationImpl; merged its code with
  AbstractContinuation class that is not needed as a separate
  class and removed.
- Deprecated and hidden CancellableContinuation.initCancellability();
  it is now always invoked after the end of the block that was
  passed to suspendCancellableCoroutine.
- Deprecated `holdCancellability` parameter to an internal
  suspendAtomicCancellableCoroutine function.

Fixes #892
Fixes #830
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants