withContext and withTimeout may discard resources on cancellation #3504

dkhalanskyjb · 2022-10-26T12:52:44Z

val socket = withContext(Dispatchers.IO) {
  Socket("localhost", 8080)
}
socket.use {
  // ...
}

This "obviously correct" code has a bug: if cancellation happens after the last suspension point in the withContext block, the resource will be successfully created, but it will not be released, because withContext will throw a CancellationException after executing the block.
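Here is a minimal, deterministic sketch of this failure mode. It illustrates the behaviour under a few assumptions and is not code from the issue. TrackedResource is a hypothetical stand-in for Socket that only records whether close() was called. The enclosing job cancels itself from inside the block, after the block's last suspension point, to simulate a cancellation that arrives at exactly the wrong moment.

```kotlin
import java.io.Closeable
import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.*

// Hypothetical stand-in for Socket: it only records whether close() was called.
class TrackedResource : Closeable {
    val closed = AtomicBoolean(false)
    override fun close() {
        closed.set(true)
    }
}

// Returns the resource created inside withContext and whether use { } was reached.
fun leakDemo(): Pair<TrackedResource, Boolean> {
    var created: TrackedResource? = null
    val useReached = AtomicBoolean(false)
    runBlocking {
        launch {
            val outer = coroutineContext.job
            val resource = withContext(Dispatchers.Default) {
                val r = TrackedResource()
                created = r
                // Cancellation arrives after the last suspension point;
                // the block itself still completes normally and returns r.
                outer.cancel()
                r
            }
            // Not reached: withContext rethrows the cancellation and drops r.
            resource.use { useReached.set(true) }
        }
    }
    return created!! to useReached.get()
}
```

Calling leakDemo() reports that the use block was never reached and the resource was never closed. The block completed normally, but withContext threw CancellationException and discarded the returned value.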

The same issue can be observed with withTimeout.

The proper way to write the code above is:

var socket: Socket? = null
try {
  withContext(Dispatchers.IO) {
    socket = Socket("localhost", 8080)
  }
} catch (e: Throwable) {
  socket?.close()
  throw e
}
socket.use {
  // ...
}

This is very error-prone, and it's quite confusing why the first version is buggy while the second one is not.
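The workaround can be written once and reused, so it does not have to be repeated at every call site. This is only a sketch: withContextClosingOnFailure and TrackedResource are hypothetical names, not part of the kotlinx.coroutines API. The helper remembers the value the block produced and closes it if withContext throws anyway. Any failure from close() is attached to the original exception as a suppressed exception.

```kotlin
import java.io.Closeable
import java.util.concurrent.atomic.AtomicBoolean
import kotlin.coroutines.CoroutineContext
import kotlinx.coroutines.*

// Hypothetical helper (not kotlinx.coroutines API): like withContext, but if
// withContext throws after the block already produced a resource, that resource
// is closed before the exception is rethrown.
suspend fun <T : Closeable> withContextClosingOnFailure(
    context: CoroutineContext,
    block: suspend CoroutineScope.() -> T,
): T {
    var produced: T? = null
    try {
        return withContext(context) { block().also { produced = it } }
    } catch (e: Throwable) {
        try {
            produced?.close()
        } catch (closeError: Throwable) {
            e.addSuppressed(closeError)
        }
        throw e
    }
}

// Hypothetical stand-in for Socket: it only records whether close() was called.
class TrackedResource : Closeable {
    val closed = AtomicBoolean(false)
    override fun close() {
        closed.set(true)
    }
}

// Cancellation arrives after the block's last suspension point, as in the issue.
fun helperDemo(): TrackedResource {
    var created: TrackedResource? = null
    runBlocking {
        launch {
            val outer = coroutineContext.job
            val resource = withContextClosingOnFailure(Dispatchers.Default) {
                TrackedResource().also {
                    created = it
                    outer.cancel() // cancel after the resource exists
                }
            }
            resource.use { /* not reached in this scenario */ }
        }
    }
    return created!!
}
```

The cleanup sits in a catch around withContext, so it runs for any exception that withContext throws after the block has returned, whatever the cause. On the normal path the resource is returned untouched.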

A simple solution would be to make withContext skip the cancellation check once the block has produced a value and always return that value instead.

A counterargument is that this edge case is likely extremely rare, whereas changing the behavior of withContext and withTimeout is a breaking change that affects use cases that are arguably much more common. For example:

withContext(dispatcher) {
    doAVeryLongComputation()
}
updateUiElement()

If doAVeryLongComputation is not cancellable and withContext returns a value instead of throwing an exception, then updateUiElement may run long after cancellation was requested. To fix such code, one would have to write

withContext(dispatcher) {
    doAVeryLongComputation()
}
ensureActive()
updateUiElement()

or make doAVeryLongComputation itself cancellable.
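Making the computation cancellable usually means checking for cancellation between units of work. The following sketch works under that assumption. doAVeryLongComputation here is a hypothetical version that calls currentCoroutineContext().ensureActive() on every iteration. computationDemo cancels the enclosing job at iteration 10 to simulate an external cancellation request.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.*

// Hypothetical cancellable version of doAVeryLongComputation: it checks for
// cancellation before every iteration instead of running to completion.
suspend fun doAVeryLongComputation(iterations: Int, onIteration: (Int) -> Unit): Long {
    var sum = 0L
    for (i in 0 until iterations) {
        currentCoroutineContext().ensureActive() // throws CancellationException once cancelled
        onIteration(i)
        sum += i
    }
    return sum
}

// Cancels the enclosing job at iteration 10 and reports the last iteration
// that ran and whether the stand-in for updateUiElement() ran.
fun computationDemo(): Pair<Int, Boolean> {
    val lastIteration = AtomicInteger(-1)
    val uiUpdated = AtomicBoolean(false)
    runBlocking {
        launch {
            val outer = coroutineContext.job
            withContext(Dispatchers.Default) {
                doAVeryLongComputation(1_000_000) { i ->
                    lastIteration.set(i)
                    if (i == 10) outer.cancel() // simulated external cancellation request
                }
            }
            uiUpdated.set(true) // stands in for updateUiElement()
        }
    }
    return lastIteration.get() to uiUpdated.get()
}
```

The loop stops at the first check after the cancellation, so iteration 10 is the last one to run. The stand-in for updateUiElement() never executes.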


qwwdfsad · 2022-10-27T14:00:43Z

We've been on this road before and decided against this behaviour, breaking quite a lot of code in the meantime: #1813

The problem is that both patterns are error-prone. Coroutines dominate on Android, and the closeable-resource pattern is in much lower demand, so I do not see a compelling reason to change this behaviour and open up a whole new class of application crashes and errors after a coroutines update.

I believe there are other ways to resolve this issue:

- Variations of a universal closeable-resource API (#1191, "Universal API for use of closeable resources with coroutines")
- An IDEA inspection that detects a Closeable resource passing through withContext and suggests doing "something" (as a first version, #3259, "Investigate possible shortcuts for coroutineContext.job.invokeOnCompletion { if (it != null) ... }", may help)
- An ATOMIC start mode for withContext, or even a separately named function for that

dkhalanskyjb · 2022-10-28T11:10:21Z

After an internal discussion with @qwwdfsad, it turns out that the problem with the proposal is much more severe than I thought. Here's the gist of it.

Basically, it's restating the "Problems with atomic cancellation" section of #1813, but expanded on and adapted a bit.

On Android, there's an (unwritten?) rule that a coroutine running on the UI thread must not do anything meaningful after it has been canceled. The reason is that updating a UI after it has been disposed of leads to crashes.

Luckily, UI-related coroutines are also canceled on the UI thread. So, typically, code executing in Dispatchers.Main can be completely certain that no other code will suddenly cancel the coroutine in parallel. This means the coroutine does not have to check for cancellation before every UI-updating operation. If it were canceled, that would be done either by the code directly above (why would one do that?) or by some other code on the main thread, and in the latter case this coroutine wouldn't even wake up after the suspension.

This works well. Changing withContext not to throw CancellationException after the value was acquired, however, would break this property. For example, even this code could become a source of bugs:

withContext(Dispatchers.Main) {
  val value = withContext(Dispatchers.Default) {
    0
  }
  updateUi(value)
}

The reason is that, while 0 is being computed, Dispatchers.Main is free to do other work. That work can include destroying the UI that updateUi operates on, which cancels the scope of the demonstrated coroutine in the process. Once 0 has been computed and assigned to value, the coroutine no longer has the right to do anything other than clean up its resources: updateUi will throw an exception.

So this is not just about delayed results being displayed. In this particular case, cancellation has to be mandatory rather than merely cooperative.
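The guarantee described above can be reproduced on the desktop. In this sketch, runBlocking's single-threaded event loop plays the role of Dispatchers.Main, an assumption made only so the demo runs without Android. The background block is deliberately non-cancellable: it blocks on a CountDownLatch, so it keeps running after the "main thread" cancels the coroutine, and it does produce its value. With the current behaviour, withContext still throws CancellationException, so the stand-in for updateUi(value) is skipped.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.*

// runBlocking's event loop stands in for Dispatchers.Main (assumption).
fun uiDemo(): Pair<Boolean, Int> {
    val valueComputed = AtomicBoolean(false)
    val shownValue = AtomicInteger(-1)
    runBlocking {
        val backgroundStarted = CompletableDeferred<Unit>()
        val finishComputation = CountDownLatch(1)
        val uiJob = launch {
            val value = withContext(Dispatchers.Default) {
                backgroundStarted.complete(Unit)
                finishComputation.await() // blocks this thread; ignores cancellation
                valueComputed.set(true)
                0
            }
            shownValue.set(value) // stands in for updateUi(value)
        }
        backgroundStarted.await()
        // Meanwhile, the "main thread" destroys the UI and cancels the coroutine.
        uiJob.cancel()
        finishComputation.countDown()
    }
    return valueComputed.get() to shownValue.get()
}
```

uiDemo() returns a pair showing that the value was computed, while the shown value stays at -1 because the UI update never happened.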

What about withTimeout, though? I don't see a reason for it to throw once the block has finished. The prompt cancellation guarantee is not even stated in the docs.

dovchinnikov · 2022-10-28T12:49:08Z

I think scoping functions are special enough that a solution on the side of the sent value is unnecessary (like onUndeliveredElement in Channel or onCancellation in CancellableContinuation). The continuation that follows the scoping function should be resumed with both the exception and the value (if one was computed), and the caller should decide how to handle the value/exception.

Possibly, the value could be delivered together with a special CancellationException:

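try {
  withContext(Dispatchers.IO) {
    Socket("localhost", 8080)
  }
} catch (vce: ValueCancellationException) { // proposed exception type, not an existing API
  // as? yields a nullable value, so the call must be null-safe
  (vce.value as? Socket)?.close()
  throw vce
}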

The downside is that exception classes cannot be generic in Kotlin, so the value would have to be typed as Any?.
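The proposal can be approximated in user code to see how it would feel at the call site. This sketch rests on stated assumptions. ValueCancellationException and withContextReportingValue are hypothetical and do not exist in kotlinx.coroutines. Throwable subclasses cannot be generic in Kotlin, so the carried value is typed Any?. The wrapper converts a CancellationException into a ValueCancellationException only when the block had already produced a value.

```kotlin
import java.io.Closeable
import java.util.concurrent.atomic.AtomicBoolean
import kotlin.coroutines.CoroutineContext
import kotlinx.coroutines.*

// Hypothetical: no such class exists in kotlinx.coroutines. Kotlin does not
// allow generic Throwable subclasses, so the value is typed as Any?.
class ValueCancellationException(
    val value: Any?,
    cause: CancellationException,
) : CancellationException(cause.message) {
    init {
        initCause(cause)
    }
}

// Hypothetical wrapper around withContext that reports a dropped value.
suspend fun <T> withContextReportingValue(
    context: CoroutineContext,
    block: suspend CoroutineScope.() -> T,
): T {
    var produced = false
    var value: T? = null
    try {
        return withContext(context) {
            block().also {
                value = it
                produced = true
            }
        }
    } catch (e: CancellationException) {
        // Only wrap when the block actually produced a value that is being dropped.
        if (produced) throw ValueCancellationException(value, e)
        throw e
    }
}

// Hypothetical stand-in for Socket: it only records whether close() was called.
class TrackedResource : Closeable {
    val closed = AtomicBoolean(false)
    override fun close() {
        closed.set(true)
    }
}

fun valueCancellationDemo(): TrackedResource {
    var created: TrackedResource? = null
    runBlocking {
        launch {
            val outer = coroutineContext.job
            try {
                val resource = withContextReportingValue(Dispatchers.Default) {
                    TrackedResource().also {
                        created = it
                        outer.cancel() // cancel after the value exists
                    }
                }
                resource.use { /* not reached in this scenario */ }
            } catch (vce: ValueCancellationException) {
                // The caller decides what to do with the value that would otherwise be dropped.
                (vce.value as? Closeable)?.close()
                throw vce
            }
        }
    }
    return created!!
}
```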

qwwdfsad · 2022-10-28T14:59:40Z

Such a solution has its own downsides: withContext might have launched some children that also happen to catch ValueCancellationException and do something with it. It's a whole can of worms.

WDYT about tooling-assisted help?
Let's start with something straightforward for the following code:

anyScopedBuilderThatReturnsT { // async|withContext|coroutineScope
    ...compute...
    someSubclassOfCloseable
}

Let's suggest an intention (with a quickfix) that replaces it with the following:

anyScopedBuilderThatReturnsT {
    ...compute...
    someSubclassOfCloseable.also { r -> invokeOnCancellation { r.close() } }
}

dovchinnikov · 2022-10-28T15:29:05Z

- We don't really rely on Closeable.
- invokeOnCancellation is not invoked on the same thread (see #3505, "invokeOnCompletion doesn't provide thread guarantees and doesn't allow to suspend").
- I think it's the responsibility of whoever got the value to dispose of it (or schedule its disposal on a proper thread) if it's no longer needed.

dovchinnikov · 2022-10-28T16:14:35Z

> it might have been the case that withContext had launched some children that also happen to catch ValueCancellationException and do something with it, it's the whole can of worms

I don't get this argument. How could a ValueCancellationException thrown from withContext be caught by its children?

alexandru · 2023-04-23T12:29:33Z

Hi all,

For developers like me who are learning Kotlin, this sample is even more of a problem than it looks.

Normally, the acquisition and release of resources can be non-cancellable. So this seems to work (I had to test it to see it actually working):

// NOTE: I don't understand why this works, as it's also making use of `withContext`, 
// but I'm guessing that the `NonCancellable` context will hide the cancellation status.
val socket = withContext(NonCancellable) {
  withContext(Dispatchers.IO) {
    Socket("localhost", 8080)
  }
}
socket.use {
  // ...
}

It requires training for users, but as a mental model, you can simply say that acquisition and release have to be NON-CANCELLABLE most of the time, so as a user you need to ensure that they are. This is an easy rule to learn, especially after getting burned once or twice.
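On the release side, the usual way to keep a suspending cleanup step from being cut short is to run it in withContext(NonCancellable) inside finally. In this minimal sketch, delay(10) stands in for a hypothetical suspending close operation. After the job is cancelled, the finally block still runs its suspending step to completion.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.*

// A suspending release step that still completes after cancellation because
// it runs inside withContext(NonCancellable).
fun nonCancellableReleaseDemo(): Boolean {
    val released = AtomicBoolean(false)
    runBlocking {
        val job = launch {
            try {
                awaitCancellation() // the "use" phase: suspends until cancelled
            } finally {
                withContext(NonCancellable) {
                    delay(10) // stands in for a hypothetical suspending close()
                    released.set(true)
                }
            }
        }
        yield() // let the job start and reach awaitCancellation()
        job.cancelAndJoin()
    }
    return released.get()
}
```

Without the NonCancellable wrapper, delay would throw CancellationException as soon as it is called in the already-cancelled coroutine, and released would stay false.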

The issue here is that this Socket constructor also connects to the network, and Socket#connect is interruptible, having apparently gained new capabilities with Project Loom. Being able to interrupt blocking Java socket operations is something we may want, so this could be an example of an acquisition that can, and should, be cancellable.

Another use case for a cancellable acquisition is a lock:

lock.acquire()
try {
  //...
} finally {
  lock.release()
}

Which could make use of Kotlin's coroutines:

suspend fun <T> CoroutineScope.withLock(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    runInterruptible(Dispatchers.IO) {
        while (!lock.compareAndSet(false, true)) {
            Thread.onSpinWait()
            if (Thread.interrupted())
                throw InterruptedException()
        }
    }
    try {
        return block(this)
    } finally {
        lock.set(false)
    }
}

As a Kotlin beginner, I find the behaviour of this code very unintuitive because it invalidates my mental model of what can be cancelled. For example, I assumed that runInterruptible can only be interrupted by Java code, and that once past that barrier (e.g., once we get past the Thread.interrupted check), the result will invariably be returned to the caller.
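The surprise can be reproduced deterministically. This sketch is an illustration and is not the author's code. The block passed to runInterruptible acquires the lock and then cancels the enclosing job itself. This simulates a cancellation that arrives after the lock was taken but before the block returns. The block finishes normally, yet runInterruptible, which is built on withContext, still throws CancellationException. As a result, the try/finally that would release the lock is never entered.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.*

// Returns true if the lock is still held afterwards, i.e. it leaked.
fun runInterruptibleLeakDemo(): Boolean {
    val lock = AtomicBoolean(false)
    runBlocking {
        launch {
            val outer = coroutineContext.job
            runInterruptible(Dispatchers.IO) {
                check(lock.compareAndSet(false, true)) // lock acquired
                outer.cancel() // cancellation arrives after the acquisition succeeded
            }
            try {
                // critical section (never reached in this scenario)
            } finally {
                lock.set(false)
            }
        }
    }
    return lock.get()
}
```

runInterruptibleLeakDemo() returns true, which means the lock is left held and any later acquirer would spin forever.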

And fixing this particular sample is even more difficult, as the release of the resource shouldn't be done twice:

suspend fun <T> CoroutineScope.withLock(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    var isLocked = false
    var throwFromUserCode = false
    try {
        runInterruptible(Dispatchers.IO) {
            while (!lock.compareAndSet(false, true)) {
                Thread.onSpinWait()
                if (Thread.interrupted())
                    throw InterruptedException()
            }
            isLocked = true
        }
        try {
            throwFromUserCode = true
            return block(this)
        } finally {
            lock.set(false)
        }
    } catch (e: Throwable) {
        if (!throwFromUserCode && isLocked)
            lock.set(false)
        throw e
    }
}
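A somewhat shorter variant is possible if the flag is consulted only when runInterruptible itself throws, which lets the user block keep an ordinary try/finally. This is a sketch, not a vetted implementation. withLockSimplified is a hypothetical name. The busy-wait on Dispatchers.IO, which needs Java 9+ for Thread.onSpinWait, is kept only to mirror the original.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.*

suspend fun <T> CoroutineScope.withLockSimplified(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T,
): T {
    var acquired = false
    try {
        runInterruptible(Dispatchers.IO) {
            while (!lock.compareAndSet(false, true)) {
                Thread.onSpinWait()
                if (Thread.interrupted()) throw InterruptedException()
            }
            acquired = true
        }
    } catch (e: Throwable) {
        // runInterruptible can throw CancellationException even after the loop
        // above succeeded; in that case the lock is ours and must be released.
        if (acquired) lock.set(false)
        throw e
    }
    // No suspension point between the acquisition and this try, so the
    // finally below is the only other place that releases the lock.
    try {
        return block(this)
    } finally {
        lock.set(false)
    }
}
```

The catch covers only runInterruptible, so at most one of the two handlers ever releases the lock.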

In my opinion™️, resource acquisition/release (and cancellation/interruption in general) is very unsafe if you don't know which instruction is guaranteed to execute next. In Java/Kotlin, we can think of ; as a sequencing operator, which corresponds to flatMap / bind when working with monadic types. So, in a sequence of instructions like A ; B ; C ; D, users have to know which ; can be interrupted and which can't, and they also need control over it. The boundaries have to be crystal clear, and the elephant in the room is that Java's blocking I/O interruption seems clearer in this respect. Given everything else, the cancellation of withContext seems arbitrary and not at all expected in the context of resource acquisition. Resource acquisition may not seem like a priority, but the whole point of having cancellable coroutines is to prevent leaks. So this can be a problem even on Android: I think that the more libraries get converted to Kotlin, the more people will rely on suspend functions for resource handling.

Furthermore, IMO, I would love to see a universal way to allocate/deallocate resources in Kotlin, tied to the current scope, but it won't change the fact that, given the prevalence of withContext, the boundaries between what's cancellable and what's not are unclear. The issue will remain and show up in other use cases, or in the code of those who don't learn about a new resource-handling facility fast enough.

PS: Kotlin's Coroutines are wonderful, thank you so much for your contributions! ❤️

PPS: Here is a JBang script that shows the deadlock; it may be useful for others who want to play with it:

///usr/bin/env jbang "$0" "$@" ; exit $?

//JAVA 17+
//KOTLIN 1.8.20
//DEPS org.jetbrains.kotlinx:kotlinx-coroutines-core:1.7.0-RC

import java.util.concurrent.atomic.AtomicBoolean
import kotlinx.coroutines.CancellationException
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.cancelAndJoin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.runInterruptible

fun main() = runBlocking {
    val lock = AtomicBoolean(false)
    repeat(10000) { index ->
        val job = launch {
            println("Starting job $index...")
            withLock(lock) {
                delay(1)
                println("Job $index done.")
            }
        }
        if (index % 2 == 0)
            launch {
                println("Cancelling job $index")
                try {
                    job.cancelAndJoin()
                } catch (e: CancellationException) {}
            }
    }
}

suspend fun <T> CoroutineScope.withLock(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    return this.withLockLeaky(lock, block)
    // return this.withLockNonLeaky(lock, block)
}

suspend fun <T> CoroutineScope.withLockLeaky(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    runInterruptible(Dispatchers.IO) {
        while (!lock.compareAndSet(false, true)) {
            Thread.onSpinWait()
            if (Thread.interrupted())
                throw InterruptedException()
        }
    }
    try {
        return block(this)
    } finally {
        lock.set(false)
    }
}

suspend fun <T> CoroutineScope.withLockNonLeaky(
    lock: AtomicBoolean,
    block: suspend CoroutineScope.() -> T
): T {
    var isLocked = false
    var throwFromUserCode = false
    try {
        runInterruptible(Dispatchers.IO) {
            while (!lock.compareAndSet(false, true)) {
                Thread.onSpinWait()
                if (Thread.interrupted())
                    throw InterruptedException()
            }
            isLocked = true
        }
        try {
            throwFromUserCode = true
            return block(this)
        } finally {
            lock.set(false)
        }
    } catch (e: Throwable) {
        if (!throwFromUserCode && isLocked)
            lock.set(false)
        throw e
    }
}

…ress indicator is shown"

This reverts commit d2576dd2d846daa0b022e91991660fee85f03280.

Consider this piece:

withContext(Dispatchers.EDT) {
  showIndicatorInUI(project, taskInfo, indicator)
}

`showIndicatorInUI` was completed but `withContext` resumed with `CancellationException` if the coroutine is cancelled concurrently, which happens if the task takes about 300ms and completes while `showIndicatorInUI` is executed. `CancellationException` from `withContext` prevented `indicator.finish`.

`withContext` resuming with CE even if block completed without CE is tracked here: Kotlin/kotlinx.coroutines#3504

GitOrigin-RevId: e684335dcb7bb2eb8abb7399eec7cdf1788470d2

dkhalanskyjb added the bug label Oct 26, 2022

globsterg mentioned this issue Jun 28, 2024: Extend the KDoc for CoroutineStart #4147 (Merged)

dkhalanskyjb mentioned this issue Apr 15, 2025: Coroutines guide should warn about cancellation exceptions when transferring resources across suspension points #4413 (Open)
