Coroutine/Job that doesn't join its children when cancelled

Hi, sorry for the long post. I’ve already done some research, so I have some initial thoughts and a POC. Feel free to skip the parts that don’t interest you.

Short story

I’m looking for a legitimate way to abandon my children on demand. Well… that didn’t really sound right…

I mean something similar to coroutineScope(), but if the outer coroutine is cancelled, then only signal the inner coroutine to cancel and don’t wait for it to complete. My questions are:

  1. Is this always a bad idea, with no exceptions, even if I understand the consequences? I give some arguments in the Solution section.
  2. What is the best way to implement this? What about my implementations below?

Problem

I’m experimenting with different ways of handling blocking operations (mostly IO) with coroutines. While Dispatchers.IO solves the problem of blocking the current thread, it doesn’t really help with cancelling, aborting the operation, enforcing a timeout, etc.:

withTimeout(1000) {
    withContext(Dispatchers.IO) {
        waitInterruptibly()
    }
}

This code runs indefinitely, stalling every parent coroutine in the chain. The worst part is that there isn’t much we can do about it. If the operation is blocking and provides no means of cancelling it, we are basically screwed.
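To make the stall concrete, here is a minimal sketch. Thread.sleep(500) is only a stand-in for a blocking call with no cancellation support, and measureStall is a hypothetical helper name, not something from kotlinx.coroutines. As far as I understand, the timeout still fires after 100 ms, but withTimeout() can only return once the blocking call has finished.

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// Hypothetical helper: returns (elapsed millis, whether the timeout was reported).
fun measureStall(): Pair<Long, Boolean> = runBlocking {
    var timedOut = false
    val elapsed = measureTimeMillis {
        try {
            withTimeout(100) {
                withContext(Dispatchers.IO) {
                    // Stand-in for a blocking call that cannot be cancelled.
                    Thread.sleep(500)
                }
            }
        } catch (e: TimeoutCancellationException) {
            // Reported only after the blocking call has returned.
            timedOut = true
        }
    }
    elapsed to timedOut
}

fun main() {
    val (elapsed, timedOut) = measureStall()
    println("timedOut=$timedOut, elapsed=${elapsed}ms")
}
```

The elapsed time should be close to the duration of the blocking call rather than to the timeout, which is exactly the problem described above.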

Solution

The solutions I’m considering are: interrupting the blocked thread (runInterruptible()), closing the IO resource if possible, but also just abandoning the blocked coroutine and resuming the parent coroutines.
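For comparison, the first option could look roughly like this. It is only a sketch and assumes the blocking call reacts to thread interruption; Thread.sleep(2_000) stands in for such a call, and interruptDemo is a hypothetical name.

```kotlin
import kotlinx.coroutines.*
import kotlin.system.measureTimeMillis

// Hypothetical helper: returns (elapsed millis, result of the timed block).
fun interruptDemo(): Pair<Long, String?> = runBlocking {
    var result: String? = "unset"
    val elapsed = measureTimeMillis {
        result = withTimeoutOrNull(100) {
            // runInterruptible interrupts the worker thread when the coroutine is cancelled.
            runInterruptible(Dispatchers.IO) {
                Thread.sleep(2_000) // stand-in for an interruptible blocking call
                "done"
            }
        }
    }
    elapsed to result
}

fun main() {
    val (elapsed, result) = interruptDemo()
    println("result=$result, elapsed=${elapsed}ms")
}
```

This only helps when the underlying operation actually honours interruption. Many blocking IO calls, such as reads on a plain socket stream, do not, which is why the other two options are still on the table.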

Yes, I know, it breaks structured concurrency, it leaks background tasks, etc., but… is it really that bad? I think this is not that different from creating a CoroutineScope() and invoking cancel() without joining it. The documentation suggests this is a proper way of handling cancellation. lifecycleScope in Android does the same. In both cases we actually leak coroutines that are in the cancelling state, possibly leaking some resources as well, but we ignore this fact, assuming this is just fine.
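A small sketch of what I mean by "cancel() without joining". The slow cleanup is simulated with delay(300) under NonCancellable, and Snapshot and cancelWithoutJoin are hypothetical names used only for this illustration.

```kotlin
import kotlinx.coroutines.*

// Hypothetical holder for the two Job flags we want to observe.
data class Snapshot(val isActive: Boolean, val isCompleted: Boolean)

fun cancelWithoutJoin(): Pair<Snapshot, Snapshot> = runBlocking {
    val scope = CoroutineScope(Dispatchers.Default)
    val started = CompletableDeferred<Unit>()
    val job = scope.launch {
        try {
            started.complete(Unit)
            delay(Long.MAX_VALUE)
        } finally {
            // Simulated slow cleanup that keeps the coroutine in the cancelling state.
            withContext(NonCancellable) { delay(300) }
        }
    }
    started.await()
    scope.cancel() // only signals cancellation, does not wait for the finally block
    val rightAfterCancel = Snapshot(job.isActive, job.isCompleted)
    job.join()     // explicit join, for comparison
    val afterJoin = Snapshot(job.isActive, job.isCompleted)
    rightAfterCancel to afterJoin
}

fun main() {
    val (rightAfterCancel, afterJoin) = cancelWithoutJoin()
    println("right after cancel(): $rightAfterCancel")
    println("after join():         $afterJoin")
}
```

Right after cancel() the job should be neither active nor completed, so the coroutine is still running its cleanup while the caller has already moved on.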

Also, in the case of IO we often can assume that the blocking operation will eventually fail anyway, with some kind of internal timeout. We just don’t want to wait for it and we don’t care when exactly it will happen. Furthermore, in many cases resuming the stalled coroutine could actually help us clean up the situation, because we already have resource-cleaning code there. See this example:

sock.getInputStream().use { input ->
    withTimeout(10000) {
        input.readNBytes(16)
    }
}

This causes a situation similar to a deadlock. We can’t close the stream, because we are waiting on the read, and the read can’t fail, because we never close the stream. If we abandon the coroutine waiting on the read operation, it will be leaked for mere milliseconds and then it will finish with an exception. Everything will be cleaned up nicely and without any additional code for handling cancellation.

Implementation

I think the cleanest solution would be to have a Job implementation that works the same as a regular Job, but doesn’t join its children on cancel. I explored the Job API, but it seems impossible without hacking. Almost everything there is internal.

Another solution is to not create a parent-child relationship at all and to pass results and failure states between the coroutines manually. I don’t like this solution as I still consider these coroutines “a family”, but at least this is possible with the public API.

I came up with two possible implementations, but I’m not confident they’re entirely correct:

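import kotlinx.coroutines.*
import kotlin.coroutines.coroutineContext

@OptIn(DelicateCoroutinesApi::class)
suspend fun <R> selfishScope(block: suspend CoroutineScope.() -> R): R {
    // Start the block outside the caller's Job hierarchy, keeping the rest of the context.
    val deferred = GlobalScope.async(coroutineContext.minusKey(Job)) { block() }
    return try {
        deferred.await()
    } catch (e: CancellationException) {
        // Signal cancellation to the child, but don't wait for it to finish.
        deferred.cancel(e)
        throw e
    }
}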

This one is simpler, but the drawback is that we can’t distinguish whether an exception originated from the parent or from the child. If the child was cancelled, we catch the exception and cancel the child again with its own exception. I’m not sure how bad this is.

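import kotlinx.coroutines.*
import kotlin.coroutines.resume
import kotlin.coroutines.resumeWithException

@OptIn(DelicateCoroutinesApi::class, ExperimentalCoroutinesApi::class)
suspend fun <R> selfishScope(block: suspend CoroutineScope.() -> R): R {
    return suspendCancellableCoroutine { cont ->
        // Start the block outside the caller's Job hierarchy.
        val child = GlobalScope.async(cont.context.minusKey(Job)) { block() }

        // Forward the child's result or failure to the caller.
        child.invokeOnCompletion { e ->
            if (e != null) {
                cont.resumeWithException(e)
            } else {
                cont.resume(child.getCompleted())
            }
        }

        // If the caller is cancelled, only signal the child; don't wait for it.
        cont.invokeOnCancellation { e ->
            child.cancel(e as? CancellationException)
        }
    }
}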

This is probably closer to the proper solution. I’m just less confident about it, because it is pretty low-level.

Both solutions seem to work and they do what I expect.
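This is roughly how I checked the behaviour, as a sketch rather than a proper test. It repeats the first implementation so that it is self-contained. Thread.sleep(1_000) stands in for the uncancellable blocking call, and abandonDemo and childFinished are hypothetical names.

```kotlin
import kotlinx.coroutines.*
import kotlin.coroutines.coroutineContext
import kotlin.system.measureTimeMillis

@OptIn(DelicateCoroutinesApi::class)
suspend fun <R> selfishScope(block: suspend CoroutineScope.() -> R): R {
    val deferred = GlobalScope.async(coroutineContext.minusKey(Job)) { block() }
    return try {
        deferred.await()
    } catch (e: CancellationException) {
        deferred.cancel(e) // signal only, no join
        throw e
    }
}

// Hypothetical helper: returns (result, elapsed millis, whether the child had finished at timeout).
fun abandonDemo(): Triple<String?, Long, Boolean> = runBlocking {
    val childFinished = CompletableDeferred<Unit>()
    var result: String? = "unset"
    val elapsed = measureTimeMillis {
        result = withTimeoutOrNull(100) {
            selfishScope {
                withContext(Dispatchers.IO) {
                    try {
                        Thread.sleep(1_000) // stand-in for an uncancellable blocking call
                    } finally {
                        childFinished.complete(Unit)
                    }
                }
                "done"
            }
        }
    }
    val finishedAtTimeout = childFinished.isCompleted
    // The abandoned child keeps running in the background; wait for it only to end the demo cleanly.
    childFinished.await()
    Triple(result, elapsed, finishedAtTimeout)
}

fun main() {
    val (result, elapsed, finishedAtTimeout) = abandonDemo()
    println("result=$result, elapsed=${elapsed}ms, childFinishedAtTimeout=$finishedAtTimeout")
}
```

The caller should get control back shortly after the timeout, while the blocking call is still in progress on its IO thread. Note that the blocking part has to run on its own dispatcher here, because the inherited context still contains the caller's dispatcher.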

In both cases I’m not entirely sure about the context I pass to async(). I try to mimic the behavior of coroutineScope() and other utils, so I inherit the parent context. I just don’t know if there are elements other than Job that I should remove.

Also, I’m not sure how it behaves when the parent is cancelled at the same time as the child completes. I believe it should be fine, I’m just not 100% sure about it.

What do you think?
