About suspend functions that spawn coroutines that persist

I understand that it is generally not advisable for suspend functions to spawn coroutines that don’t end when the function does. One example would be a suspend function that starts a background coroutine for some sort of regular network ping or keep-alive mechanism. Doing that can break structured concurrency and lead to resource leaks.

However, sometimes functions have to start coroutines that keep running until some sort of associated stop function cancels these coroutines. The recommendation I’ve seen so far has been to use non-suspend functions, have them spawn coroutines with launch or async, and return the Job or Deferred. So far so good.

But - in that function, suspend functions may have to be called. One example is a handshake procedure. So I am thinking that in some cases, it may be OK to use a suspend function.

Here is an example. This is roughly how my code looks like in an application I am writing.

fun connectAsync(scope: CoroutineScope): Deferred<Unit> {
    return scope.async {
        // suspending functions are called here, in the newly created coroutine
        device = openDevice()
        device.sendPacket(someHandshakePacket) // sendPacket is a suspend function that internally uses the IO dispatcher
        val response = device.receivePacket() // receivePacket is a suspend function that internally uses the IO dispatcher 
        processResponse(response)
        // [...] some more handshake processing here, including invocation of various suspend functions

        keepaliveJob = scope.launch {
            runKeepAliveLoop()
        }
    }
}

suspend fun disconnect() {
    keepaliveJob?.cancelAndJoin()
    keepaliveJob = null
    
    device?.sendPacket(createSessionEndPacket())
    device?.closeDevice()
    device = null
}

Since by convention suspending functions should not accept a CoroutineScope, I made connectAsync non-suspending. It is important though to be able to wait for the connect procedure to finish, hence the Deferred return value - the user can then run await() in a coroutine to wait for the connection to be established. (This is also why I can’t just stick the sendPacket and receivePacket calls inside the coroutine that is associated with keepaliveJob.)

I doubt though that insisting on making connectAsync non-suspending here really is helpful, especially if I have to wait for the connection to be established. Look at a suspending variant of connectAsync (disconnect stays the same):

suspend fun connectAsync(scope: CoroutineScope) {
    device = openDevice()
    
    device.sendPacket(someHandshakePacket) // sendPacket is a suspend function that internally uses the IO dispatcher
    val response = device.receivePacket() // receivePacket is a suspend function that internally uses the IO dispatcher 
    processResponse(response)
    // [...] some more handshake processing here, including invocation of various suspend functions

    keepaliveJob = scope.launch {
        runKeepAliveLoop()
    }
}

It looks similar, except that it suspends the calling coroutine instead of one that is newly spawned by async. In both versions, a background coroutine is started. In both versions, that coroutine’s lifetime is managed by disconnect and the “scope” coroutine scope, and not by the function itself. (Arguably though, the suspending variant shouldn’t have an “Async” suffix in its name.) But - in the second version, I don’t need to worry about having to call await.

Furthermore, the non-suspending variant becomes more cumbersome to use if in a higher level class additional steps have to be taken after connectAsync finishes. Example:

fun higherLevelConnectAsync(scope: CoroutineScope): Deferred<Unit> {
    return scope.async {
        connectAsync(scope).await()
        // additional steps that are necessary at a higher level are done here
    }
}

Contrast this with a suspending variant:

suspend fun higherLevelConnectAsync(scope: CoroutineScope) {
    connectAsync(scope)
    // additional steps that are necessary at a higher level are done here
}

The suspending variant composes much nicer and cleaner in my opinion.

To sum up, my question is: If a suspend function does spawn a coroutine that persists even after the function finishes, but the lifetime of that coroutine is well defined and controlled, is there any other reason why this shouldn’t be done, and why in the examples above connectAsync still should never be suspending?

This is probably a matter of taste, but I prefer to write everything as suspend functions. I rarely return Job/Deferred from public functions. Even if I have a task that is meant to run indefinitely, for example it is a function that starts some kind of a service, I still create suspend fun runService(). This way everything is synchronous by default and the caller can go async by explicitly using launch(). If anyone thinks this is a bad practice, I’m open to discussion and suggestions.

Also, I think it is a bad idea to keep the code for connecting and for keep alive in the same function. This way it is hard/impossible to detect when we finished connecting and handshaking, so we can start sending/receiving the data.

1 Like

In my case, as soon as the handshaking is over, the keep-alive loop must be started. If not, the other side will eventually terminate the connection. If I design my connect function as a suspending one, then it is pretty clear - if it finishes normally (= without an exception), then the connection and handshaking are over, keepalive is running, communication can commence.

But I share your view about Job/Deferred. I have been doing that because of that convention, but in a case like this, where there’s handshaking involved (which in turn involves suspending function), having to use them feels forced.

I’d also like to ping @nickallendev , since he helped me with a similar question before. In particular, I am referring to this quote:

Your suspend methods should not take a CoroutineScope and should wait for all work to finish. Methods that launch coroutines without waiting should not suspend and should take a CoroutineScope parameter or receiver.

This is the convention I was talking about. Wouldn’t my case be sort of a special one that warrants an exception to that convention?

1 Like

I think this convention makes sense, because suspending function that accepts a CoroutineScope and/or returns Job is confusing to its users. Whenever we implement a function that runs for a prolonged time, we can make it either synchronous (suspend) or asynchronous (CoroutineScope, Job). Both options are explicit to the caller and the caller knows how this function handles its lifetime or how to wait for it to finish.

suspend with CoroutineScope is pretty confusing, it is like the function is synchronous and asynchronous at the same time. It’s not entirely blue and not entirely red, so… purple? The caller may suspect that there are 2 separate long-running tasks. But they may assume everything has been finished after the function returned or that it returns immediately, even if it is a suspend function.

Even if we use this technique as a convenient way to perform 2 sequential tasks, I would consider it really as a hack. What if we need to run 3 such tasks? If we can’t really split this function into multiple separate functions then maybe it would be better to return 2/3 Job objects? Or return a Flow/Sequence of events. It would be more explicit than suspend + async.

Maybe this is just my understanding of coroutines and suspending functions, I’m definitely not an expert here. But Roman Elizarov said something similar in one of his articles:

Suspending functions, on the other hand, are designed to be non-blocking and should not have side-effects of launching any concurrent work. Suspending functions can and should wait for all their work to complete before returning to the caller

Yes, this is true. I think that in my example, the key part is to accurately specify what the coroutinescope is for, and what the suspending function does. For example, it would be weird to pass a coroutinescope just to start coroutines inside the suspend function without any good reason (even weirder if the function waits for those coroutines to finish - this would easily break structural concurrency). But, if by definition, part of a function’s job is to set up background functionality like keep-alive to ensure that followup operations work correctly, then I think this can work and not be confusing. In this particular example, I can just call the argument something like keepaliveLoopScope.

Also, let’s not forget that the connect counterpart, disconnect, would teardown the coroutine that runs the keepalive loop, and it is very intuitive and logical that disconnect must be called as part of a clean shutdown in a program, so this would handle coroutine cleanup in a straightforward way.

One other option of course would be to have a separate, non-suspending startKeepalive function, along with a stopKeepalive counterpart. However, I think that is actually more confusing, because it allows for a weird state if connect is called but not startKeepalive. This would set up something that would eventually cause subsequent IO operations to fail because the remote side would terminate the connection. This implies then that the keepalive loop must be started if a connection is set up. Either, both connect and keepalive are run, or neither are. So, might as well integrate startKeepalive into connect to rule out these weird undefined mixed states where only one of them is started.

Maybe something like fun CoroutineScope.startConnection() (or suspend fun runConnection()) and add a separate function: suspend fun waitConnected()? So it could be used like this:

service.startConnection()
service.waitConnected()

// do something

service.disconnect()

Well that would be similar to connectAsync and then calling the returned Deferred’s await() function, right?

However, I do admit that passing a coroutinescope to a connect function for a keepalive coroutine still is a little weird as far as structural concurrency is concerned, since it may be confusing to the user to know just what scope they should pass to, what scope would make sense. So perhaps in this case it would then make sense to really set up a completely internal coroutinescope in the connect function, like this:

internalScopeMainJob = SupervisorJob()
internalScope = CoroutineScope(Dispatchers.Default + internalScopeMainJob)

Since the lifecycle of the keepalive coroutine would anyway be tied to the connect and disconnect calls, it would be straightforward to do this in connect and then call internalScopeMainJob.cancelAndJoin in disconnect. This would set up a separate coroutine hierarchy, but would be purely internal. Thoughts?

Yes, but honestly, I consider even your first example confusing from the caller perspective. If I see a function like this:

fun connectAsync(scope: CoroutineScope): Deferred<Unit>

Then I assume this function doesn’t do anything more than connecting and that after awaiting on deferred, nothing else is running in the background. But actually, this sneaky function attached some additional subtasks to my scope. I would not expect that to happen.

startConnection(), waitConnected(), stopConnection() (or start/stopService) is different, because it is clear on the fact that there is something long-running going on. The caller doesn’t have to know internals or what is this long-running task. They don’t have to know about any keep-alive stuff. All they need to know is that they started something and they have to stop it later.

But I think I said enough and further discussion between two of us would be unfruitful. We need to wait for others opinions :slight_smile:

That is a good point, didn’t consider that! It is quite sneaky indeed. I like to extensively document API functions and clarify such things in there, which is perhaps why I missed this (the code not being self explanatory, which isn’t a good thing).

And yeah, it would be great if others could share their opinions as well. Just one more question - what would waitConnected do in your example? Where would a keep-alive loop be started?

It could be implemented for example like this:

class MyConnection {
    private val connectedJob = Job()

    fun CoroutineScope.startConnection() = launch {
        openConnection()
        performHandshake()

        connectedJob.complete()

        runKeepAliveLoop()
    }
    
    suspend fun waitConnected() {
        connectedJob.join()
    }
}

This approach has one disadvantage comparing to “suspend + async” function. It allows to perform operations before the connection has been fully established. The user has to be aware to use waitConnected(). Maybe CoroutineScope.startClient(), suspend connect(), suspend stopClient() would be better. And/Or make startClient() a factory function, not a member function. Then you can even make it AutoCloseable to use it like this:

startClient().use { client ->
    client.connect()
    // use client
}

Although this is tricky as close() can’t be suspend. It really depends on your case and taste :slight_smile:

1 Like

Every rule has exceptions, right?

suspend fun <T> Flow<T>.stateIn(scope: CoroutineScope): StateFlow<T>

If you don’t provide an initial value, stateIn suspends until it receives one. That’s in kotlinx-coroutines.

100% agree

You could move the other operations to a MyEstablishedConnection class returned from waitConnected. Or the other operations could call waitConnected internally.

A callback API would be really clear, if you can get away with it:

suspend fun connect(block: suspend MyConnection.() -> Unit) {
    val connection = setupConnection()
    coroutineScope {
        launch { connection.keepAliveLoop() }
        connection.block()
    }
    connection.cleanup()
}
    

It’s a connectionScope, with a lifetime that matches the connection.

It’s perfectly fine to write a suspending connectAsync, but a CoroutineScope bounds the lifetime of a coroutine, and that’s not what you want to use it for. You want the connection process to be complete by the time connectAsync returns.

If, for some reason, you really don’t want to run the connection process in the same context as the connectAsync function itself, then you should probably take a Dispatcher argument, not a CoroutineScope, or maybe don’t take anything and just use an appropriate Dispatcher. Note that I say “probably”, because it really depends on the reason why you don’t want to run the connection in the caller’s context.

Hmm, why Dispatcher? Dispatcher is for a much different thing. Structured concurrency is handled by passing CoroutineScope (technically Job stored in CoroutineContext of the scope). Passing scopes either as a regular param or a receiver, is a common technique for launching coroutines in a context specified by the caller.

Yes, exactly. If you want the connection coroutine to be done by the time the connect call returns, then there is no problem with inheriting the coroutine context from the caller. Its job can be a child of the parent job.

Why is the scope parameter desired at all, in that case?

The most common valid reason is probably because the connection coroutine should be run on a different dispatcher.

Keep in mind that the connect function must start the keepalive coroutine. It is absolutely essential. A connection without keepalive is an error.

So far, spawning an internal coroutine scope seems to work well (I mean this method). The downside is that that scope isn’t a child of the connect caller, but explicitly cancelling the internal scope in the disconnect function prevents leaks.

Starting a keep-alive coroutine is a good reason for using a separate CoroutineScope, but it’s not good practice to take that in a parameter to connect.

Are all the callers required to provide the same CoroutineScope? What happens if they don’t? Either way, your API is weird. That scope should be constructor-injected, or created and maintained along with the other data structures that are used to manage the kept-alive connections.

Yeah, come to think of it, the internal scope approach makes more sense overall. That means there’s no CoroutineScope argument, and instead, connect() creates the Job & scope. The lifespan of the coroutine that runs keepalive is logically tied to the lifespan of the overall connect session, not to an external scope, so it would be odd to pass such an external coroutine scope as an argument which would allow for cancelling coroutines from the outside. Imagine this case for example: After the connect() call, keepalive runs, but the coroutine scope that was passed to connect() is cancelled. The connection is now in an odd state, since the keepalive coroutine gets cancelled, but the connection is still open. By starting a separate internal scope, this is avoided.

I often design my services in a way that they receive optional CoroutineScope in their constructor/factory and if not provided then they create their own scope. Also, I make them AutoCloseable. Or, as I said earlier, I run the service synchronously within a suspend function (not sure if the latter isn’t an anti-pattern).

Structured concurrency was introduced for a reason. It is very helpful when developing asynchronous code, it makes implementation much simpler and more reliable. By creating the CoroutineScope inside the service we basically opt-out of structured concurrency and we fall back to a classic “remember to close me in a case of any errors or cancellations” approach. We have to pay more attention to designing everything properly and we’re risking leaking coroutines.

While I am happy with my internal scope, I do agree with this. I only did the internal scope after seeing that it fits my use case well and after I made absolutely sure that the scope is properly cancelled and cleaned up in disconnect(). I am in fact still using structured concurrency. It just isn’t a single hierarchy anymore, but two separate ones. Adding something like AutoCloseable to such a construct is strongly recommended though, yeah.