Suspend-safety - something to keep in mind

I am in the process of converting a larger codebase to suspending code and thinking about best practices and what would need to be accounted for. One thing that crossed my mind is the following:

In blocking code, if one knew that several threads may access a class, it had to be made thread-safe. This was something to keep in mind but was easily solved by wrapping the critical sections in synchronized { ... }.

In Kotlin with Coroutines, the same will be achieved with myMutex.withLock { ... }, fair enough.

But there is something else with suspending methods. Here is an example. Assume takeSomeSalt and takePot are mutating methods in other classes.

Blocking code:

fun cookPotatos(potatos: List<Potato>) {
  // make sure we are the only one in the kitchen
  synchronized { 
    val salt = saltJar.takeSomeSalt()
    stove.turnOn()
    val pot = cupboard.takePot()
    pot.add(tap.getWater())
    pot.add(salt)
    pot.add(potatos)
    stove.put(pot)
    Thread.sleep(25*60*1000)
    stove.turnOff()
  }
}

(Non-parallized) suspending code, looks the same. That’s good!: (By the way: who can turn this into parallelized code that is still easy to read? I tried it but failed)

suspend fun cookPotatos(potatos: List<Potato>) {
  // make sure we are the only one in the kitchen
  mutex.withLock {
    val salt = saltJar.takeSomeSalt()
    stove.turnOn()
    val pot = cupboard.takePot()
    pot.add(tap.getWater())
    pot.add(salt)
    pot.add(potatos)
    stove.put(pot)
    delay(25*60*1000)
    stove.turnOff()
  }
}

But here is the thing:
Suspending methods could be cancelled at any point during their execution, depending on the scope it had been called with. The receiver has no control over in which scope it had been called but at the same time, our Potato-Cooker-class needs to make sure that it itself is “suspend-safe”, i.e. manages its data in a way that does not lead to an inconsistent state.

This is not that obvious by just looking at the code, because they look like almost any other method. But it is something to worry about, basically for every single method that is suspending, this has to be kept in mind. I find it easy to miss, concept-wise.

Every critical section that is guarded by a mutex is critical for a reason. In this case, we might even burn down the kitchen if we leave the stove on :fire: . At least for my use case, basically every time I use a mutex, I also want it to not be cancellable.

What to do about it
I think it should be almost a pattern with Kotlin coroutines that every mutating method in a class that manages data by default implements a solution that makes the code non-cancellable, just as it is a pattern to first start with wrapping everything in synchronized when starting to make a class thread-safe (and optimize later).

There are two possibilities to make it non-cancellable, and I think the second is better.

  1. by wrapping the code in withContext(NonCancellable) { ... }. If I understand this correctly, the downside of this would be that if for example the coroutine was started by a View which should be cleared, it cannot be garbage collected until that non-cancellable is done. So while the potatos are cooking, the View is still around and leaking memory.
  2. by writing the method like this:
suspend fun cookPotatos(potatos: List<Potato>) = myScope.launch{ ... }.join()

Why even keep the suspend? This way, the potatos will be cooked independent of whether the View is still there, the only thing that is cancelled when the View is cleared is the join(), so same as if the method was fire and forget (like this:)…

fun cookPotatos(potatos: List<Potato>) = myScope.launch { ... }

…only that, the View will know when the potatos are done and could update its UI accordingly if it wanted. If the View doesn’t want that, but have it like a fire-and-forget, it could simply launch that method in its lifecycleScope.
So in this regard, the suspend fun f() = launch{ ... }.join() pattern is the most flexible.

Plus, it is consistent with how a mutating-non-cancellable method that returns a result would need to look like:

suspend fun cookPotatos(potatos: List<Potato>) = myScope.async{ ... }.await()

A resource for this is also what Google recommends for Android development -

So, what do you think? Would you also do it like that? Or would you prefer the NonCancellable pattern? And why?

2 Likes

One issue is exception handling with structured concurrency.

Let’s consider:

suspend fun cookPotatos(potatos: List<Potato>) = myScope.launch { sometimesThrows() }.join()

What happens when sometimesThrows() actually does throw?

  1. If myScope has a supervisor job or no job, the exception handler will be called
  2. If myScope has a regular job, then myScope will fail
    The cookPotatos caller never learns that the operation

Now let’s consider the async variant:

suspend fun cookPotatos(potatos: List<Potato>) = myScope.async { sometimesThrows() }.await()

What happens when sometimesThrows() actually does throw?

  1. If myScope has a regular job, then myScope will fail
  2. the await() call will fail notifying the cookPotatos caller

Another issue is that the async/await approach breaks structured concurrency and loses out on one of its guarantees: That no work is “leaked” when the operation returns (normally or exceptionally). When following structured concurrency, you know that all the work started by a suspend method finished once the method finished. Methods like withContext and coroutineScope very intentionally wait for all of the work to finish before returning in order to preserve the guarantees that you get with structured concurrency which makes code much easier to reason about.

On the flip-side considering memory, while the withContext approach might hold on to references depending on how it’s coded, you know that it’s not allocating anything in the background after it returns exceptionally.

You even potentially lose the guarantee that the code is even run at depending on how myScope is handled. Does your process wait for myScope to finish? If you are on Android for example, launching to a scope is not how to guarantee that code will run, you want to use WorkManager.

I’ll finish with a caution against conflating use cases. If you are trying to solve a “one cook in the kitchen” scenario, you are not talking about a critical section. For such scenarios, a Mutex could work, but you do not want to make such code NonCancellable. If you have some invariants for when a cook “leaves” like turning off the stove then that’s what try-finally block is for.

In general “critical sections” are small, fast, and will not include any suspending calls and it’s fine to wrap them in synchronized blocks even within coroutines (any “blocking” will be negligible). Critical sections ensure that the data is actually fully valid. A critical section will ensure that the stove knob and flame turn on/off together, it won’t make sure the dishes get cleaned after cooking.

If the synchronized code block is not small and fast, then synchronized blocks can actually end up waiting on each other and blocking the coroutine dispatchers from being able to do any work.

I’d take a very careful look at any place you want to use NonCancellable and check if a try-finally block would work instead.

2 Likes

Regarding exceptions, wouldn’t it all work as expected (i.e. exception handler of that scope will be called) if myScope uses a supervisor job?

Another issue is that the async/await approach breaks structured concurrency and loses out on one of its guarantees: That no work is “leaked” when the operation returns (normally or exceptionally).

So one got to choose: Leak the view (NonCancellable) or leak the work (use external scope). As the work is being done in the name of keeping the data consistent, in this case I guess it is an easy choice.

But true, sorry, the metaphor I chose really looks like a job for the WorkManager. Your metaphor

Critical sections ensure that the data is actually fully valid. A critical section will ensure that the stove knob and flame turn on/off together, it won’t make sure the dishes get cleaned after cooking.

is better and is actually what I meant: Ensure that the data that is managed by one class stays consistent.

Hmm, in on my actual code, I get a weak warning now though which make myself uncertain now. (Simplified:)

suspend fun createChangeset(source: String): Long {
    val tags = createChangesetTags(questType, source)
    return scope.async { // **
        val changesetId = withContext(Dispatchers.IO) { mapDataApi.openChangeset(tags) }
        openChangesetsDB.put(OpenChangeset(source, changesetId))
        changesetAutoCloser.enqueue(CLOSE_CHANGESETS_AFTER_INACTIVITY_OF)
        changesetId
    }.await()
}

(Explanation: The info that a changeset has been opened on the server side should be persisted in the local database too, so that section should not be cancellable … yeah I find it really hard to decide which code sections are critical enough to not be cancellable but this is another issue)

IntelliJ now suggests to replace scope.async{ ... }.await() with withContext(scope.coroutineContext) { ... }. But is this still the same? If the job that called this method is cancelled, will the code within that block be executed regardless in both cases? After all, in the former case, the block is not even a child of the job that called this method, or is it?

Edit: Actually, exactly this case is mentioned in the linked article above under the heading “What about something simpler?”. So, weird that IntelliJ outputs a weak warning if it is not the same.

Depends on the code. Let’s say the caller has a CoroutineExceptionHandler in its context and the called code launches coroutines in a supervisorScope. The behavior will differ.

If you have a CoroutineName, that will be lost while the code is executing.

If you are waiting for the result, then it typically you should be using the caller’s context.

Leaked work can cause a resource to be cleaned up before its done being used.

If you don’t want the View to stay alive, then just don’t reference it. Lambdas (even for coroutines) only hold onto the references for instances that they use. If you don’t need the view, then don’t reference it. If you need it for a little bit and then want to let go, you can use a var to capture the view reference and then null the var out when done.

No, they aren’t the same. withContext will merge the two contexts. scope.async does not use the callers CoroutineContext at all (not a suspend method). The IDE just thinks that you are more likely to want one behavior instead of the other so it’s making a recommendation.

That depends on if scope has a Job. GlobalScope for example has no Job, so if contexts merge, the the caller’s Job will be in the resulting CoroutineContext. It can be a little hard to reason about which is why it’s not recommended to switch the Job within a coroutine. withContext creates a new context that overwrites entries in the callers context with the context provided.

Correct

This is not a “critical section”. Consistent != Up-to-date-with-server. Consistent means the data makes sense (even if it’s old). For example, if you reference a DB row, that row should actually exists. You can use your local DB’s transaction API if multiple updates are involved to ensure that your local DB operations are atomic (they either fully complete as one operation or don’t do anything at all). Doing this will keep your database consistent.

I wish I could give some more useful advise along the lines of “Try this” (much more useful advice) instead of just giving details about coroutines methods but I don’t have a great grasp of your use case. It may be worth creating a post detailing your use case and asking how coroutines can best be leveraged.

Maybe you don’t need to wait for the operation at all and sending updates over a Channel to another coroutine would make more sense. Maybe the app is closing so you don’t need these operations to run at all since your app will re-sync the next time it opens. Or maybe you have a use case that the coroutines library simply doesn’t address yet and an additional coroutine helper method would make all the difference to other devs in your same situation.

Yeah you are right. On the example from my actual code, I reverted that decision for that method.

My use case would be… well, I am not sure how much I can simplify it without losing the use case.

I develop an app that downloads OpenStreetMap data, analyzes it, displays places where information is missing in the form of pins (“quests”) on a map. The user can solve these, which results in edits that are subsequently uploaded to OpenStreetMap. (The app’s name is StreetComplete.)

The big picture of the data from downloading, persisting data, merging data with diffs (edits) on that data, creating quests out of them and finally displaying them is here:


But I hope the big picture will not be necessary to understand the use case. Let’s zoom in to MapDataController:
MapDataController

MapDataController manages the OSM data downloaded from the OSM API and notifies all registered observers of any changes (f.e. after a new download finished or changes had been successfully uploaded). The class uses ElementDao and ElementGeometryDao to persist the data, they are simple CRUD stores. The former is the “raw” data from OSM, the latter needs to be created with more or less effort from that data, i.e. the latter implicitly references the first. So, whenever data is inserted or deleted onto MapDataController, the work that needs to be done that may not be cancelled is

  1. tell ElementDao to persist these,
  2. create geometry and tell ElementGeometryDao to persist these and finally
  3. tell all listeners on the data which data exactly has been inserted/deleted/changed (normal observer pattern) after step 1 and 2 are done

Point 3 is still critical because other data is created/updated/deleted triggered by that change of the data, at this point we are still far away from any View-related code. For example, the pins (which are also persisted) need to be created/updated/deleted when a change in the source data is done.

Because of separation of concerns, it is not the duty of MapDataController to

  1. know or care about how its *DAOs persist their data. It doesn’t know that it is a database, and which. So wrapping that operation in a transaction is not applicable. (WayDao does wrap its operations in a transaction)
  2. worry about that whatever work the listeners do is not cancelled. To decide whether that is cancellable and to ensure that it is not is the concern of these. It is just its concern that the updates get dispatched

Here is a code sample from MapDataController. Right now, there is not any measure on how to make this non-cancellable because I didn’t decide on a pattern that I will use for my app so far. As said, I tend towards having a private val scope = CoroutineScope(SupervisorJob()) start the critical work.

    // called by ElementEditsUploader after a change has been uploaded.
    // The updated elements (created/updated/deleted) are passed here
    suspend fun updateAll(elementUpdates: ElementUpdates) = mutex.withLock {
        val mapData = MutableMapData(elementUpdates.updated)
        val geometries = createGeometries(mapData) // suspends, a few moments
        val mapDataWithGeom = MutableMapDataWithGeometry(mapData, geometries)
        val oldElementKeys = elementUpdates.idUpdates.map { ElementKey(it.elementType, it.oldElementId) }
        val deletedElements = elementUpdates.deleted + oldElementKeys

        // TODO CRITICAL SECTION START

        coroutineScope {
            // persisting these can be done in parallel, but for each, first the 
            // things must be deleted, then added
            launch {
                geometryDB.deleteAll(deletedElements)
                geometryDB.putAll(geometries)
            }
            launch {
                elementDB.deleteAll(deletedElements)
                elementDB.putAll(mapData)
            }
        }
        coroutineScope {
            listeners.forEach { 
                launch { it.onUpdated(updated = mapDataWithGeom, deleted = deletedElements ) }
            }
        }
        // TODO CRITICAL SECTION END
    }

And I have many more such situations in the source code. Even when there are not several separate data stores managed by one class that need to be kept in sync, there are often listeners that must be notified of the change that was persisted to database.

Why not? Why can’t you recover later by downloading OpenStreetMap data again?

  1. Kinda seems like MapDataController does care, you just wish it didn’t :wink:. Is there a technical reason not to wrap the whole block in a withTransaction block or something similar (not sure what DAO tech you are using). If you really want these different DOAs to be totally separate, maybe they should be syncing with to OSM separately, instead of through MapDataController. If they do all need to update together, then they aren’t actually decoupled at all no matter how you organize them…
  2. If the only thing you care about is that the update is dispatched, why wait around for it to finish? Just change onUpdated to not suspend and the listener can launch with its own scope if it wants to. Though then you don’t know that everything got updated together … so maybe it really does need to worry?

Why? If you are cancelling, what stops you from resyncing the next time the app opens?

Software is often give and take. Abiding by best practices often comes with overhead. Optimizing code can result in more coupled code or more complex code. The abstractions that make code easier to change, can also add complexity.

Seems like maybe you are in a spot where you want all the data to sync all together as a transaction but that doesn’t fit the more decoupled architecture you’d prefer?

Keep in mind what your source of truth is (guessing it’s the server) and if you ever find your data isn’t in the state you want it to be in, you can recover by reading from your source of truth again.

Why not? Why can’t you recover later by downloading OpenStreetMap data again?

Why? If you are cancelling, what stops you from resyncing the next time the app opens?

Because it affects the internal state of the app. All manner of things might happen if the internal state is not consistent. For example, an updateAll might have deleted some elements. If that change did not propagate to all other data controllers that depend on that data, there may be for example leftover “quests” in storage that reference elements that do not exist anymore. And so on… I don’t really want to get creative here and see what else may all happen if the state becomes inconsistent.

Is there a technical reason not to wrap the whole block in a withTransaction block or something similar

Not really. However, other data controllers use other DAOs that do also interact with a file-based key-value store. So a transaction wouldn’t solve all use cases anyway. Plus, there is point 3 that I mentioned: The listeners must all be notified of the change which is still part of the section that should not be cancelled.

If they do all need to update together, then they aren’t actually decoupled at all no matter how you organize them…

They itself are decoupled. Only for application logic “beyond” MapDataController, they are not. (OpenStreetMap data has no geographic element per se, one needs to construct it out of that data. That is the job of the MapDataController, so the application logic beyond can work with the element+geometry).
For example, the quest pins on the map obviously have to have a position. Where do they get their position from? From the geometry of the OSM element associated with the quest. Edits made are made on an element. But the application logic assumes that elements are not randomly missing from the DB, which might be the case if the sync does not propagate through fully. See the “big picture” from my last post. There is a listener that listens on the listener of the listener, it is a chain of data controllers the data flows through.

If the only thing you care about is that the update is dispatched, why wait around for it to finish?

Not any too strong reason for that. But basically for the same reason I’d rather use

suspend fun cookPotatos(potatos: List<Potato>) = myScope.launch{ ... }.join()

than

fun cookPotatos(potatos: List<Potato>) = myScope.launch { ... }
  1. The caller might be interested in when the whole potato-cooking, or in this case, the uploading of an edit is done, which encompasses all of the upload, parsing of response, persisting of answer, notifying all observers plus whatever work is done by whoever observed this (generation of quest pins etc…). In this case, I know that the caller (the upload service) wants to know, if only for the silly reason to log how long the whole process took and the fact that progress bar is listening on the upload service and wants to be notified when “the whole thing” is through.
  2. for consistency in the interface - rather have all marked as suspending that do suspending work, and for consistency with “mutable non-cancellable method that returns a result” (suspend fun x() = scope.async().await())

Seems like maybe you are in a spot where you want all the data to sync all together as a transaction but that doesn’t fit the more decoupled architecture you’d prefer?

Well before coroutines it was a fit, and I would say it still is. I just need to make sure that certain things (propagation of “my data changed!” notifications) don’t get cancelled.
The source of truth for each data controller is pretty much the data controller to its left in the big picture I posted above. For the MapDataController, it is whatever comes in from the OSM API, for the MapDataWithEditsController, it is whatever comes in from the MapDataController and ElementEditsController, for the OsmQuestController, it is whatever comes in from the MapDataWithEditsController (and NotesWithEditsController). So yes, a re-download of an area would solve any inconsistencies in the data (for that area). I’d just like to avoid inconsistencies in the first place.