Best Practices for parallelizing different types of IO

jmitchem · May 23, 2022, 11:53pm

As far as general design goes, I’m still not 100% sure of the proper solutions for parallelizing IO operations.

At a basic level, we are building a heavyweight cache.

We are pulling data via HTTP requests from external services
We are storing that data in a database
We are retrieving from the database and aggregating the data for our consumers

Each of those steps can be parallelized, but using the same Dispatcher for all of them has proven to be problematic. Typically we will have a few hundred simultaneous requests for external data.

One way to structure this is to run the “get and save” in parallel:

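coroutineScope {
   allRequestParameters.forEach { parameters ->
      // one coroutine per request: fetch, then save that result
      launch {
         val data = getDataFromService(parameters)
         repository.saveData(data)
      }
   }
}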

Another way is to parallelize all of the gets, then save the collected results in a batched database call:

coroutineScope {
   val data = allRequestParameters.map { async { getDataFromService(it) } }.awaitAll().flatten()
   repository.saveData(data)
}

In general, the second is the more performant of the two because of the batched database requests.

However, we’ve run into some problems with a shared Dispatcher, because our repository save has a few async tasks.

// Data is a placeholder for the actual record type.
suspend fun saveData(data: List<Data>) {
   coroutineScope {
      val saveDataTask = async { writeToDatabase(data) }
      val saveMetadataTask = async { writeMetadataToDatabase(data) }
      listOf(saveDataTask, saveMetadataTask).awaitAll()
   }
}

This ends up stalling the database writes for a while, until nearly all of the external service data has come back. And because our consumers share that same dispatcher, they have to wait for all of these to complete before any endpoint can return data.

Because of this, does it make sense to have a different dispatcher for the database tasks than the http client tasks?

Likewise, I’m familiar with the “reader threadpool, writer threadpool” model for the database operations. Does it make sense to have a “reader dispatcher” and a “writer dispatcher”?

Just trying to make sense of the right models to use in Kotlin.

Right now it seems to make sense to use dedicated dispatchers for each of these types:

a dispatcher for external service API calls (possibly a different dispatcher per service)
a database reader dispatcher
a database writer dispatcher

Please let me know if I’m thinking about this completely wrong.

broot · May 24, 2022, 7:04am

Generally speaking, yes. If you know you need to use IO heavily and perform tens or hundreds of blocking operations at the same time, then it makes sense to create a custom thread pool / dispatcher for it. And if you want to isolate the components of your application so that one of them doesn’t affect the performance of another, that means separate dispatchers.
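
As a rough illustration of separate dispatchers, here is a minimal sketch that gives the HTTP fetches and the database write their own fixed thread pools. The pool sizes, the `daemonPool` helper, and the `fetchBlocking` / `writeBlocking` functions are hypothetical stand-ins, not anything from the original question, and the right sizes depend on the actual workload.

```kotlin
import java.util.concurrent.Executors
import kotlinx.coroutines.CoroutineDispatcher
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.withContext

// Hypothetical helper: a fixed pool of daemon threads wrapped as a dispatcher.
// Daemon threads are used here only so the sample JVM can exit without cleanup.
fun daemonPool(name: String, size: Int): CoroutineDispatcher =
    Executors.newFixedThreadPool(size) { r -> Thread(r, name).apply { isDaemon = true } }
        .asCoroutineDispatcher()

// Pool sizes are illustrative assumptions; tune them for the real workload.
val httpDispatcher = daemonPool("http", 64)
val dbWriteDispatcher = daemonPool("db-write", 4)

// Stand-in for a blocking HTTP call to an external service.
fun fetchBlocking(id: Int): List<String> {
    Thread.sleep(10) // simulates network latency
    return listOf("item-$id")
}

// Stand-in for a blocking batched database write; returns rows written.
fun writeBlocking(rows: List<String>): Int {
    Thread.sleep(10) // simulates database latency
    return rows.size
}

suspend fun fetchAndSave(ids: List<Int>): Int = coroutineScope {
    // All fetches run on the HTTP pool and cannot occupy database threads.
    val rows = ids
        .map { id -> async(httpDispatcher) { fetchBlocking(id) } }
        .awaitAll()
        .flatten()
    // The batched write runs on its own small pool.
    withContext(dbWriteDispatcher) { writeBlocking(rows) }
}

fun main() {
    val saved = runBlocking { fetchAndSave((1..100).toList()) }
    println("saved $saved rows")
}
```

With this shape, a burst of a few hundred fetches can only queue up behind other fetches; the write pool stays available for whatever batch is ready.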

You can also look at limitParallelism(). Or you can create a queue of tasks explicitly with channels to have more control over queuing and execution process.
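
For reference, a minimal sketch of `limitParallelism()`: it creates a view over an existing dispatcher that runs at most the given number of coroutines at once, without allocating a new thread pool. The limits and the names `httpView`, `dbWriteView`, and `maxConcurrentWrites` are assumptions made up for this example. Note that older kotlinx.coroutines releases mark `limitParallelism` as experimental, so it may need an opt-in there.

```kotlin
import java.util.concurrent.atomic.AtomicInteger
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope
import kotlinx.coroutines.runBlocking

// Views over Dispatchers.IO; the limits are illustrative assumptions.
// httpView is shown only to illustrate having one view per kind of work.
val httpView = Dispatchers.IO.limitParallelism(32)
val dbWriteView = Dispatchers.IO.limitParallelism(4)

// Runs `count` simulated blocking writes on dbWriteView and returns the
// highest number of writes observed running at the same time.
suspend fun maxConcurrentWrites(count: Int): Int = coroutineScope {
    val running = AtomicInteger(0)
    val peak = AtomicInteger(0)
    List(count) {
        async(dbWriteView) {
            val now = running.incrementAndGet()
            peak.updateAndGet { prev -> maxOf(prev, now) }
            Thread.sleep(20) // stands in for a blocking database write
            running.decrementAndGet()
        }
    }.awaitAll()
    peak.get()
}

fun main() {
    val peak = runBlocking { maxConcurrentWrites(50) }
    // The view allows at most 4 of these writes to run simultaneously.
    println("peak concurrent writes: $peak")
}
```

All 50 writes are started immediately, but the view queues them so that no more than four execute at a time. HTTP work launched on `httpView` is capped by its own limit rather than competing for those four slots.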

But I don’t think there is a universal solution for this kind of problem. Your case is not trivial, and it requires custom solutions, experimenting, and fine-tuning for your specific workload.

jmitchem · May 24, 2022, 11:36pm

Thank you for the answer, and the additional pointers for research. limitParallelism is especially interesting.

Many years ago, I was working exclusively in the .NET world with Tasks and the Task Parallel Library. I didn’t have much visibility under the hood, but I don’t remember needing to think in terms of Dispatchers like I’ve needed to in Kotlin. The problem space was somewhat different though; I was limited more by disk IO than network operations. Some of these problems are new and unexpected.
