Originally, I noticed that in a single producer, multiple consumer scenario, performance in Kotlin using buffered channels seems mostly fine compared to some other languages, but disappoints in case the jobs handed to consumers turn out to be trivial. Obviously, performance is not going to be brilliant if you delegated only trivial jobs to consumers (shouldn’t have bothered to introduce channels and coroutines). There should be some overhead the more consumers are standing by idly. With unbuffered channels, that seemed better.
But taking a closer look, I get reproducible results but can’t get a grip on why. Here is an example benchmark (full project at github):
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce
class CoroutinesSingleProducerDemo {
private data class Job(
val id: Int,
)
@OptIn(ExperimentalCoroutinesApi::class)
private fun CoroutineScope.produceJobs(jobs: Int, capacity: Int): ReceiveChannel<Job> =
produce(capacity = capacity) {
(1..jobs).forEach {
send(Job(it))
}
}
private fun CoroutineScope.launchWorker(
id: Int,
channel: ReceiveChannel<Job>,
load: Int,
logMain: (msg: String) -> Unit,
logDetail: (msg: String) -> Unit
) = launch(Dispatchers.Default) {
logMain("worker $id signing on")
var handled = 0
for (job in channel) {
repeat(load) { logDetail("worker $id taking on job ${job.id}") }
handled += 1
}
logMain("worker $id handled $handled jobs")
}
fun explore(
load: Int,
jobs: Int,
capacity: Int,
workers: Int,
logMain: (msg: String) -> Unit,
logDetail: (msg: String) -> Unit
) {
runBlocking {
val channel = produceJobs(jobs = jobs, capacity = capacity)
(1..workers).forEach { launchWorker(it, channel, load, logMain, logDetail) }
}
}
}
On my system, this says something like:
(capacity) (jobs) (load) (workers) Mode Cnt Score Error Units
0 100000 1 1 avgt 5 1301.509 ± 268.491 ms/op
0 100000 1 2 avgt 5 735.208 ± 20.198 ms/op
0 100000 1 3 avgt 5 353.848 ± 30.070 ms/op
0 100000 1 4 avgt 5 171.408 ± 2.790 ms/op
0 100000 1 5 avgt 5 160.317 ± 5.282 ms/op
0 100000 1 6 avgt 5 189.555 ± 16.053 ms/op
0 100000 1 7 avgt 5 192.795 ± 21.723 ms/op
0 100000 1 8 avgt 5 191.741 ± 13.037 ms/op
0 100000 1 9 avgt 5 191.065 ± 3.491 ms/op
0 100000 1 33 avgt 5 200.666 ± 7.975 ms/op
0 100000 1 99 avgt 5 189.293 ± 7.035 ms/op
0 100000 100 1 avgt 5 1351.077 ± 49.029 ms/op
0 100000 100 2 avgt 5 905.740 ± 12.169 ms/op
0 100000 100 3 avgt 5 588.280 ± 31.962 ms/op
0 100000 100 4 avgt 5 490.056 ± 99.719 ms/op
0 100000 100 5 avgt 5 250.423 ± 7.781 ms/op
0 100000 100 6 avgt 5 267.402 ± 8.331 ms/op
0 100000 100 7 avgt 5 235.778 ± 22.794 ms/op
0 100000 100 8 avgt 5 219.271 ± 2.810 ms/op
0 100000 100 9 avgt 5 204.271 ± 20.159 ms/op
0 100000 100 33 avgt 5 81.621 ± 4.595 ms/op
0 100000 100 99 avgt 5 58.280 ± 1.373 ms/op
64 100000 1 1 avgt 5 28.298 ± 4.045 ms/op
64 100000 1 2 avgt 5 20.843 ± 0.554 ms/op
64 100000 1 3 avgt 5 42.737 ± 11.823 ms/op
64 100000 1 4 avgt 5 88.495 ± 16.030 ms/op
64 100000 1 5 avgt 5 123.849 ± 42.544 ms/op
64 100000 1 6 avgt 5 173.594 ± 8.626 ms/op
64 100000 1 7 avgt 5 170.264 ± 39.190 ms/op
64 100000 1 8 avgt 5 198.890 ± 139.335 ms/op
64 100000 1 9 avgt 5 177.951 ± 9.137 ms/op
64 100000 1 33 avgt 5 201.769 ± 46.395 ms/op
64 100000 1 99 avgt 5 193.591 ± 14.675 ms/op
64 100000 100 1 avgt 5 245.485 ± 2.611 ms/op
64 100000 100 2 avgt 5 115.320 ± 0.479 ms/op
64 100000 100 3 avgt 5 74.766 ± 0.434 ms/op
64 100000 100 4 avgt 5 60.157 ± 9.908 ms/op
64 100000 100 5 avgt 5 51.021 ± 4.842 ms/op
64 100000 100 6 avgt 5 62.580 ± 3.875 ms/op
64 100000 100 7 avgt 5 66.906 ± 7.133 ms/op
64 100000 100 8 avgt 5 66.110 ± 4.750 ms/op
64 100000 100 9 avgt 5 64.027 ± 2.421 ms/op
64 100000 100 33 avgt 5 59.401 ± 2.280 ms/op
64 100000 100 99 avgt 5 55.285 ± 1.774 ms/op
We have a increasing number of workers (consumers) in 4 sections: unbuffered (capacity 0) or buffered (capacity 64), tiny load or 100-fold load per job. The first section (tiny load over an unbuffered channel) surprisingly looks just like mama said: on my 6 core CPU, the fastest and most stable choice is 1 producer and 5 consumers. Too few consumers hurts badly, too many consumers doesn’t matter much.
In section 2, we intensify the load on the workers. At first sight, logic rules: it take more time to do more work, and 5 workers is still optimal. But adding lots of workers skyrockets execution. Compared with section 1, if you put 99 coroutines at work, they’ll finish 3 times earlier if you ask them to do 100 times more work. What?
Section 3, buffered channel and tiny load per job. I take no issue with the fact it’s faster than unbuffered, despite the channel buffer management that’s going on, and that the ideal number of consumers seems to be 2: because the work is trivial and because of all kinds of synchronization interference. But why, when we bring in up to 99 workers, does it take 10× longer? The number interactions with the channel stays the same (and that’s almost all there is to do here). Sure, there is probably more context switching, because there is more context; but if that made such a difference, why does that not happen in the unbuffered case (and for buffer capacity 1, which I left out here)?
Section 4 repeats the surprise of section 2: with many workers, doing lots more work means we’re many times faster to the finish line.
What does that other coroutine contender Go say about this? If we `go run coroutines-spmc-benchmark.go`, I get:
100000 jobs, capacity 0, 1 workers: 59 ms
100000 jobs, capacity 0, 1 workers: 49 ms
100000 jobs, capacity 0, 1 workers: 50 ms
100000 jobs, capacity 0, 5 workers: 52 ms
100000 jobs, capacity 0, 5 workers: 50 ms
100000 jobs, capacity 0, 5 workers: 50 ms
100000 jobs, capacity 0, 33 workers: 55 ms
100000 jobs, capacity 0, 33 workers: 56 ms
100000 jobs, capacity 0, 33 workers: 56 ms
100000 jobs, capacity 0, 99 workers: 54 ms
100000 jobs, capacity 0, 99 workers: 53 ms
100000 jobs, capacity 0, 99 workers: 54 ms
100000 jobs, capacity 64, 1 workers: 25 ms
100000 jobs, capacity 64, 1 workers: 25 ms
100000 jobs, capacity 64, 1 workers: 25 ms
100000 jobs, capacity 64, 5 workers: 27 ms
100000 jobs, capacity 64, 5 workers: 28 ms
100000 jobs, capacity 64, 5 workers: 29 ms
100000 jobs, capacity 64, 33 workers: 37 ms
100000 jobs, capacity 64, 33 workers: 38 ms
100000 jobs, capacity 64, 33 workers: 36 ms
100000 jobs, capacity 64, 99 workers: 46 ms
100000 jobs, capacity 64, 99 workers: 44 ms
100000 jobs, capacity 64, 99 workers: 45 ms
This roughly corresponds to the Kotlin benchmark for a tiny load, so section 1 and 3. It paints a similar picture, but is a lot more reasonable for buffered channels: instead of a 10× increase due to idle workers, it’s just 1.5×.
Of course, I’m fairly new to Kotlin and could be doing something wrong. But right now, it suggests to me that Kotlin coroutines that are not busy working, keep themselves amused by blocking the door to the employment office that is a buffered channel.