Why do extra consumers considerably slow down or speed up processing over channels

Originally, I noticed that in a single producer, multiple consumer scenario, performance in Kotlin using buffered channels seems mostly fine compared to some other languages, but disappoints in case the jobs handed to consumers turn out to be trivial. Obviously, performance is not going to be brilliant if you delegated only trivial jobs to consumers (shouldn’t have bothered to introduce channels and coroutines). There should be some overhead the more consumers are standing by idly. With unbuffered channels, that seemed better.

But taking a closer look, I get reproducible results but can’t get a grip on why. Here is an example benchmark (full project at github):

import kotlinx.coroutines.*
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce

class CoroutinesSingleProducerDemo {
    private data class Job(
        val id: Int,
    )

    @OptIn(ExperimentalCoroutinesApi::class)
    private fun CoroutineScope.produceJobs(jobs: Int, capacity: Int): ReceiveChannel<Job> =
        produce(capacity = capacity) {
            (1..jobs).forEach {
                send(Job(it))
            }
        }

    private fun CoroutineScope.launchWorker(
        id: Int,
        channel: ReceiveChannel<Job>,
        load: Int,
        logMain: (msg: String) -> Unit,
        logDetail: (msg: String) -> Unit
    ) = launch(Dispatchers.Default) {
        logMain("worker $id signing on")
        var handled = 0
        for (job in channel) {
            repeat(load) { logDetail("worker $id taking on job ${job.id}") }
            handled += 1
        }
        logMain("worker $id handled $handled jobs")
    }

    fun explore(
        load: Int,
        jobs: Int,
        capacity: Int,
        workers: Int,
        logMain: (msg: String) -> Unit,
        logDetail: (msg: String) -> Unit
    ) {
        runBlocking {
            val channel = produceJobs(jobs = jobs, capacity = capacity)
            (1..workers).forEach { launchWorker(it, channel, load, logMain, logDetail) }
        }
    }
}

On my system, this says something like:

(capacity)  (jobs)  (load)  (workers)  Mode  Cnt     Score     Error  Units
         0  100000       1          1  avgt    5  1301.509 ± 268.491  ms/op
         0  100000       1          2  avgt    5   735.208 ±  20.198  ms/op
         0  100000       1          3  avgt    5   353.848 ±  30.070  ms/op
         0  100000       1          4  avgt    5   171.408 ±   2.790  ms/op
         0  100000       1          5  avgt    5   160.317 ±   5.282  ms/op
         0  100000       1          6  avgt    5   189.555 ±  16.053  ms/op
         0  100000       1          7  avgt    5   192.795 ±  21.723  ms/op
         0  100000       1          8  avgt    5   191.741 ±  13.037  ms/op
         0  100000       1          9  avgt    5   191.065 ±   3.491  ms/op
         0  100000       1         33  avgt    5   200.666 ±   7.975  ms/op
         0  100000       1         99  avgt    5   189.293 ±   7.035  ms/op
         0  100000     100          1  avgt    5  1351.077 ±  49.029  ms/op
         0  100000     100          2  avgt    5   905.740 ±  12.169  ms/op
         0  100000     100          3  avgt    5   588.280 ±  31.962  ms/op
         0  100000     100          4  avgt    5   490.056 ±  99.719  ms/op
         0  100000     100          5  avgt    5   250.423 ±   7.781  ms/op
         0  100000     100          6  avgt    5   267.402 ±   8.331  ms/op
         0  100000     100          7  avgt    5   235.778 ±  22.794  ms/op
         0  100000     100          8  avgt    5   219.271 ±   2.810  ms/op
         0  100000     100          9  avgt    5   204.271 ±  20.159  ms/op
         0  100000     100         33  avgt    5    81.621 ±   4.595  ms/op
         0  100000     100         99  avgt    5    58.280 ±   1.373  ms/op
        64  100000       1          1  avgt    5    28.298 ±   4.045  ms/op
        64  100000       1          2  avgt    5    20.843 ±   0.554  ms/op
        64  100000       1          3  avgt    5    42.737 ±  11.823  ms/op
        64  100000       1          4  avgt    5    88.495 ±  16.030  ms/op
        64  100000       1          5  avgt    5   123.849 ±  42.544  ms/op
        64  100000       1          6  avgt    5   173.594 ±   8.626  ms/op
        64  100000       1          7  avgt    5   170.264 ±  39.190  ms/op
        64  100000       1          8  avgt    5   198.890 ± 139.335  ms/op
        64  100000       1          9  avgt    5   177.951 ±   9.137  ms/op
        64  100000       1         33  avgt    5   201.769 ±  46.395  ms/op
        64  100000       1         99  avgt    5   193.591 ±  14.675  ms/op
        64  100000     100          1  avgt    5   245.485 ±   2.611  ms/op
        64  100000     100          2  avgt    5   115.320 ±   0.479  ms/op
        64  100000     100          3  avgt    5    74.766 ±   0.434  ms/op
        64  100000     100          4  avgt    5    60.157 ±   9.908  ms/op
        64  100000     100          5  avgt    5    51.021 ±   4.842  ms/op
        64  100000     100          6  avgt    5    62.580 ±   3.875  ms/op
        64  100000     100          7  avgt    5    66.906 ±   7.133  ms/op
        64  100000     100          8  avgt    5    66.110 ±   4.750  ms/op
        64  100000     100          9  avgt    5    64.027 ±   2.421  ms/op
        64  100000     100         33  avgt    5    59.401 ±   2.280  ms/op
        64  100000     100         99  avgt    5    55.285 ±   1.774  ms/op

We have a increasing number of workers (consumers) in 4 sections: unbuffered (capacity 0) or buffered (capacity 64), tiny load or 100-fold load per job. The first section (tiny load over an unbuffered channel) surprisingly looks just like mama said: on my 6 core CPU, the fastest and most stable choice is 1 producer and 5 consumers. Too few consumers hurts badly, too many consumers doesn’t matter much.

In section 2, we intensify the load on the workers. At first sight, logic rules: it take more time to do more work, and 5 workers is still optimal. But adding lots of workers skyrockets execution. Compared with section 1, if you put 99 coroutines at work, they’ll finish 3 times earlier if you ask them to do 100 times more work. What?

Section 3, buffered channel and tiny load per job. I take no issue with the fact it’s faster than unbuffered, despite the channel buffer management that’s going on, and that the ideal number of consumers seems to be 2: because the work is trivial and because of all kinds of synchronization interference. But why, when we bring in up to 99 workers, does it take 10× longer? The number interactions with the channel stays the same (and that’s almost all there is to do here). Sure, there is probably more context switching, because there is more context; but if that made such a difference, why does that not happen in the unbuffered case (and for buffer capacity 1, which I left out here)?

Section 4 repeats the surprise of section 2: with many workers, doing lots more work means we’re many times faster to the finish line.

What does that other coroutine contender Go say about this? If we `go run coroutines-spmc-benchmark.go`, I get:

100000 jobs, capacity  0,  1 workers:    59 ms
100000 jobs, capacity  0,  1 workers:    49 ms
100000 jobs, capacity  0,  1 workers:    50 ms
100000 jobs, capacity  0,  5 workers:    52 ms
100000 jobs, capacity  0,  5 workers:    50 ms
100000 jobs, capacity  0,  5 workers:    50 ms
100000 jobs, capacity  0, 33 workers:    55 ms
100000 jobs, capacity  0, 33 workers:    56 ms
100000 jobs, capacity  0, 33 workers:    56 ms
100000 jobs, capacity  0, 99 workers:    54 ms
100000 jobs, capacity  0, 99 workers:    53 ms
100000 jobs, capacity  0, 99 workers:    54 ms
100000 jobs, capacity 64,  1 workers:    25 ms
100000 jobs, capacity 64,  1 workers:    25 ms
100000 jobs, capacity 64,  1 workers:    25 ms
100000 jobs, capacity 64,  5 workers:    27 ms
100000 jobs, capacity 64,  5 workers:    28 ms
100000 jobs, capacity 64,  5 workers:    29 ms
100000 jobs, capacity 64, 33 workers:    37 ms
100000 jobs, capacity 64, 33 workers:    38 ms
100000 jobs, capacity 64, 33 workers:    36 ms
100000 jobs, capacity 64, 99 workers:    46 ms
100000 jobs, capacity 64, 99 workers:    44 ms
100000 jobs, capacity 64, 99 workers:    45 ms

This roughly corresponds to the Kotlin benchmark for a tiny load, so section 1 and 3. It paints a similar picture, but is a lot more reasonable for buffered channels: instead of a 10× increase due to idle workers, it’s just 1.5×.

Of course, I’m fairly new to Kotlin and could be doing something wrong. But right now, it suggests to me that Kotlin coroutines that are not busy working, keep themselves amused by blocking the door to the employment office that is a buffered channel.

Curious if there’s any change if you add a dispatcher to that produce() call, since by default it will run in the single-threaded dispatcher provided by runBlocking.

Wow, thanks. I wasn’t really sure what you meant because neither the function produce nor its first argument seemed related to scheduling, but the effect of passing Dispatchers.Default is spectacular:

(capacity)  (jobs)  (load)  (workers)  Mode  Cnt    Score    Error  Units
         0  100000       1          1  avgt    5   24.988 ±  0.190  ms/op
         0  100000       1          2  avgt    5   25.997 ±  0.620  ms/op
         0  100000       1          3  avgt    5   27.082 ±  2.399  ms/op
         0  100000       1          4  avgt    5   25.873 ±  0.501  ms/op
         0  100000       1          5  avgt    5   27.050 ±  1.724  ms/op
         0  100000       1          6  avgt    5   26.300 ±  0.327  ms/op
         0  100000       1          7  avgt    5   26.877 ±  0.737  ms/op
         0  100000       1          8  avgt    5   27.118 ±  0.594  ms/op
         0  100000       1          9  avgt    5   26.841 ±  0.526  ms/op
         0  100000       1         33  avgt    5   31.419 ±  1.256  ms/op
         0  100000       1         99  avgt    5   38.681 ±  1.132  ms/op
         0  100000     100          1  avgt    5  212.395 ±  2.121  ms/op
         0  100000     100          2  avgt    5  220.844 ±  0.518  ms/op
         0  100000     100          3  avgt    5  203.035 ±  3.039  ms/op
         0  100000     100          4  avgt    5  204.474 ± 19.840  ms/op
         0  100000     100          5  avgt    5  179.832 ± 21.373  ms/op
         0  100000     100          6  avgt    5  160.270 ± 16.560  ms/op
         0  100000     100          7  avgt    5  153.621 ± 11.969  ms/op
         0  100000     100          8  avgt    5  153.609 ±  7.570  ms/op
         0  100000     100          9  avgt    5  143.107 ± 28.048  ms/op
         0  100000     100         33  avgt    5  102.409 ± 32.073  ms/op
         0  100000     100         99  avgt    5   87.978 ± 13.828  ms/op
        64  100000       1          1  avgt    5   10.928 ±  0.201  ms/op
        64  100000       1          2  avgt    5   11.507 ±  0.342  ms/op
        64  100000       1          3  avgt    5   11.251 ±  0.164  ms/op
        64  100000       1          4  avgt    5   11.520 ±  0.795  ms/op
        64  100000       1          5  avgt    5   11.728 ±  0.276  ms/op
        64  100000       1          6  avgt    5   12.036 ±  0.398  ms/op
        64  100000       1          7  avgt    5   13.363 ±  6.404  ms/op
        64  100000       1          8  avgt    5   12.406 ±  0.221  ms/op
        64  100000       1          9  avgt    5   12.302 ±  0.519  ms/op
        64  100000       1         33  avgt    5   16.352 ±  1.484  ms/op
        64  100000       1         99  avgt    5   24.926 ±  2.987  ms/op
        64  100000     100          1  avgt    5  210.183 ± 33.096  ms/op
        64  100000     100          2  avgt    5  201.726 ±  5.285  ms/op
        64  100000     100          3  avgt    5  143.372 ± 32.118  ms/op
        64  100000     100          4  avgt    5  138.012 ± 28.948  ms/op
        64  100000     100          5  avgt    5  120.908 ± 15.908  ms/op
        64  100000     100          6  avgt    5  108.676 ± 26.958  ms/op
        64  100000     100          7  avgt    5  114.298 ± 14.999  ms/op
        64  100000     100          8  avgt    5  108.422 ± 12.115  ms/op
        64  100000     100          9  avgt    5  103.514 ± 23.840  ms/op
        64  100000     100         33  avgt    5   88.609 ± 12.738  ms/op
        64  100000     100         99  avgt    5   69.381 ±  8.928  ms/op

Seems like it really likes a lot of coroutines, but otherwise it looks entirely reasonable now. I would never have realized that scheduling a single producer would make any difference.