Summing Arrays faster when split across coroutines

benjaminhill · June 10, 2020, 4:53pm

Summing an array is 1/3rd faster when splitting the work across coroutines. I did not expect this result.

I’m adding up values from a decoded image - lots of pixel RGB values stuffed in a large IntArray. I started to wonder if it would be faster to handle each “channel” (step 3) as a separate coroutine to run the summing in parallel.

Much to my surprise, it was. Unless I’m measuring it wrong, which is very possible. But if this is correct, this makes me wonder if there is a better way of doing math on an array that auto-splits to the optimal number of coroutines.

import kotlinx.coroutines.*
import kotlin.random.Random
import kotlin.system.measureNanoTime

const val NUM_TO_AVG = 500
const val LOOPS = 10
const val MAX_INT = 5_000
const val ARR_SIZE = 1080*720*3 // An image worth of pixels
const val IGNORE_FIRST = 3

fun main()  {
    runBlocking(Dispatchers.Default) {
        launch {
            val linearTime = arraySumTest(true)
            println("Linear: $linearTime")
        }
        launch {
            val cTime = arraySumTest(false)
            println("CoroutineTime: $cTime")
        }
    }
}

suspend fun arraySumTest(isLinear:Boolean): Int {
    val arr = IntArray(ARR_SIZE) {
        Random.nextInt(MAX_INT)
    }

    val target = IntArray(arr.size)
    val times = mutableListOf<Long>()

    repeat(NUM_TO_AVG) {
        val time = measureNanoTime {
            repeat(LOOPS) {
                if(isLinear) {
                    for(i in target.indices step 3) {
                        target[i] += arr[i]
                        target[i+1] += arr[i+1]
                        target[i+2] += arr[i+2]
                    }
                } else {
                    coroutineScope {
                        launch {
                            for(i in target.indices step 3) {
                                target[i] += arr[i]
                            }
                        }
                        launch {
                            for(i in target.indices step 3) {
                                target[i+1] += arr[i+1]
                            }
                        }
                        launch {
                            for(i in target.indices step 3) {
                                target[i+2] += arr[i+2]
                            }
                        }
                    }
                }
            }
        }
        times.add(time)
    }
    return (times.drop(IGNORE_FIRST).average()/1_000).toInt()
}

darksnake · June 10, 2020, 6:34pm

Well, you have multi-threaded code versus single-threaded. It CAN BE faster (not always though). The parallel collection processing is actually not a favorite task for coroutines (it is parallel, but synchronous). The best tool for such tasks is Java parallel stream. It is also quite kotlin-friendly. Parallel stream automatically utilizes your current CPU to its maximum capacity.

fvasco · June 11, 2020, 7:56am

Please consider to use JMH to produce a good measure.

You concern regarding sum performances, but the code adds two on each one

for(i in target.indices step 3) {
  target[i] += arr[i] // 1 add
  target[i+1] += arr[i+1] // 3 add
  target[i+2] += arr[i+2] // 3 add
}

try

for(i in target.indices) {
  target[i] += arr[i]
}

Finally, your code puts too much contention on same memory regions, please consider to start a coroutine to sum a partition of 64K continuous elements of target.

Topic		Replies	Views
Difficulty seeing speedup in using coroutines Support	12	180	March 19, 2025
Corutines context speed	0	989	January 9, 2018
Zippy parallel vector additions in Kotlin (seems like coroutines + locking = slower) Support	1	835	November 20, 2018
Why does is this code executed faster in a thread pool with 1 thread compared to one with more Support	1	1194	August 12, 2021
Understanding Kotlin's coroutines performance Native	5	4514	March 4, 2025

Summing Arrays faster when split across coroutines

Related topics