Inlining Arrays of Lambdas

Suppose you want to sort a list of objects by certain values in order:

data class Data(val a: Boolean, val b: Boolean, val c: Boolean, val d: Int)

The easiest way to do it is to just list the properties you want to sort by in lambdas:

val comparatorLambdas = compareBy<Data>({ it.a }, { it.b }, { it.c }, { it.d })

Alternatively, you could list the properties in a compareBy/thenBy chain:

val comparatorChain = compareBy<Data> { it.a }
            .thenBy { it.b }
            .thenBy { it.c }
            .thenBy { it.d }

Or, you could construct the comparator entirely manually:

private inline fun Int.ifDiff(onDiff: (Int) -> Unit) = if (this != 0) onDiff(this) else Unit

val comparatorManual = object : Comparator<Data> {
        override fun compare(o1: Data, o2: Data): Int {
            o1.a.compareTo(o2.a).ifDiff { return it }
            o1.b.compareTo(o2.b).ifDiff { return it }
            o1.c.compareTo(o2.c).ifDiff { return it }
            o1.d.compareTo(o2.d).let { return it }
        }
    }

How efficient is each comparator? Let’s look at a benchmark of using each one to sort an array of 100,000 elements:

Benchmark                                   (N)  Mode  Cnt   Score   Error  Units
ComparatorBenchmarkHelper.sortChain      100000  avgt   10  37.531 ± 1.532  ms/op
ComparatorBenchmarkHelper.sortManual  100000  avgt   10  26.763 ± 0.257  ms/op
ComparatorBenchmarkHelper.sortLambdas    100000  avgt   10  93.089 ± 3.965  ms/op

We see that sortLambdas is incredibly slow compared to sortChain and sortManual, with sortManual being clearly the fastest. Why? Because sortChain creates nested Comparators, leading to multiple virtual calls, and sortLambdas has an array of lambdas, leading to an iteration as well as multiple virtual calls.

This would be much improved if it were possible to inline a vararg of lambdas. Take the above compareBy call, which uses this method:

public fun <T> compareBy(vararg selectors: (T) -> Comparable<*>?): Comparator<T> {
    require(selectors.size > 0)
    return Comparator { a, b -> compareValuesByImpl(a, b, selectors) }
}

Inlining compareValuesByImpl, we would get:

public fun <T> compareBy(vararg selectors: (T) -> Comparable<*>?): Comparator<T> {
    require(selectors.size > 0)
    return Comparator { a, b ->
        for (fn in selectors) {
            val v1 = fn(a)
            val v2 = fn(b)
            val diff = compareValues(v1, v2)
            if (diff != 0) return diff
        }
        return 0
    }
}

If we were able to inline selectors, a compiler would be able to resolve the require at compilation time, and from there, it could unroll the loop and inline each lambda, effectively becoming comparatorManual (except that compareValues also does unnecessary null checks, which an inlining compiler could know to eliminate on non-null values).

(This is a bit similar to my prior topic on Sequences and Inlined Lambdas, and would be one of the optimizations that its concepts would create. This one happens to be more immediately apparent, as a Comparator can easily become a major part of code execution when sorting large lists, so it’s important that they be as efficient as possible, and existing vararg compareBy as simply intolerable performance.)

Hi, @derektom14,
it looks interesting to me, can you provide a reproducer using different data size?

Maybe it is possible to change the actual compareBy with something like:

fun <T> compareBy(vararg selectors: (T) -> Comparable<*>?): Comparator<T> =
        selectors.drop(1).fold(compareBy(selectors.first())) { c, s -> c.thenBy(s) }

What do you think?

That method is effectively the same as comparatorChain, except that you’re no longer inlining the lambdas, so the performance is instead the same as comparatorLambdas:

Benchmark                                    (N)  Mode  Cnt   Score   Error  Units
ComparatorBenchmarkHelper.sortChain       100000  avgt   10  38.646 ± 2.322  ms/op
ComparatorBenchmarkHelper.sortEfficient   100000  avgt   10  28.239 ± 2.376  ms/op
ComparatorBenchmarkHelper.sortLambdas     100000  avgt   10  87.527 ± 4.053  ms/op
ComparatorBenchmarkHelper.sortLambdasAlt  100000  avgt   10  84.818 ± 4.888  ms/op

I can add different sizes, but I don’t expect it to tell much of a different story, as each run will still take roughly n log(n) comparisons, so all of the times should increase while maintaining the same relative values.

Here are the benchmarks with varying values of N:

Benchmark                                   (N)  Mode  Cnt   Score    Error  Units
ComparatorBenchmarkHelper.sortChain        1000  avgt   10   0.223 ±  0.003  ms/op
ComparatorBenchmarkHelper.sortChain       10000  avgt   10   2.972 ±  0.035  ms/op
ComparatorBenchmarkHelper.sortChain      100000  avgt   10  38.040 ±  1.075  ms/op
ComparatorBenchmarkHelper.sortEfficient    1000  avgt   10   0.165 ±  0.004  ms/op
ComparatorBenchmarkHelper.sortEfficient   10000  avgt   10   2.176 ±  0.036  ms/op
ComparatorBenchmarkHelper.sortEfficient  100000  avgt   10  27.268 ±  0.215  ms/op
ComparatorBenchmarkHelper.sortLambdas      1000  avgt   10   0.470 ±  0.043  ms/op
ComparatorBenchmarkHelper.sortLambdas     10000  avgt   10   6.672 ±  0.736  ms/op
ComparatorBenchmarkHelper.sortLambdas    100000  avgt   10  99.819 ± 22.008  ms/op

The order remains clear.

Hi, @derektom14,
did you understand why sortLambdasAlt is so different from sortChain.
What you effectively measured?
Can you share your code?

The reason is due to inlining. The chain relies on two methods, compareBy and thenBy:

@kotlin.internal.InlineOnly
public inline fun <T> compareBy(crossinline selector: (T) -> Comparable<*>?): Comparator<T> =
    Comparator { a, b -> compareValuesBy(a, b, selector) }

@kotlin.internal.InlineOnly
public inline fun <T> Comparator<T>.thenBy(crossinline selector: (T) -> Comparable<*>?): Comparator<T> =
    Comparator { a, b ->
        val previousCompare = this@thenBy.compare(a, b)
        if (previousCompare != 0) previousCompare else compareValuesBy(a, b, selector)
    }

When the lambdas are inlined, their invocations no longer require a virtual call.

However, in your alternative, you are using a vararg of lambdas, and as the topic of this thread points out, those can’t be inlined. Therefore, you end up creating multiple lambda objects, and for each step in the chain, a virtual call must be made by the selector for each parameter.

This exact phenomenon is covered in this Medium article.