⁠inline vararg

I made a small benchmark to check all ways to populate an arraylist. Those are the results (code below):

Test #1: 416 ms
Test #2: 380 ms
Test #3: 152 ms

Test #3 (manually adding the values to an arraylist) is almost three times faster than kotlins dedicated arrayListOf function.

The arrayListOf function could be improved like that:

inline fun <T> arrayListOf(inline vararg content: T): ArrayList<T> {
    return ArrayList<T>(content.size).apply { content.forEach { add(it) } }
}

The compiler can count the arguments fed into the function and determine content.size. The forEach statement will be compiled into 10 ArrayList.add statements. This will reduce overhead.

Benchmark:

import kotlin.system.measureTimeMillis

fun main() {
    val millis1 = measureTimeMillis {
        repeat(5000000) {
            arrayListOf(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
        }
    }
    println("Test #1: $millis1 ms")

    val millis2 = measureTimeMillis {
        repeat(5000000) {
            val list = ArrayList<Int>()
            list.add(1)
            list.add(2)
            list.add(3)
            list.add(4)
            list.add(5)
            list.add(6)
            list.add(7)
            list.add(8)
            list.add(9)
            list.add(10)
        }
    }
    println("Test #2: $millis2 ms")

    val millis3 = measureTimeMillis {
        repeat(5000000) {
            val list = ArrayList<Int>(10)
            list.add(1)
            list.add(2)
            list.add(3)
            list.add(4)
            list.add(5)
            list.add(6)
            list.add(7)
            list.add(8)
            list.add(9)
            list.add(10)
        }
    }
    println("Test #3: $millis3 ms")
}

On which platform you get these results?

Are you considered to write a good benchmark using a test framework?

How differents solutions performs on different sizes?

2 Likes

I get this result on the JVM.
The Test #3 is always faster, but more so on smaller sizes. The difference with 20 elements is rather small.
I don’t know about the “good benchmark” yet. But since inlining would get rid of the if statement and get rid of the array overhead, it’s quite obvious.

Ok, don’t worry.

Write a good test is hard, on JVM it is really hard.
I can read your test: “the first block is slower than last”, this is often true on JVM.

If you want to write a valuable test on JVM then you have to consider JMH.

See my comments here:

1 Like

I’ll try JMH once I got some time. Will it do some warmup before the benchmark itself?

Yes, it does.
You have to consider 8 minutes for each test, during you cannot use your PC.

1 Like

Done testing. Program: https://pastebin.com/ZZBM0isp
Output: https://pastebin.com/nLK9CT32

Great work @H4xX0r1337!

Are you sure? Proof it :wink:

Yes,
you have to test your own proposal, you missed it.

I was talking about language design. My proposal was introducing inline varargs. Everything done with them would be compiled into multiple statements. It would result in the same compiled code like test #3, which is the most performant.

A loop unrolling performed at compile time, this will produce a larger code size, that can reduce speed on larger functions.

A macro can implement your suggestion, but it is unsupported.

But that’s still better than creating an array with then get’s destroyed only to populate an ArrayList. Nobody will feed thousands of elements in arrayListOf.

You have still yet to prove that unrolling is in fact faster using benchmarks. arrayListOf is implemented quite differently from your unrolled version, which means that your previous benchmark is not sufficent.

Yes, I agree.
It is true for small dataset.

Did you even look at the benchmark?

Yes. You are comparing

val list = ArrayList<Int>()
list.add(1)
list.add(2)
list.add(3)
list.add(4)
list.add(5)
list.add(6)
list.add(7)
list.add(8)
list.add(9)
list.add(10)

to

arrayListOf(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

where

public fun <T> arrayListOf(vararg elements: T): ArrayList<T> =
    if (elements.size == 0) ArrayList() else ArrayList(ArrayAsCollection(elements, isVarargs = true))

private class ArrayAsCollection<T>(val values: Array<out T>, val isVarargs: Boolean) : Collection<T> {
    /* other members omitted... */
    public fun toArray(): Array<out Any?> = /* inline function expanding to: */
        if (isVarargs && values.javaClass == Array<Any?>::class.java)
            (values as Array<Any?>)
        else
            java.util.Arrays.copyOf(values, values.size, Array<Any?>::class.java)
}

and the following constructor from OpenJDK:

    public ArrayList(Collection<? extends E> c) {
        elementData = c.toArray();
        size = elementData.length;
        if (elementData.getClass() != Object[].class)
            elementData = Arrays.copyOf(elementData, size, Object[].class);
    }

In other words, you are comparing an implementation that first creates the ArrayList, and then inserts the elements to an implementation that creates an intermediate Collection backed by the varargs array, which is then passed to the constructor of ArrayList, which in turn unwraps the array and ends up using that as its backing array.

This was only an example. What about other Collections not using a backed array? The benchmark results show that test #3 is the best one yet.

Specifying the initial capacity is not really relevant to this discussion, so your benchmark essentially just compares two versions. And those two differ in so many ways that it is not safe to say that the difference in performance is caused by unrolling. A more reasonable test would be to compare you own implementation of arrayListOf to the unrolled version.

@Varia I made that test and added the Blackhole, but @H4xX0r1337’s proposal remains valid.

However it is not clear for me what is proposable.

Hi @H4xX0r1337,
I squash here some personal considerations, I hope this helps.

The ⁠inline vararg modifier is not required, an inline function can automatically inline parameters, adding ⁠inline modifier does not helps a lot.

I think that the whole discussion can be reduce to consider array as a constant and able the compiler to optimizing it.
If I wrote the code 1+2 the compiler can replace it with 3, similar if I write arrayOf(1)+arrayOf(2) the compiler should replace it with arrayOf(1, 2), it replace arrayOf(0,0,0).size with 3 and, finally, it unroll for loop on constant array.

I think that the above consideration should be enough for the compiler to perform all the requested optimization.

Currently Hotspot compiler is able to perform this optimization, maybe Graal does or will do it. In such case this proposal is not really interesting.

Another consideration is about performance: in a real world application the impact can be negligible, so the compilation phase take longer, application are larger and there is no difference with the unoptimized version.
So we need a use case with valuable results.
E.g.: performing this optimization in a typical IA library can boost the performance of 5%
But “this optimization reduce the listOf(1,2,3) cost of 1 microsecond” is not a good starting point.

@darksnake do you have some consideration about this enhancement?