Using lambda and method reference differ significantly in performance with Java 8 streams

I have found that when working with Java 8 streams, using lambda and method reference can differ significantly in performance.

Here is the code to invert an array of 1 GB executed both sequentially and in parallel:

inline fun warmupOnceAndMeasureTimeMillis(block: () -> Unit): Long {
    block()
    return measureTimeMillis(block)
}

fun Sequence<Long>.toLongArray(size: Int): LongArray {
    val iterator = iterator()
    return LongArray(size) { iterator.next() }
}

val size = 1 shl 27
val array = LongArray(size) { Random.nextLong() }
println(warmupOnceAndMeasureTimeMillis {
    val resultArray = LongArray(size)
    for (i in 0 until size) resultArray[i] = array[i].inv()
})

println(warmupOnceAndMeasureTimeMillis { array.asSequence().map { it.inv() }.toLongArray(size) })
println(warmupOnceAndMeasureTimeMillis { array.asSequence().map(Long::inv).toLongArray(size) })

println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).map { it.inv() }.toArray() })
println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).map(Long::inv).toArray() })

println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).parallel().map { it.inv() }.toArray() })
println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).parallel().map(Long::inv).toArray() })

And here is the result:

271
1285
1277
183
1173
156
494

With Kotlin’s sequence, using lambda and method reference have roughly the same performance. However, with Java 8 streams, using lambda is much faster than method reference. One would tend to think that using method reference should be faster than lambda because there is no additional closure. Why does this happen?

And interestingly, the sequential Java 8 streams implementation with lambda seems to be faster than the imperative sequential implementation. What is going on underneath?

1 Like

Don’t quote me on this, but AFAIK the Java runtime does some clever optimizations depending on how your code is structured, and so I think that using a method reference in Kotlin might be confusing it and making it take the longer path? I’m honestly not entirely sure lol. It might be related to how invokedynamic works under the hood, and so maybe method references is confusing it more and causing it to regenerate the lambda everytime? It might be related to having to generate a lambda that already knows what method it will virtually dispatch to before getting called, and so for every item a new lambda gets created based on the runtime type of that item, while in the straight forward lambda case, only one lambda gets created and virtual dispatch happens just in time? I’m honestly not quite sure. Just throwing ideas out here and there lmao.

This might be due to the fact that streams have certain guarantees when you call map on them for example, while a for loop doesn’t have those guarantees because it is more powerful. And so it might just be the runtime optimizing the map call and maybe running it at a much lower level because it is 100% sure that it will run in order and do the exact same operation on each element, while a for loop can have a break in the middle of it for example.

1 Like

You should be using a framework like JMH to do benchmarks like this. Otherwise there is a high chance that your benchmark just returns invalid results. The JVM does a lot of optimizations which could completly invalidate your benchmark, eg. it could realize that you never use the result any of those lambdas and then just not execute it.

3 Likes

Run the same benchmarks using GraalVM, it is able to optimize bytecode generated from Kotlin or Scala.

2 Likes

I ran your test 10 times in a loop, and watched until the numbers stopped changing, so I could be sure that the result is valid. It is.

It’s because your stream is a LongStream, and LongStream.map takes a LongUnaryOperator argument.

Looking at the generated bytecode, we can see that when you use a lambda, the compiler compiles it into a static singleton LongUnaryOperator directly, but when you use a method reference, it gets a kotlin.jvm.functions.Function1 for the reference, and has to make a LongUnaryOperator wrapper for it.

That Function1 is a generic interface that takes and returns objects, and that means that the wrapper has to box and unbox Long to/from long, and that is why it takes so much longer.

I don’t know why the kotlin compiler does this – the way it compiles method references certainly seems suboptimal. If you’re not using primitive types, though, the speed difference will go away.

Here’s relevant byte code that makes the wrapper.

  public final invoke()V
   L0
    LINENUMBER 27 L0
    ALOAD 0
    GETFIELD LambdaspeedKt$main$$inlined$repeat$lambda$2.$array$inlined : [J
    INVOKESTATIC java/util/Arrays.stream ([J)Ljava/util/stream/LongStream;
    GETSTATIC LambdaspeedKt$main$2$2$1.INSTANCE : LLambdaspeedKt$main$2$2$1;
    CHECKCAST kotlin/jvm/functions/Function1
    DUP
    IFNULL L1
    ASTORE 1
    NEW LambdaspeedKt$sam$i$java_util_function_LongUnaryOperator$0
    DUP
    ALOAD 1
    INVOKESPECIAL LambdaspeedKt$sam$i$java_util_function_LongUnaryOperator$0.<init> (Lkotlin/jvm/functions/Function1;)V
   L1
5 Likes

OK thanks. I will try rewriting the tests with JMH when I have time.

1 Like

GraalVM seems to be able to do many advanced optimizations. Is it safe to use it in production environments?

1 Like

This is very comprehensive and makes perfect sense. Thank you very much.

1 Like

I believe so. Maybe @darksnake knows better.

1 Like

GraalVM is just a regular Jvm. You can use it everywhere, you would use, say OpenJDK. It has much better escape analysis optimization, but for now it does not feature more modern GC options like ZGC.

3 Likes

I wrote a benchmark for this here: https://github.com/Wasabi375/lambdaBenchmarkTest

Results are in seconds / op

avgt error
parallel stream fun ref 0,648622 0,081415
parallel stream lambda 0,249686 0,01073
sequence fun ref 1,105237 0,14602
sequence lambda 1,094196 0,174193
stream fun ref 1,357278 0,224631
stream lambda 0,332191 0,0641
3 Likes

For some reason your pom contains ancient kotlin compiler. I fixed version to 1.4.10 and run it on GraalVM 11. Here are the results:

Benchmark Mode Cnt Score Error Units
LambdaBenchmark.parallelStreamFunRef avgt 5 0.203 0.012 s/op
LambdaBenchmark.parallelStreamLambda avgt 5 0.2 0.001 s/op
LambdaBenchmark.sequenceFunRef avgt 5 0.222 0.003 s/op
LambdaBenchmark.sequenceLambda avgt 5 0.224 0.016 s/op
LambdaBenchmark.streamFunRef avgt 5 0.233 0.004 s/op
LambdaBenchmark.streamLambda avgt 5 0.233 0.001 s/op

The 10% difference is most probably caused by primitive boxing which could be avoided in specialized streams.

2 Likes

This is very impressive! So sequences can actually become faster than streams in this case with GraalVM?

The benchmarking is a complicated thing. Here the score is in seconds per operation, so sequence processing is still 10% slower, than streams. But in order to understand the whole picture, one need to test it with more complicated objects and with more complicated logic inside the processing pipe. You can take the benchmark done by @Wasabi375, fix the version of the compiler and play around. I am running the benchmark on AdoptOpenJDK 14 and will post it here as soon as it finished.

1 Like

To make this thread complete, here are the results for AdoptOpenJDK 14 (with Eclipse OpenJ9 inside):

Benchmark Mode Cnt Score Error Units
LambdaBenchmark.parallelStreamFunRef avgt 5 0.43 0.121 s/op
LambdaBenchmark.parallelStreamLambda avgt 5 10.839 0.515 s/op
LambdaBenchmark.sequenceFunRef avgt 5 2.174 0.1 s/op
LambdaBenchmark.sequenceLambda avgt 5 2.054 0.106 s/op
LambdaBenchmark.streamFunRef avgt 5 1.694 0.279 s/op
LambdaBenchmark.streamLambda avgt 5 0.763 0.06 s/op

The results are really strange and I will recheck them, but is seems like performance is much worse even for streams (it is the same machine).

2 Likes

Nice catch. I swear I set it up with version 1.4.0… Probably messed that up because I’m not used to working with maven. I copied the pom from some github repo. I probably changed the version on accident :frowning_face:

1 Like

I would like to draw your attention to https://github.com/Kotlin/kotlinx-benchmark library. It is in permanent development and is a bit bugged, but could be used. I have some working examples in kmath.

3 Likes

Yeah I noticed this library as an alternative to JMH. But as you said, it’s still in incubation and I couldn’t find comprehensive docs and tutorials.

1 Like

It is not alternative. It uses JMH inside, just gives it a kotlin-friendly wrapper.

1 Like