Using lambda and method reference differ significantly in performance with Java 8 streams

ShreckYe · September 27, 2020, 9:34am

I have found that when working with Java 8 streams, using lambda and method reference can differ significantly in performance.

Here is the code to invert an array of 1 GB executed both sequentially and in parallel:

inline fun warmupOnceAndMeasureTimeMillis(block: () -> Unit): Long {
    block()
    return measureTimeMillis(block)
}

fun Sequence<Long>.toLongArray(size: Int): LongArray {
    val iterator = iterator()
    return LongArray(size) { iterator.next() }
}

val size = 1 shl 27
val array = LongArray(size) { Random.nextLong() }
println(warmupOnceAndMeasureTimeMillis {
    val resultArray = LongArray(size)
    for (i in 0 until size) resultArray[i] = array[i].inv()
})

println(warmupOnceAndMeasureTimeMillis { array.asSequence().map { it.inv() }.toLongArray(size) })
println(warmupOnceAndMeasureTimeMillis { array.asSequence().map(Long::inv).toLongArray(size) })

println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).map { it.inv() }.toArray() })
println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).map(Long::inv).toArray() })

println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).parallel().map { it.inv() }.toArray() })
println(warmupOnceAndMeasureTimeMillis { Arrays.stream(array).parallel().map(Long::inv).toArray() })

And here is the result:

With Kotlin’s sequence, using lambda and method reference have roughly the same performance. However, with Java 8 streams, using lambda is much faster than method reference. One would tend to think that using method reference should be faster than lambda because there is no additional closure. Why does this happen?

And interestingly, the sequential Java 8 streams implementation with lambda seems to be faster than the imperative sequential implementation. What is going on underneath?

kyay10 · September 27, 2020, 10:19am

Don’t quote me on this, but AFAIK the Java runtime does some clever optimizations depending on how your code is structured, and so I think that using a method reference in Kotlin might be confusing it and making it take the longer path? I’m honestly not entirely sure lol. It might be related to how invokedynamic works under the hood, and so maybe method references is confusing it more and causing it to regenerate the lambda everytime? It might be related to having to generate a lambda that already knows what method it will virtually dispatch to before getting called, and so for every item a new lambda gets created based on the runtime type of that item, while in the straight forward lambda case, only one lambda gets created and virtual dispatch happens just in time? I’m honestly not quite sure. Just throwing ideas out here and there lmao.

kyay10 · September 27, 2020, 10:22am

This might be due to the fact that streams have certain guarantees when you call map on them for example, while a for loop doesn’t have those guarantees because it is more powerful. And so it might just be the runtime optimizing the map call and maybe running it at a much lower level because it is 100% sure that it will run in order and do the exact same operation on each element, while a for loop can have a break in the middle of it for example.

Wasabi375 · September 27, 2020, 10:54am

You should be using a framework like JMH to do benchmarks like this. Otherwise there is a high chance that your benchmark just returns invalid results. The JVM does a lot of optimizations which could completly invalidate your benchmark, eg. it could realize that you never use the result any of those lambdas and then just not execute it.

lamba92 · September 27, 2020, 11:37am

Run the same benchmarks using GraalVM, it is able to optimize bytecode generated from Kotlin or Scala.

mtimmerm · September 27, 2020, 12:58pm

I ran your test 10 times in a loop, and watched until the numbers stopped changing, so I could be sure that the result is valid. It is.

It’s because your stream is a LongStream, and LongStream.map takes a LongUnaryOperator argument.

Looking at the generated bytecode, we can see that when you use a lambda, the compiler compiles it into a static singleton LongUnaryOperator directly, but when you use a method reference, it gets a kotlin.jvm.functions.Function1 for the reference, and has to make a LongUnaryOperator wrapper for it.

That Function1 is a generic interface that takes and returns objects, and that means that the wrapper has to box and unbox Long to/from long, and that is why it takes so much longer.

I don’t know why the kotlin compiler does this – the way it compiles method references certainly seems suboptimal. If you’re not using primitive types, though, the speed difference will go away.

Here’s relevant byte code that makes the wrapper.

  public final invoke()V
   L0
    LINENUMBER 27 L0
    ALOAD 0
    GETFIELD LambdaspeedKt$main$$inlined$repeat$lambda$2.$array$inlined : [J
    INVOKESTATIC java/util/Arrays.stream ([J)Ljava/util/stream/LongStream;
    GETSTATIC LambdaspeedKt$main$2$2$1.INSTANCE : LLambdaspeedKt$main$2$2$1;
    CHECKCAST kotlin/jvm/functions/Function1
    DUP
    IFNULL L1
    ASTORE 1
    NEW LambdaspeedKt$sam$i$java_util_function_LongUnaryOperator$0
    DUP
    ALOAD 1
    INVOKESPECIAL LambdaspeedKt$sam$i$java_util_function_LongUnaryOperator$0.<init> (Lkotlin/jvm/functions/Function1;)V
   L1

ShreckYe · September 27, 2020, 5:51pm

OK thanks. I will try rewriting the tests with JMH when I have time.

ShreckYe · September 27, 2020, 5:53pm

GraalVM seems to be able to do many advanced optimizations. Is it safe to use it in production environments?

ShreckYe · September 27, 2020, 5:54pm

This is very comprehensive and makes perfect sense. Thank you very much.

lamba92 · September 27, 2020, 7:21pm

I believe so. Maybe @darksnake knows better.

darksnake · September 27, 2020, 7:39pm

GraalVM is just a regular Jvm. You can use it everywhere, you would use, say OpenJDK. It has much better escape analysis optimization, but for now it does not feature more modern GC options like ZGC.

Wasabi375 · September 27, 2020, 7:41pm

I wrote a benchmark for this here: GitHub - Wasabi375/lambdaBenchmarkTest

Results are in seconds / op

	avgt	error
parallel stream fun ref	0,648622	0,081415
parallel stream lambda	0,249686	0,01073
sequence fun ref	1,105237	0,14602
sequence lambda	1,094196	0,174193
stream fun ref	1,357278	0,224631
stream lambda	0,332191	0,0641

darksnake · September 28, 2020, 7:32am

For some reason your pom contains ancient kotlin compiler. I fixed version to 1.4.10 and run it on GraalVM 11. Here are the results:

Benchmark	Mode	Cnt	Score	Error	Units
LambdaBenchmark.parallelStreamFunRef	avgt	5	0.203	0.012	s/op
LambdaBenchmark.parallelStreamLambda	avgt	5	0.2	0.001	s/op
LambdaBenchmark.sequenceFunRef	avgt	5	0.222	0.003	s/op
LambdaBenchmark.sequenceLambda	avgt	5	0.224	0.016	s/op
LambdaBenchmark.streamFunRef	avgt	5	0.233	0.004	s/op
LambdaBenchmark.streamLambda	avgt	5	0.233	0.001	s/op

The 10% difference is most probably caused by primitive boxing which could be avoided in specialized streams.

ShreckYe · September 28, 2020, 7:59am

This is very impressive! So sequences can actually become faster than streams in this case with GraalVM?

darksnake · September 28, 2020, 8:02am

The benchmarking is a complicated thing. Here the score is in seconds per operation, so sequence processing is still 10% slower, than streams. But in order to understand the whole picture, one need to test it with more complicated objects and with more complicated logic inside the processing pipe. You can take the benchmark done by @Wasabi375, fix the version of the compiler and play around. I am running the benchmark on AdoptOpenJDK 14 and will post it here as soon as it finished.

darksnake · September 28, 2020, 8:45am

To make this thread complete, here are the results for AdoptOpenJDK 14 (with Eclipse OpenJ9 inside):

Benchmark	Mode	Cnt	Score	Error	Units
LambdaBenchmark.parallelStreamFunRef	avgt	5	0.43	0.121	s/op
LambdaBenchmark.parallelStreamLambda	avgt	5	10.839	0.515	s/op
LambdaBenchmark.sequenceFunRef	avgt	5	2.174	0.1	s/op
LambdaBenchmark.sequenceLambda	avgt	5	2.054	0.106	s/op
LambdaBenchmark.streamFunRef	avgt	5	1.694	0.279	s/op
LambdaBenchmark.streamLambda	avgt	5	0.763	0.06	s/op

The results are really strange and I will recheck them, but is seems like performance is much worse even for streams (it is the same machine).

Wasabi375 · September 29, 2020, 6:23pm

Nice catch. I swear I set it up with version 1.4.0… Probably messed that up because I’m not used to working with maven. I copied the pom from some github repo. I probably changed the version on accident

darksnake · September 29, 2020, 6:26pm

I would like to draw your attention to GitHub - Kotlin/kotlinx-benchmark: Kotlin multiplatform benchmarking toolkit library. It is in permanent development and is a bit bugged, but could be used. I have some working examples in kmath.

ShreckYe · September 30, 2020, 9:14am

Yeah I noticed this library as an alternative to JMH. But as you said, it’s still in incubation and I couldn’t find comprehensive docs and tutorials.

darksnake · September 30, 2020, 9:57am

It is not alternative. It uses JMH inside, just gives it a kotlin-friendly wrapper.

Topic		Replies	Views
Lambda vs function reference Language Design	3	4232	April 6, 2019
Method Reference vs Lambda Language Design	4	3565	March 31, 2022
How are Lambdas compiled to bytecode Language Design	7	2233	October 15, 2019
Sequences and Inlined Lambdas Language Design	25	4345	September 5, 2022
Kotlin sequences vs. Java streams Support	11	11648	January 28, 2020

Using lambda and method reference differ significantly in performance with Java 8 streams

Related topics