Why is Kotlin Native much slower than JVM?

Hi there,

I have created a simple program to generate pairs of twin primes (meaning the second prime is the first prime + 2, e.g. 3 and 5, or 5 and 7).

class Primes : Sequence<Long> {
  // Lazy trial-division sieve: each call to next() takes the head of the
  // candidate sequence and wraps the remainder in yet another filter, so
  // the k-th prime is produced through a chain of k stacked filters.
  var next: Sequence<Long> = generateSequence(3L) { it + 2L }
  override operator fun iterator() = object : Iterator<Long> {
    override fun hasNext() = true
    override fun next(): Long {
      val head = next.first()
      next = next.filter { it % head != 0L }
      return head
    }
  }
}

fun printTwinPrimes() {
  Primes().zipWithNext().filter { it.second - it.first == 2L }.take(500).forEachIndexed { i, it -> println("$i: $it") }
}
fun main(args: Array<String>) = printTwinPrimes()

When I run this on the JVM (via the Gradle Kotlin plugin) it takes about 3 minutes on my machine.
When I run it as a native executable (via the Gradle konan plugin) it takes about 30 minutes.
Is there an explanation for this?

Did you enable optimization in the konan backend? The JVM JIT compiler optimizes quite well, taking actual usage data into account. For native binaries you can do something similar (if you have a test suite) by using profile-guided optimization, where you create a profile by running the test suite and use it to tune a second optimization/compilation pass. On modern hardware the difference between optimized and unoptimized code can be very significant.

Having said that, Kotlin Native is not quite ready for a stable release yet and I don’t think that optimization is one of the key priorities. Of course the standard LLVM backend optimizations are available, but Kotlin-specific optimizations may not be. Overall, native code being slower in the hot path than the JVM is not surprising (the JVM generates and runs native code; it is not an interpreter), but the gap should not be of the magnitude you are observing.

6 Likes

Thanks for the fast reply. How do I go about setting up this profile guided optimization?
I have written a simple test using kotlin.test.

You’ll have to have a look at the Clang Compiler User’s Manual. Note that this is not trivial; it involves significant complexity. Before that, make sure you have turned on optimization of your code and told LLVM to generate code tuned for your architecture (use -mtune or -march). All of these are mainly backend functions, so there’s no reason konan (which is a front end) should not support them.

1 Like

Is it the build or the run that is taking longer? I have been working a lot in native lately, and konan does seem to be very, very slow, but the resulting executables seem to run very fast.

It’s the execution that is slow; the build is not.
I want to try changing the implementation away from sequences to pre-allocated arrays and see if that’s faster, just out of curiosity.
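For reference, a pre-allocated array version might look roughly like this (a sketch only: the sieve bound of 1_000_000 and the function name twinPrimes are my own assumptions, not anything from this thread):

```kotlin
// Hypothetical array-based variant: a classic Eratosthenes sieve over a
// pre-allocated BooleanArray, then a single scan for twin pairs.
// Assumption: the requested count of twin pairs fits below `limit`
// (for 5000 pairs, 1_000_000 is more than enough).
fun twinPrimes(count: Int, limit: Int = 1_000_000): List<Pair<Long, Long>> {
    val composite = BooleanArray(limit + 1)
    for (p in 2..limit) {
        if (composite[p]) continue
        // Use Long for the multiple to avoid Int overflow at p * p.
        var m = p.toLong() * p
        while (m <= limit) {
            composite[m.toInt()] = true
            m += p
        }
    }
    val twins = ArrayList<Pair<Long, Long>>(count)
    for (p in 3..limit - 2) {
        if (!composite[p] && !composite[p + 2]) {
            twins.add(p.toLong() to (p + 2).toLong())
            if (twins.size == count) break
        }
    }
    return twins
}
```

This trades memory (one BooleanArray) for the repeated filtering work in the sequence versions, and it allocates nothing inside the sieve loop.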

Btw. The default configuration does not enable optimization in the native target.

I got
program("app") { enableOptimizations(true) }
in my build.gradle. I guess that should enable optimization

Hi everyone here is the next try:

package main

// Trial-division sieve over eagerly materialized lists instead of
// lazily chained sequences.
class PrimeIter : Sequence<Long> {
  var next: List<Long> = (3L..1001L step 2).toList()
  val old: MutableList<Long> = mutableListOf<Long>()
  override operator fun iterator() = object : Iterator<Long> {
    override fun hasNext() = true
    override fun next(): Long {
      val head = next[0]
      old.add(head)
      next = next.filter { it % head != 0L }
      var delta = 0L
      // When the current block of candidates is exhausted, generate a
      // larger block and filter it against all primes found so far.
      while (next.isEmpty()) {
        next = ((head + 2L)..(head + delta + 1002L) step 2).toList()
        next = next.filter { x ->
          old.all { y ->
            x % y != 0L
          }
        }
        delta += 1000L
      }
      return head
    }
  }
}

fun printTwinIterPrimes() {
  PrimeIter()
    .zipWithNext()
    .filter { it.second - it.first == 2L }
    .take(500)
    .forEachIndexed { i, it -> println("$i: $it") }
}

It yields 500 twin primes in under 1 second (compiled to native).

1 Like

One thing I had a look at is that konan does not allow specifying code-generation options. Using the SIMD instruction sets available on modern processors makes a big difference for calculation-heavy tasks, but compilers are normally conservative in the code they generate (to avoid using unavailable instructions). For this reason it is possible in gcc and clang to specify the hardware level (what instructions are allowed) as well as tuning (what processor to optimize for). I assume that this will be added at some point.

In the meantime you can compile to LLVM bitcode and use llc to compile it to an executable.

Does this mean Kotlin native is much much faster than JVM? (1 sec vs 3 mins)

You should never expect any such drastic improvements. The JVM is heavily optimized and any hot codepath will be JIT-compiled to optimized machine code (for the machine it runs on). There are 3 (or 4) areas where native wins in speed over Java/JVM:

  • Startup time: Native code is already optimized and has normally tiny runtimes. So starting is fast. The JVM is slow to start.
  • Value types: Java does not support value types, meaning there is less locality of data (more indirection and cache misses).
  • Memory usage: The JVM (like VMs in general) is quite a memory hog, also in data size. As such, a data set that fits into cache for a native program might not for a JVM version.
  • (Hand-written assembly): In a few circumstances hand-written assembly can make a big difference. The JVM does not allow this.

Note that most of these are related to the language model/runtime, so I don’t expect great improvements with Kotlin Native. Only startup time should be a lot better (for small programs which don’t have a lot of their own startup time).
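The value-type/boxing point can be seen in Kotlin itself. A small sketch (illustrative only, not a benchmark): going through Sequence<Long> allocates a boxed Long per element, while LongArray stores raw 64-bit values.

```kotlin
// Boxing-heavy path: every element of a Sequence<Long> is a boxed Long
// (java.lang.Long on the JVM, or the native equivalent).
fun sumBoxed(n: Int): Long =
    generateSequence(1L) { it + 1L }
        .take(n)
        .sum()

// Primitive path: LongArray stores unboxed longs contiguously, so the
// loop below does no per-element allocation and has good data locality.
fun sumUnboxed(n: Int): Long {
    val values = LongArray(n) { (it + 1).toLong() }
    var total = 0L
    for (v in values) total += v
    return total
}
```

Both compute the same sum; the difference is purely in allocation and memory layout, which is exactly where the indirection and cache-miss costs above come from.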

1 Like

To get more comparable times I increased the number to 5000 twin primes.
The JVM version takes 12s to produce that amount.
The native version takes 2 minutes(!).
The focus of Kotlin/Native is not better performance than the JVM, but making Kotlin available on platforms where you can’t have a JVM.
If you need C-like performance and want a modern language, look into Go, Nim or Rust.

BTW: I made a naive implementation in Rust using iterators → that one takes 50 seconds for 5000 twin primes.
:smiley: It’s not so trivial to beat the JVM in terms of performance

2 Likes

Your results on Rust probably mean that you are doing something wrong, and probably both in Rust and Kotlin-native. The JVM is much more lenient toward programmers' algorithmic mistakes due to its very powerful optimizations. But for an optimized program without parallelization, native should still give similar results, not significantly worse ones.

2 Likes

@pdvrieze you are offering a lot of recommendations without giving any hints on where we might find practical ways to apply the LLVM/gcc/bitcode/SIMD optimization options, or examples of the same. Yes, in theory native compilation via LLVM is a very strong and sensible choice by the Kotlin team, but in practice there’s no documentation trail to make these good choices even relevant when compared to the JVM.

Where can we read more about even getting a non-zero -O at native compile time? So far I’m coming up blank finding a path out of embarrassing native performance.

1 Like

K/N is in beta, do not expect the same documentation level as its in-production counterparts lol

I can confirm K/N is ~10 times slower than K/JVM in my code, which is disappointing. Frankly I expected more or less the same or better performance. @jim any findings on that? Still hoping to get early estimations and measurements, and still believing there are no [fundamental] reasons for it to be slower [considering ARC as the major reason].

You need to perform careful profiling to understand what makes your code slow. This is true not only for Kotlin-native, but for other native compilers as well. As for Kotlin-native, I have not tried it, but my first suspects would be unnecessary memory allocation and boxed primitives. While the first can be more or less optimized away by LLVM, the second cannot.
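As a first rough cut before reaching for a real profiler, you can time suspect sections with kotlin.system.measureTimeMillis. A sketch (the boxed-vs-primitive comparison is only illustrative, and the absolute numbers are machine-dependent):

```kotlin
import kotlin.system.measureTimeMillis

// Rough micro-timing sketch: compare a boxing-heavy path against a
// primitive-array path. This only hints at where time goes; a real
// profiler gives far better data.
fun main() {
    val n = 1_000_000
    val expected = n.toLong() * (n + 1) / 2   // sum of 1..n

    val boxedMs = measureTimeMillis {
        var total = 0L
        (1..n).asSequence().map { it.toLong() }.forEach { total += it }
        check(total == expected)
    }
    val primitiveMs = measureTimeMillis {
        val a = LongArray(n) { (it + 1).toLong() }
        var total = 0L
        for (v in a) total += v
        check(total == expected)
    }
    println("boxed: ${boxedMs}ms, primitive: ${primitiveMs}ms")
}
```

If the gap between two such variants is much larger on K/N than on the JVM, that points at the runtime (allocation/boxing) rather than at your algorithm.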

That’s right, but I guess the question is "is the problem in my code or in the compiler/runtime?". Any instructions on how to answer that clearly?

Both, of course. The runtime is not optimized yet and will probably never reach the JVM's level of optimization, but if you avoid some specific problems, performance can already be made quite good. The problem is that people coming from the JVM are rather spoiled in terms of performance: you can write code any way you like and it will still be fast. It does not work that way in a native compiler (in any native compiler).