Why is Kotlin Native much slower than JVM?


#1

Hi there,

I have created a simple program to generate pairs of twin primes (meaning the second prime is the first prime plus 2, e.g. 3 and 5, or 5 and 7).

class Primes : Sequence<Long> {
  // Lazy candidate stream: all odd numbers starting at 3.
  var next: Sequence<Long> = generateSequence(3L) { it + 2L }
  override operator fun iterator() = object : Iterator<Long> {
    override fun hasNext() = true
    override fun next(): Long {
      // The smallest remaining candidate is the next prime; drop its
      // multiples by stacking another filter onto the sequence
      // (a lazy trial-division sieve).
      val head = next.first()
      next = next.filter { it % head != 0L }
      return head
    }
  }
}

fun printTwinPrimes() {
  Primes()
    .zipWithNext()
    .filter { it.second - it.first == 2L }
    .take(500)
    .forEachIndexed { i, it -> println("$i: $it") }
}
fun main(args: Array<String>) = printTwinPrimes()

When I run this on the JVM (via the Gradle Kotlin plugin) it takes about 3 minutes on my machine.
When I run it as a native executable (via the Gradle konan plugin) it takes about 30 minutes.
Is there an explanation for this?


#2

Did you enable optimization in the konan backend? The JVM's JIT compiler optimizes quite well, taking actual usage data into account. For native binaries you can get something similar (if you have a test suite) by using profile-guided optimization, where you run the test suite to create a profile that then tunes a second optimization/compilation pass. On modern hardware, enabling optimization can make a very significant difference.

Having said that, Kotlin/Native is not quite ready for a stable release yet, and I don't think optimization is one of its key priorities. The standard LLVM backend optimizations are of course available, but Kotlin-specific optimizations may not be. Native code being slower in the hot path than the JVM is not surprising in itself (the JVM generates and runs native code; it is not an interpreter), but the difference should not be of the magnitude you are observing.


#3

Thanks for the fast reply. How do I go about setting up this profile-guided optimization?
I have written a simple test using kotlin.test.


#4

You'll have to have a look at https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization . Note that this is not trivial; it involves significant complexity. Before that, make sure you have turned on optimization of your code and told LLVM to generate code tuned for your architecture (use -mtune or -march). All of these are mainly backend functions, so there is no reason konan (as a front end) should not be able to support them.
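For reference, the generic clang workflow from that page looks roughly like the sketch below. This is plain clang/LLVM usage, not something the konan plugin exposes directly, and the file names are placeholders:

# 1. Build an instrumented binary (main.c is a placeholder source file).
clang -O2 -fprofile-instr-generate main.c -o main
# 2. Run the test suite; the instrumented binary writes default.profraw.
./main
# 3. Merge the raw profile data into a single profile.
llvm-profdata merge -output=main.profdata default.profraw
# 4. Rebuild, using the profile to guide optimization.
clang -O2 -fprofile-instr-use=main.profdata main.c -o main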


#5

Is it the build or the run that is taking longer? I have been working a lot in Native lately, and konan does seem to be very, very slow, but the resulting executables seem to run very fast.


#6

It's the execution that is slow.
The build is not slow.
I want to try changing the implementation away from sequences to pre-allocated arrays and see if that's faster, just out of curiosity.


#7

By the way, the default configuration does not enable optimization in the native target.


#8

I have

program("app") { enableOptimizations(true) }

in my build.gradle. I guess that should enable optimization.


#9

Hi everyone, here is the next try:

package main

class PrimeIter : Sequence<Long> {
  // Current window of odd candidate numbers.
  var next: List<Long> = (3L..1001L step 2).toList()
  // All primes found so far.
  val old: MutableList<Long> = mutableListOf<Long>()
  override operator fun iterator() = object : Iterator<Long> {
    override fun hasNext() = true
    override fun next(): Long {
      val head = next[0]
      old.add(head)
      // Remove head and its multiples from the current window.
      next = next.filter { it % head != 0L }
      var delta = 0L
      // When the window is exhausted, build a new (growing) window of
      // candidates and sieve it against all primes found so far.
      while (next.size == 0) {
        next = ((head + 2L)..(head + delta + 1002L) step 2).toList()
        next = next.filter { x ->
          old.all { y ->
            x % y != 0L
          }
        }
        delta += 1000L
      }
      return head
    }
  }
}
fun printTwinIterPrimes() {
  PrimeIter()
    .zipWithNext()
    .filter { it.second - it.first == 2L }
    .take(500)
    .forEachIndexed { i, it -> println("$i: $it") }
}

It yields 500 twin primes in under 1 second (compiled to native).


#10

One thing I looked into is that konan does not allow specifying code generation options. Using the SIMD instruction sets available on modern processors makes a big difference for calculation-heavy tasks, but compilers are normally conservative in the code they generate (to avoid emitting unavailable instructions). For this reason gcc and clang let you specify the hardware level (which instructions are allowed, -march) as well as tuning (which processor to optimize for, -mtune). I assume this will be added at some point.

In the meantime you can emit LLVM bitcode and use llc to turn it into an executable.
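Roughly, that route looks like the sketch below. This assumes a konanc build that supports -produce bitcode (flag names vary between versions), main.kt is a placeholder file name, and linking against the Kotlin/Native runtime is glossed over here.

# Emit LLVM bitcode instead of a finished executable (produces main.bc).
konanc -produce bitcode main.kt -o main
# Compile the bitcode to an object file, tuned for the local CPU.
llc -O3 -mcpu=native -filetype=obj main.bc -o main.o
# Link into an executable (runtime libraries omitted for brevity).
clang main.o -o main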