I recently noticed that I frequently use generic inline functions just to avoid the boxing/unboxing of primitives. That works well for small functions, but even a small, innocent-looking combination of such functions may produce pretty large byte/machine code output. As a result:
Pressure on the JVM rises: it needs more time to JIT the bytecode and probably applies fewer optimizations.
CPU instruction caches are thrashed.
I think it’s worth considering the idea of function specialization instead of inlining. When you invoke a generic function, the compiler would generate a specialized version of it. The JVM will later handle the inlining if it’s appropriate from its point of view. Possible implementations:
Tag such functions with a `specialize` modifier instead of `inline`.
Impose some configurable (?) limit on inline-function “density” on the compiler side, so it switches from inlining to specialization at some threshold.
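A rough sketch of what the first option could look like (the `specialize` modifier and the generated names are hypothetical; nothing like this exists in Kotlin today):

```kotlin
// Hypothetical syntax; `specialize` is not a real Kotlin modifier.
// Instead of copying the body into every call site (as `inline` does),
// the compiler would emit one shared copy per concrete type signature.
specialize fun <P, R> invoke(f: (P, Int) -> R, p: P): R = f(p, 12345)

// For a call site like `invoke(::add, counter)` with Int arguments, the
// compiler could generate roughly this (names invented for illustration).
// Note that a function-typed parameter would itself still box unless the
// known function reference is bound directly, as in the second variant:
fun invokeIntInt(f: (Int, Int) -> Int, p: Int): Int = f(p, 12345)
fun invokeAddIntInt(p: Int): Int = add(p, 12345)
```

The second variant is exactly what `invokeSpecialized` in the benchmark further down emulates by hand.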
It bears keeping in mind that the JVM does a lot of optimization. The code generated by the Java compiler is not particularly optimized either. It is not too hard for an optimizer to elide boxing where appropriate (although mainly with small inlineable functions). It would be worthwhile to see what an actual JVM generates in machine code for these examples. One snag, however, could be Android, where there are perhaps more pressures on the bytecode itself.
I think it’s worth to consider the idea of function specialization instead of inlining.
Agreed. Inlining is something the JIT does a lot itself, and there’s a limit. The JIT may prefer to inline a different method instead, and it may be right.
Specialization is something the JIT can’t do. In some cases it might help a lot, e.g., when a polymorphic call site can be made monomorphic. Inlining would help, too, but the inlining limit gets hit pretty fast.
Do you have an example of a “generic inline functions just to avoid the boxing/unboxing of primitives”? I don’t really understand what you are getting at.
To my understanding, inlining in Kotlin is mainly intended for (but not limited to) avoiding unnecessary object creation with lambda functions, a case that the JVM is not capable of optimising on the fly.
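For illustration, a minimal example of that case (function names here are made up, not from this thread):

```kotlin
// `inline` copies both the function body and the lambda body into the
// call site, so no Function object is allocated and no Int is boxed.
inline fun repeatTimes(n: Int, action: (Int) -> Unit) {
    for (i in 0 until n) action(i)
}

// Without `inline`, each call goes through a Function1<Integer, Unit>
// instance, and every `i` is boxed to java.lang.Integer on the way in.
fun repeatTimesNoInline(n: Int, action: (Int) -> Unit) {
    for (i in 0 until n) action(i)
}

fun main() {
    var sum = 0
    repeatTimes(5) { sum += it } // compiles down to a plain loop over ints
    println(sum)
}
```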
If by “the JVM” we mean Hotspot then it can sometimes optimise out unnecessary object allocations. The problems are:
Android has a weak(er) JVM
Such optimisations are not 100% reliable
Such optimisations do not help interpreted code
In cases where inlining doesn’t happen, you can end up with profile pollution
Specialisation is being added to Java/JVM as part of project Valhalla. I don’t think it makes sense for Kotlin to duplicate that. It’d be better to wait for the wider Java ecosystem to standardise value types and then integrate with that.
```kotlin
package inline.benchmark

import org.openjdk.jmh.annotations.*

@State(Scope.Benchmark)
open class InlineVsNoInlineBenchmark {

    var counter = 0

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    fun benchmarkInline() {
        // `invokeInline` and `addInline` bodies are inlined here by the compiler, no boxing.
        counter = invokeInline(::addInline, counter)
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    fun benchmarkNoInline() {
        // No inlining by the compiler at all; boxing and other overhead.
        counter = invokeNoInline(::addNoInline, counter)
    }

    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    fun benchmarkSpecialized() {
        // No inlining of `invokeSpecialized` by the compiler; inlined by the JVM, no boxing.
        counter = invokeSpecialized(counter)
    }

    @TearDown
    fun tearDown() {
        if (counter == 0) print("never ;)")
    }
}

inline fun <P, R> invokeInline(f: (P, Int) -> R, p: P): R = f(p, 12345)

fun <P, R> invokeNoInline(f: (P, Int) -> R, p: P): R = f(p, 12345)

// Emulates specialization; `addNoInline` body is inlined by the JVM here.
fun invokeSpecialized(p: Int): Int = addNoInline(p, 12345)

// This may be a pretty complicated function which calls other inline functions,
// which in turn call more inline functions. So in the source it may look simple
// and lightweight, like `invokeInline(::add, counter)`, but under the hood it may
// translate into very massive byte/machine code.
@Suppress("NOTHING_TO_INLINE")
inline fun addInline(a: Int, b: Int) = a + b

fun addNoInline(a: Int, b: Int) = a + b
```
Specialisation is being added to Java/JVM as part of project Valhalla. I don’t think it makes sense for Kotlin to duplicate that. It’d be better to wait for the wider Java ecosystem to standardise value types and then integrate with that.
As always with Java evolution, we have to wait another 5 years until Valhalla is in JVMs. BTW, inlining is also a questionable feature if we get Valhalla in the foreseeable future: it will be covered by specialization and inlining done by the JVM.
My main point is that I don’t want to estimate the probable bytecode size and its impact on performance every time I use inline functions. As we can see from the benchmarks, there is no alternative to them from a performance perspective, except manual specialization, but then why should I use Kotlin at all?
@actual Please share a few examples of real-life “generic inline functions” you have been talking about.
The problem with boxing is mainly in data structures, not in functions. This is what Project Valhalla solves. I suspect your use cases may have some other workarounds.
As you can see, there is no “unnecessary object creation with lambda functions”: a single static function instance is used, but there is boxing/unboxing of ints. Where are the data structures you are talking about?
My real-life example is a generic parsing function which receives three functions as parameters and orchestrates them: the first produces the elements, the second tokenizes them, and the third receives tokens and produces the parsing result.
Of course, I want to use inlining to gain maximum performance. While these three functions are small, everything is fine, but when they grow bigger and start to use other inline functions, the generated bytecode grows bigger and bigger. At some point that will harm performance instead of improving it, since all the generated bytecode is completely exclusive to this single invocation.
I might switch to a no-inline implementation, but I would lose performance because of the boxing/unboxing of primitives that are passed to and returned from the functions. To avoid this I have to manually optimize/specialize all four functions to factor out the generic parameters which may accept primitive types. That is a perfect job for the compiler, but not for a human being.
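A hedged sketch of what that manual specialization looks like in practice (all names are invented for illustration, not from the original code):

```kotlin
// Hypothetical generic pipeline. With `inline` everything is copied into
// the call site; without it, the Int-typed stages box on every step.
inline fun <E, T, R> parse(
    produce: () -> E?,
    tokenize: (E) -> T,
    reduce: (R, T) -> R,
    initial: R
): R {
    var result = initial
    while (true) {
        val e = produce() ?: return result
        result = reduce(result, tokenize(e))
    }
}

// Hand-specialized copy for one concrete use: summing the digits of a
// string. All signatures stay primitive, so no boxing even without
// inlining -- but every new combination of stages needs another copy.
fun sumDigits(s: String): Int {
    var i = 0
    var result = 0
    while (i < s.length) {
        val c = s[i++]      // produce
        val digit = c - '0' // tokenize
        result += digit     // reduce
    }
    return result
}

fun main() {
    var i = 0
    val viaGeneric = parse(
        produce = { if (i < "12345".length) "12345"[i++] else null },
        tokenize = { c -> c - '0' },
        reduce = { acc, d -> acc + d },
        initial = 0
    )
    println(viaGeneric)
    println(sumDigits("12345"))
}
```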
It’s not production ready

The benchmark code above demonstrates the problem. Scattering the same problem over more lines of code will not solve it, I think.
In this case, since it is all static, it can be done with a single instance. This will probably be different for instance methods, since those need an object per instance.