Could you please stop posting benchmark results without a proper benchmarking environment (JMH)? In this case, you also use bad coroutine code that creates additional channels (channels are the most expensive primitive in coroutines).
I know it’s not optimal to use a channel for such a task, but here all implementations try to use a channel when possible. IMO the key point of a benchmark is to compare the same thing, not to use the best possible solution in each language; why does that not make sense?
And can you please elaborate on how to use JMH to do cross-language benchmarks? Or do you think it is not valid to do so at all?
JVM benchmarks without JMH are not valid at all due to warm-up time and unpredictable deoptimizations.
About channels, they mean different things in different languages. In your case it should be Flow, not Channel, and it should use a lazy mapping operation, not create a new channel. There are a lot of tricks to doing it right. And it does not make any sense to compare “different libraries doing the same thing”, because different libraries do different things and have different optimizations.
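To illustrate the point, here is a minimal hypothetical sketch of one sieve stage in both styles (a simplification for discussion, not the benchmark’s actual code):

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.ExperimentalCoroutinesApi
import kotlinx.coroutines.channels.ReceiveChannel
import kotlinx.coroutines.channels.produce
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.filter

// Channel-chaining style: every sieve stage allocates and synchronizes a new channel.
@OptIn(ExperimentalCoroutinesApi::class)
fun CoroutineScope.filterStage(input: ReceiveChannel<Int>, prime: Int): ReceiveChannel<Int> =
    produce {
        for (n in input) if (n % prime != 0) send(n)
    }

// Flow style: filter is a lazy intermediate operator; no new channel is created.
fun filterStage(input: Flow<Int>, prime: Int): Flow<Int> =
    input.filter { it % prime != 0 }
```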
Also, one thing that the author of the benchmark-game is missing is that different implementations of the VM could do things differently.
In your case it should be Flow, not Channel, and it should use a lazy mapping operation
Good to know, will learn about it.
due to warm-up time and unpredictable deoptimizations.
I believe other languages with a JIT compiler have the same warm-up penalty, and I don’t think JIT warm-up would contribute to 4-5x slowness and make the result invalid (compared to Go, which has no JIT).
different implementations of the VM could do things differently
That is true, but at the same time people want to see comparisons of different aspects of languages or VMs, e.g. GC via the binary-trees benchmark. I would rather point out what is obviously wrong and suggest how to improve it instead of saying ‘whatever, just don’t do it’, which adds no real value to the problem.
It is not only warm-up, it is also deoptimizations.
What I am trying to explain is that you can’t draw any conclusions from micro-benchmarks. They are always wrong. To start with, different libraries and different runtimes optimize things differently, so it always happens that you compare something one runtime is optimized to do well against something another runtime should not do at all. Also, you need to remember to compare code written by people with a similar level of experience.
A good programmer will write good code in any language.
Comparing coroutines with Loom is another level of mistake, because coroutines are an API for asynchronous programming and Loom is an implementation of parallel execution. It is like the proverbial apples and oranges. Coroutines could be used on top of Loom.
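For what it’s worth, here is one concrete sense in which coroutines can sit on top of Loom — a minimal sketch assuming JDK 21+ and kotlinx.coroutines:

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import java.util.concurrent.Executors

// A coroutine dispatcher backed by virtual threads.
// Executors.newVirtualThreadPerTaskExecutor() is JDK 21 API;
// asCoroutineDispatcher() comes from kotlinx.coroutines.
fun main() {
    Executors.newVirtualThreadPerTaskExecutor().asCoroutineDispatcher().use { loom ->
        runBlocking {
            launch(loom) {
                println("running on ${Thread.currentThread()}") // a virtual thread
            }
        }
    }
}
```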
different libraries and different runtimes optimize things differently.
Totally agree, but how do you measure that? Just blindly take what is being advertised? Do micro-benchmarks not reflect any aspect of this?
Loom is an implementation of parallel execution.
That’s something new to me. AFAIK, people have been associating Loom with Java’s concurrency support (by connecting it to coroutines, like this post does); do you mean that parallelization is the only goal of Project Loom?
Coroutines could be used on top of Loom
That is true, but I don’t quite get the logic of why Loom itself cannot be tested or benchmarked for concurrency tasks while being able, as a foundation, to power a better coroutine implementation.
Why not? It could be far more.
Some types of JVM start off interpreting code; then, once a method has been called often enough (or accounted for enough time), it gets compiled in a background thread and used once available. If it gets called much more often, the JVM may decide to recompile it with heavier optimisations. (It also keeps track of the assumptions made while optimising, in case they change. It can even de-optimise if appropriate.)
And since the difference between interpreted code and heavily-optimised compiled code can easily be a factor of 100 or more, I’d have no trouble believing in a factor of 4–5.
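You can watch this happen with naive hand-rolled timing — which is, incidentally, exactly the kind of measurement JMH exists to replace (an illustrative sketch, not a valid benchmark):

```kotlin
// Repeatedly time the same work; early rounds typically run interpreted,
// later rounds get much faster once the JIT has compiled the hot loop.
fun work(): Long {
    var sum = 0L
    for (i in 1..1_000_000) sum += (i * 31) % 7
    return sum
}

fun main() {
    var blackhole = 0L
    repeat(20) { round ->
        val start = System.nanoTime()
        blackhole += work()
        val micros = (System.nanoTime() - start) / 1_000
        println("round $round: $micros us")
    }
    println(blackhole) // consume results so the JIT cannot eliminate the work
}
```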
I was talking about a very specific micro-benchmark case here, and if you look at the code, there are only 2 or 3 simple functions within 10 lines. PyPy with a JIT runs ~5x faster than CPython, and C#/Kotlin-JVM with a JIT run ~2-3x faster than the Loom JVM; how can the Loom early-access (EA) JVM have such a big JIT performance regression?
There is not much information in a bad measurement.
Benchmarking is hard, especially on JVM.
Agree, but at the same time I feel like many people misunderstand the goal of JMH.
Think about why it’s called a Microbenchmark Harness. The whole reason it was created is that when you benchmark functions that run in milliseconds or nanoseconds, you can never get result estimators like the standard deviation to look reasonable because of JIT overhead, so you have to introduce some warm-up mechanism. But what about a program that runs for 5-10 s? Is it still ‘micro’? Aren’t estimators like the stddev already reasonable without warm-up?
Remember what the 95% confidence interval of a normal distribution is? It depends on the μ and σ you measure, and the warm-up mechanism in JMH exists to serve exactly that; JMH is not reasonable in itself without the mathematics.
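For reference, this is roughly what that warm-up machinery looks like (a minimal sketch; the annotations are the real org.openjdk.jmh API, the iteration counts arbitrary):

```kotlin
import org.openjdk.jmh.annotations.*
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5)       // discarded rounds that let the JIT settle
@Measurement(iterations = 10) // measured rounds that produce the μ and σ
@Fork(1)
open class PrimeCheck {
    @Benchmark
    fun isPrime97(): Boolean = (2 until 97).none { 97 % it == 0 }
}
```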
Statements like ‘any JVM benchmark without JMH is invalid’ sound more invalid to me.
If one still argues that JIT overhead can be non-trivial in this case, I assure you that a language whose JIT compiler introduces 1s+ of overhead for such a simple program (and maybe JIT technology itself) would have been thrown away long ago, and for good. That is literally ‘STOP THE WORLD’, and no one would care about GC overhead anymore.
The question is what you are trying to measure with those benchmarks. If you are trying to win some kind of competition about which runtime (not a language!) will run a useless piece of code faster from a cold start - OK, you can do anything you want, but please state your intentions clearly and don’t confuse people. If you want to compare how those runtimes will treat a real-life program, which is usually much more complicated than a micro-benchmark and will never be short-lived (save for the console-utils case, which we do not discuss here), you need to do it properly.
Also, you completely missed my primary point. You are not comparing the same things; you are comparing things that you think will provide the same result. There are a lot of ways to get the same results, and some of them are faster than others. Your exact code for coroutines is not correct. And it is not your problem; it is an extremely complex task to provide correct benchmarks.
The final remark is about GC. People like to talk about GC overhead, but people rarely measure it. In most real-life cases it is less than 1%. Without understanding this and a lot of other things (not only about the JVM, but about native compilation as well), all your conclusions are good only for click-bait.
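For what it’s worth, the collectors’ own accounting can be read directly — a rough sketch using the standard java.lang.management API (it reports accumulated collection time, not every GC cost):

```kotlin
import java.lang.management.ManagementFactory

fun main() {
    val start = System.nanoTime()
    // ... run the workload of interest here ...
    val wallMillis = (System.nanoTime() - start) / 1_000_000

    val gcMillis = ManagementFactory.getGarbageCollectorMXBeans()
        .sumOf { it.collectionTime } // cumulative collection time in ms
    println("GC: $gcMillis ms out of $wallMillis ms wall time")
}
```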
If you mean Flow in Kotlin, that should definitely be used when comparing to Ruby fibers or Python async generators, but when comparing to Go’s solution with channels, what’s the point of using Flow? Isn’t that comparing apples to oranges instead? The point is to compare the same infrastructure for concurrency support, not to make the fastest program; please don’t get it wrong.
And actually, all the criticism you’ve made is based on a very wrong understanding of the goal of my benchmark in the first place.
It’s not about building the fastest prime-finding program, but about solving the problem in a particular way, where massive numbers of coroutines are constructed and communicate with each other, thus measuring the performance of massive coroutine/virtual-thread scheduling and communication.
Please don’t get it wrong any more.
I wouldn’t expect the Loom implementation to be optimized for performance yet. I wouldn’t even expect it to perform well for some years to come. I assume it is a huge undertaking and a refactoring of the JVM core.
On the other hand, once it starts performing better, it seems like Kotlin coroutines will automatically benefit from it, because coroutines can then be built on top of JVM native concepts.
Kotlin Coroutines and Java Loom are both technologies for managing concurrency, but they have different features and are used in different ways. Kotlin Coroutines are already available and are designed to make it easier to write asynchronous, non-blocking code, while Java Loom is an upcoming feature that is expected to improve performance and scalability by using lightweight virtual threads instead of platform threads.
A relevant discussion on Talking Kotlin: “Will Loom Kill Kotlin Coroutines?” (Talking Kotlin #120, on YouTube).
And what is the real difference in your own words?
Same can be said about Loom.
Same can be said about Kotlin coroutines.
I think both concepts are pretty much the same or very similar. The Kotlin authors had to implement coroutines on top of the JVM, which doesn’t support them, so they are implemented at the bytecode level, using entirely new APIs. Loom is built into the JVM itself, and it allows partially reusing the threading API. I suspect Loom could potentially provide slightly better performance, as Kotlin coroutines had to use tricks and workarounds, while Loom can work more “natively”. Conceptually, Loom could be compared to Kotlin code where all functions are automatically suspend and all blocking functions in the stdlib were transparently replaced with their suspending equivalents. This is not exactly what Loom is, but from the perspective of someone familiar with Kotlin, this is probably the easiest way to describe Loom.
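To make that comparison concrete, a minimal sketch of the two models side by side (assuming kotlinx.coroutines and a JDK 21+ runtime):

```kotlin
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Kotlin coroutines: suspension is explicit in the function's signature.
suspend fun coroutineStyle(): String {
    delay(100) // suspends the coroutine without blocking a thread
    return "hello"
}

fun main() {
    runBlocking {
        launch { println(coroutineStyle()) }
    }

    // Loom: plain blocking code, but on a cheap virtual thread
    // (Thread.ofVirtual() is the JDK 21 virtual-thread API).
    Thread.ofVirtual().start {
        Thread.sleep(100) // blocks only this virtual thread; the carrier is released
        println("hello")
    }.join()
}
```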
But coroutines in Kotlin are much more than merely a basic concurrency building block. Coroutines introduce their own abstract concepts like jobs, contexts or dispatchers; they provide their own API which is very idiomatic and fits Kotlin nicely; they provide their own features like structured concurrency, cancellation, flows, etc. And they work across multiple platforms, not only the JVM. So even if Loom eventually turns out to be a huge success, and it no longer makes sense for the Kotlin authors to maintain their own scheduling based on platform threads and they switch to Loom entirely, coroutines will still add value here. Some people will probably turn from Kotlin coroutines to Loom; others will still prefer coroutines.
(A naïve question that I don’t think has been addressed in this topic yet:)
Is there likely to be any scope for changing the implementation of coroutines to use Loom?
If so, only the Kotlin compiler would need to change; then Kotlin code could gain some of the benefits of Loom without any changes.
(Of course, that would require Kotlin to be compiled for a Java version known to support Loom; to support older Java versions, it would have to be able to use the existing approach for the foreseeable future, too.)
I’m not sure if I understood you correctly, but if you meant compiling Kotlin suspend functions into regular bytecode when targeting Loom, then I’m not sure if this is even possible. Even when using Loom, there are still virtual and platform threads, and the compiler can’t know who will invoke a suspend function. And if running inside a platform thread, we still need to use the current approach. Another problem is passing a CoroutineContext, although that could probably be handled by a ThreadLocal. Another problem is compatibility with existing code not targeting Loom.
Of course, Kotlin could introduce some kind of modifier like a “virtual-thread-only” suspend function. From the perspective of Kotlin coroutines it would be a suspend function, but it would be compiled into regular code, and it would run some checks to verify that the thread is virtual.
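A sketch of what that check might compile down to (hypothetical design; only Thread.isVirtual() is real JDK 21 API):

```kotlin
// A hypothetical "virtual-thread-only" function: regular blocking code,
// guarded by a runtime check that it runs on a virtual thread.
fun blockingStyleFetch(): String {
    check(Thread.currentThread().isVirtual) {
        "must run on a virtual thread; blocking a platform thread would be costly"
    }
    Thread.sleep(50) // safe: blocks only this virtual thread
    return "data"
}

fun main() {
    Thread.ofVirtual().start { println(blockingStyleFetch()) }.join()
}
```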
But well… I’m mostly making educated guesses here. I’m definitely not good enough at coroutine internals to see the whole picture. The future looks pretty interesting; finally there is some activity in the JVM world that is worth observing.