Can I make this algorithm faster?

I have this code:

import java.math.BigInteger
import java.util.stream.LongStream
import kotlin.jvm.optionals.getOrNull // Optional.getOrNull() extension from the Kotlin stdlib
import kotlin.system.measureTimeMillis

fun main() {
    measureTimeMillis {
        LongStream.rangeClosed(1, 1_000_000)
            .parallel()
            .mapToObj(BigInteger::valueOf)
            .reduce(BigInteger::parallelMultiply) // parallelMultiply needs Java 19+
            .getOrNull()
    }.run {
        println(this)
    }
}

It calculates the factorial in about 900 ms. Today I’ve seen similar code in Rust that takes about 80 ms. Is there some way to make my code faster? Maybe some external libraries or whatever…

First of all, use a proper benchmarking tool like JMH. Second, I would try a classic loop, although the time is probably spent mostly in the multiplications, so the stream overhead may not matter.
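
For reference, a minimal JMH benchmark could look something like this (a sketch, not from this thread; it assumes JMH is on the classpath, and the class and method are open so JMH can subclass them):

import org.openjdk.jmh.annotations.*
import java.math.BigInteger
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
open class FactorialBenchmark {
    // Returning the result stops the JIT from eliminating the computation.
    @Benchmark
    open fun loopFactorial(): BigInteger {
        var product = BigInteger.ONE
        for (i in 1L..100_000L) {
            product = product.multiply(BigInteger.valueOf(i))
        }
        return product
    }
}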

1 Like

Having not profiled it at all myself… I kind of suspect that the slowness is all the allocations of new BigIntegers. Assuming in Rust you can do it with primitive types, I think that’s where there’d be a big difference.

1 Like

I tried replacing it with coroutines and async, but it became slower (possibly because it can’t be optimized as well as the Stream is). I don’t use this code in any of my projects; I just got curious whether I can get at least within a factor of two of the Rust time in this situation.

I don’t think you’re ever going to outdo Rust for speed in Kotlin, unless you use the native compiler. Since Kotlin (by default) compiles down to Java bytecode, it’s still running on the JVM and not running natively, whereas Rust compiles down to native machine code. The JVM can definitely optimise some things to be hecka fast, but I don’t know if it can ever outdo native code execution.

1 Like

EDIT: It has been pointed out that my code is wrong and actually continually multiplies by zero which is why it’s so fast.

Ok, I actually tried broot’s suggestion, and the difference is staggering:

import java.math.BigInteger
import java.util.stream.LongStream
import kotlin.jvm.optionals.getOrNull
import kotlin.system.measureTimeMillis

fun main() {
    measureTimeMillis {
        LongStream.rangeClosed(1, 1_000_000)
            .parallel()
            .mapToObj(BigInteger::valueOf)
            .reduce(BigInteger::parallelMultiply)
            .getOrNull()
    }
        .also { println(it) } // 5877

    var sum = BigInteger.ZERO // bug: stays zero forever; should be BigInteger.ONE (see EDIT above)

    measureTimeMillis {
        for (i in 1L..1_000_000L) {
            sum *= BigInteger.valueOf(i)
        }
    }
        .also { println(it) } // 28
}

So idk if you need to have the LongStream and the processing pipeline, but if you don’t, then do it as a loop. SO FAST.

2 Likes

I believe Rust has zero-overhead iterators or something like that, so the “similar” code in Rust actually compiles to a loop lol

2 Likes

Oh I only just noticed that this is in the Native section of the forum… does that mean you’ve been using Kotlin/Native the whole time? I did my test using Kotlin on the JVM, whoops. :stuck_out_tongue:

1 Like

This has to be JVM, as Kotlin/Native doesn’t have BigInteger.

1 Like

The difference is staggering because you start with

var sum = BigInteger.ZERO

So every iteration leaves sum at zero. And multiplying by zero is very fast! If you use BigInteger.ONE, you will find that the parallel version is faster.
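
In code, the corrected loop would be something like this sketch:

import java.math.BigInteger
import kotlin.system.measureTimeMillis

fun main() {
    var product = BigInteger.ONE // start from ONE, not ZERO
    measureTimeMillis {
        for (i in 1L..1_000_000L) {
            product *= BigInteger.valueOf(i)
        }
    }.also { println(it) }
}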

1 Like

LOL. Fair point. Let me try that again. xD

EDIT: Ok, yep, when you’re not multiplying by zero, it’s actually a lot different: 5463 ms for the parallel stream and 393508 ms for the loop, respectively. I stand very much corrected. :slight_smile:

1 Like

However, I naively expected the difference to be about a factor of 8 (because I’ve got 8 cores). It seems to be a lot more than that!

Disclaimer: I know almost nothing about how languages implement types like BigInteger nor about optimal algorithms for factorials. I’m making some educated guesses, and I may be more or less wrong.

I doubt we can hack the time by using different concurrency models, coroutines, etc. I suspect the real difference is in the internal implementation of big-integer multiplication. There may be many approaches to such numbers, and Java and Rust may use entirely different ones. It is even possible the Rust one isn’t generally faster, just better for this specific case. For example, 1M! has nearly a million trailing zero bits (a quick check follows below). If we used an optimization where we store only the “real” bits plus a bit-shift count, it should be significantly faster for calculating factorials. But this is specific to factorials, or to cases where we only multiply; if we also added numbers, we wouldn’t keep these trailing zeros.
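
A quick check of that trailing-zeros claim (a sketch using Legendre’s formula, which says the power of 2 dividing n! is n minus the number of set bits of n):

// v2(n!) = n - popcount(n), by Legendre’s formula
fun trailingZeroBitsOfFactorial(n: Long): Long = n - n.countOneBits()

fun main() {
    println(trailingZeroBitsOfFactorial(1_000_000L)) // 999993
}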

Second, my gut feeling says the performance will change significantly depending on the multiplication order. Going sequentially is probably the worst, because we get a huge accumulator and we always multiply this huge number over and over again. I suspect it should generally be better to multiply numbers of similar sizes; a sketch of that idea follows below.
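
As a sketch of that idea (not code from this thread), a recursive product tree always multiplies subproducts of similar magnitude; parallelMultiply assumes Java 19+:

import java.math.BigInteger

// Split the range in half, compute each half’s product, then multiply
// the two halves, so operands at each level have similar sizes.
fun product(lo: Long, hi: Long): BigInteger = when {
    lo > hi -> BigInteger.ONE
    lo == hi -> BigInteger.valueOf(lo)
    else -> {
        val mid = (lo + hi) / 2
        product(lo, mid).parallelMultiply(product(mid + 1, hi))
    }
}

fun factorial(n: Long): BigInteger = product(1, n)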

I tried using a PriorityQueue to always multiply the two smallest numbers first:

import java.math.BigInteger
import java.util.PriorityQueue

// Min-heap: repeatedly multiply the two smallest values and push the
// product back until one number remains.
val queue = PriorityQueue<BigInteger>(1_000_000)
for (i in 1L..1_000_000L) {
    queue += BigInteger.valueOf(i)
}
while (queue.size > 1) {
    queue += queue.remove().parallelMultiply(queue.remove())
}
queue.remove()

This improved on your parallel stream implementation: from 11 s down to 3 s. However, I suspect the PriorityQueue itself adds a lot of overhead, because even with O(log n) heap operations we still have to compare a lot of very big numbers.

I tried a more naive implementation: simply put the most recent result at the end of a queue:

import java.math.BigInteger

// FIFO: multiply the two front values and append the product at the back,
// so first 1*2, 3*4, ..., then pairs of those products, and so on.
val queue = ArrayDeque<BigInteger>(1_000_000)
for (i in 1L..1_000_000L) {
    queue += BigInteger.valueOf(i)
}
while (queue.size > 1) {
    queue.addLast(queue.removeFirst().parallelMultiply(queue.removeFirst()))
}
queue.removeFirst()

With this naive implementation I got down to 1.8 s. I suspect this could be improved much more by shuffling the initial data in a smart way; for example, the code above multiplies 1*2, then 3*4, and eventually 999999*1000000.

I also tried Guava’s BigIntegerMath.factorial, which supposedly uses some smart, fast algorithm, but… it took 3.2 s. I don’t know why it is slower than my naive implementation; maybe because it is single-threaded.
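
For reference, the Guava call is just this (assuming com.google.guava:guava is on the classpath):

import com.google.common.math.BigIntegerMath

val f = BigIntegerMath.factorial(1_000_000)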

I suspect the above findings are also the reason why parallel implementations are more than 8 times faster than sequential ones: by accident, they multiply in a more optimal order.

Anyway, this is not exactly an answer to your initial question. I don’t see how we can improve the performance of this specific algorithm you use. By changing the approach we can get closer to the Rust result, but… if you really used a similar algorithm in Rust, then I wouldn’t be surprised if, after reimplementing the above ideas in Rust, you got even below 80 ms.

2 Likes

On the other hand, my second gut feeling says that maybe multiplying numbers of similar sizes is actually the slowest, and it is better to pair very big numbers with very small ones. Multiplication requires multiplying each digit/bit with each digit/bit, so with similar-sized operands the cost goes roughly with the square: 1000*2 is smaller than 45*45. I don’t know.

Concurrency would make a difference though, yeah? If you have an 8 core CPU, and you single thread this, you’re only using 1 core for all the calculations. If you use 8 threads, in theory you can get all 8 cores multiplying, for an 8x speed boost. (Or, as we’ve seen, a MUCH higher speed boost, probably because of the different multiplication order you talked about)

If you just want to speed up the calculation of the factorial, this page is your friend: Fast Factorial Functions

Actually, Java can deliver the same or even better performance than Rust thanks to runtime bytecode compilation; check out how the HotSpot technology works.
In practice that means you need to warm up your code a thousand times and only then start benchmarking. You might see a 1000x difference compared to the interpreted code.
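
A crude manual warm-up looks something like this (a sketch with a smaller n so the warm-up loop finishes quickly; JMH automates exactly this):

import java.math.BigInteger
import kotlin.system.measureTimeMillis

fun work(): BigInteger {
    var product = BigInteger.ONE
    for (i in 1L..10_000L) product = product.multiply(BigInteger.valueOf(i))
    return product
}

fun main() {
    repeat(1_000) { work() } // give HotSpot time to JIT-compile the hot path
    measureTimeMillis { work() }.also { println("warmed: $it ms") }
}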

2 Likes