Can I make this algorithm faster?

I have this code:

import java.math.BigInteger
import java.util.stream.LongStream
import kotlin.jvm.optionals.getOrNull // Optional.getOrNull() extension from the Kotlin stdlib
import kotlin.system.measureTimeMillis

fun main() {
    measureTimeMillis {
        LongStream.rangeClosed(1, 1_000_000)
            .parallel()
            .mapToObj(BigInteger::valueOf)
            .reduce(BigInteger::parallelMultiply) // parallelMultiply needs Java 19+
            .getOrNull()
    }.run {
        println(this)
    }
}

It calculates the factorial in about 900 ms. Today I’ve seen similar code in Rust that takes about 80 ms. Is there some way to make my code faster? Maybe some external libraries or whatever…

First of all, use a proper benchmarking tool like JMH. Second, I would try a classic loop, although the time is probably spent mostly in the multiplications, so the stream overhead may not matter.
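
For reference, a minimal JMH benchmark could look something like this (a sketch, not from this thread; it assumes JMH is on the classpath, and the class and method are open so JMH can subclass them):

import org.openjdk.jmh.annotations.*
import java.math.BigInteger
import java.util.concurrent.TimeUnit

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@Fork(1)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
open class FactorialBenchmark {
    // Returning the result stops the JIT from eliminating the computation.
    @Benchmark
    open fun loopFactorial(): BigInteger {
        var product = BigInteger.ONE
        for (i in 1L..100_000L) {
            product = product.multiply(BigInteger.valueOf(i))
        }
        return product
    }
}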

1 Like

Having not profiled it at all myself… I kind of suspect that the slowness is all the allocations of new BigIntegers. Assuming in Rust you can do it with primitive types, I think that’s where there’d be a big difference.

1 Like

I tried replacing it with coroutines and async, but it became slower (possibly because it can’t be optimized as well as the Stream is). I don’t use this code in any of my projects; I just got curious whether I can get at least within a factor of two of the Rust time in this situation.

I don’t think you’re ever going to outdo Rust for speed in Kotlin, unless you use the native compiler. Since Kotlin (by default) compiles down to Java bytecode, it’s still running on the JVM and not running natively, whereas Rust compiles down to native machine code. The JVM can definitely optimise some things to be hecka fast, but I don’t know if it can ever outdo native code execution.

1 Like

EDIT: It has been pointed out that my code is wrong and actually continually multiplies by zero which is why it’s so fast.

Ok, I actually tried broot’s suggestion, and the difference is staggering:

import java.math.BigInteger
import java.util.stream.LongStream
import kotlin.jvm.optionals.getOrNull
import kotlin.system.measureTimeMillis

fun main() {
    measureTimeMillis {
        LongStream.rangeClosed(1, 1_000_000)
            .parallel()
            .mapToObj(BigInteger::valueOf)
            .reduce(BigInteger::parallelMultiply)
            .getOrNull()
    }
        .also { println(it) } // 5877

    var sum = BigInteger.ZERO // bug: stays zero forever; should be BigInteger.ONE (see EDIT above)

    measureTimeMillis {
        for (i in 1L..1_000_000L) {
            sum *= BigInteger.valueOf(i)
        }
    }
        .also { println(it) } // 28
}

So idk if you need to have the LongStream and the processing pipeline, but if you don’t, then do it as a loop. SO FAST.

2 Likes

I believe Rust has zero-overhead iterators or something like that, so the “similar” code in Rust actually compiles to a loop lol

2 Likes

Oh I only just noticed that this is in the Native section of the forum… does that mean you’ve been using Kotlin/Native the whole time? I did my test using Kotlin on the JVM, whoops. :stuck_out_tongue:

1 Like

This has to be JVM, as Kotlin/Native doesn’t have BigInteger.

1 Like

The difference is staggering because you start with

var sum = BigInteger.ZERO

So every iteration leaves sum at zero. And multiplying by zero is very fast! If you use BigInteger.ONE, you will find that the parallel version is faster.
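
In code, the corrected loop would be something like this sketch:

import java.math.BigInteger
import kotlin.system.measureTimeMillis

fun main() {
    var product = BigInteger.ONE // start from ONE, not ZERO
    measureTimeMillis {
        for (i in 1L..1_000_000L) {
            product *= BigInteger.valueOf(i)
        }
    }.also { println(it) }
}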

1 Like

LOL. Fair point. Let me try that again. xD

EDIT: Ok, yep, when you’re not multiplying by zero, it’s actually a lot different: 5463 ms for the parallel stream and 393508 ms for the loop, respectively. I stand very much corrected. :slight_smile:

1 Like

However, I naively expected the difference to be about a factor of 8 (because I’ve got 8 cores). It seems to be a lot more than that!

Disclaimer: I know almost nothing about how languages implement types like BigInteger nor about optimal algorithms for factorials. I’m making some educated guesses, and I may be more or less wrong.

I doubt we can hack the time by using different concurrency models, coroutines, etc. I suspect the real difference is in the internal implementation of big-integer multiplication. There may be many approaches to such numbers, and Java and Rust may use entirely different ones. It is even possible the Rust one isn’t generally faster, just better for this specific case. For example, 1M! has nearly a million trailing zero bits (a quick check follows below). If we used an optimization where we store only the “real” bits plus a bit-shift count, it should be significantly faster for calculating factorials. But this is specific to factorials, or to cases where we only multiply; if we also added numbers, we wouldn’t keep these trailing zeros.
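
A quick check of that trailing-zeros claim (a sketch using Legendre’s formula, which says the power of 2 dividing n! is n minus the number of set bits of n):

// v2(n!) = n - popcount(n), by Legendre’s formula
fun trailingZeroBitsOfFactorial(n: Long): Long = n - n.countOneBits()

fun main() {
    println(trailingZeroBitsOfFactorial(1_000_000L)) // 999993
}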

Second, my gut feeling says the performance will change significantly depending on the multiplication order. Going sequentially is probably the worst, because we get a huge accumulator and we always multiply this huge number over and over again. I suspect it should generally be better to multiply numbers of similar sizes; a sketch of that idea follows below.
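
As a sketch of that idea (not code from this thread), a recursive product tree always multiplies subproducts of similar magnitude; parallelMultiply assumes Java 19+:

import java.math.BigInteger

// Split the range in half, compute each half’s product, then multiply
// the two halves, so operands at each level have similar sizes.
fun product(lo: Long, hi: Long): BigInteger = when {
    lo > hi -> BigInteger.ONE
    lo == hi -> BigInteger.valueOf(lo)
    else -> {
        val mid = (lo + hi) / 2
        product(lo, mid).parallelMultiply(product(mid + 1, hi))
    }
}

fun factorial(n: Long): BigInteger = product(1, n)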

I tried using a PriorityQueue to always multiply the two smallest numbers first:

import java.math.BigInteger
import java.util.PriorityQueue

// Min-heap: repeatedly multiply the two smallest values and push the
// product back until one number remains.
val queue = PriorityQueue<BigInteger>(1_000_000)
for (i in 1L..1_000_000L) {
    queue += BigInteger.valueOf(i)
}
while (queue.size > 1) {
    queue += queue.remove().parallelMultiply(queue.remove())
}
queue.remove()

This improved on your parallel stream implementation: from 11 s down to 3 s. However, I suspect the PriorityQueue itself adds a lot of overhead, because even with O(log n) heap operations we still have to compare a lot of very big numbers.

I tried a more naive implementation: simply put the most recent result at the end of a queue:

import java.math.BigInteger

// FIFO: multiply the two front values and append the product at the back,
// so first 1*2, 3*4, ..., then pairs of those products, and so on.
val queue = ArrayDeque<BigInteger>(1_000_000)
for (i in 1L..1_000_000L) {
    queue += BigInteger.valueOf(i)
}
while (queue.size > 1) {
    queue.addLast(queue.removeFirst().parallelMultiply(queue.removeFirst()))
}
queue.removeFirst()

With this naive implementation I got down to 1.8 s. I suspect this could be improved much more by shuffling the initial data in a smart way; for example, the code above multiplies 1*2, then 3*4, and eventually 999999*1000000.

I also tried Guava’s BigIntegerMath.factorial, which supposedly uses some smart, fast algorithm, but… it took 3.2 s. I don’t know why it is slower than my naive implementation; maybe because it is single-threaded.
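
For reference, the Guava call is just this (assuming com.google.guava:guava is on the classpath):

import com.google.common.math.BigIntegerMath

val f = BigIntegerMath.factorial(1_000_000)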

I suspect the above findings are also the reason why parallel implementations are more than 8 times faster than sequential ones: by accident, they multiply in a more optimal order.

Anyway, this is not exactly an answer to your initial question. I don’t see how we can improve the performance of this specific algorithm you use. By changing the approach we can get closer to the Rust result, but… if you really used a similar algorithm in Rust, then I wouldn’t be surprised if, after reimplementing the above ideas in Rust, you got even below 80 ms.

2 Likes

On the other hand, my second gut feeling says that maybe multiplying numbers of similar sizes is actually the slowest, and it is better to pair very big numbers with very small ones. Multiplication requires multiplying each digit/bit with each digit/bit, so with similar-sized operands the cost goes roughly with the square: 1000*2 is smaller than 45*45. I don’t know.

Concurrency would make a difference though, yeah? If you have an 8 core CPU, and you single thread this, you’re only using 1 core for all the calculations. If you use 8 threads, in theory you can get all 8 cores multiplying, for an 8x speed boost. (Or, as we’ve seen, a MUCH higher speed boost, probably because of the different multiplication order you talked about)

If you just want to speed up the calculation of the factorial, this page is your friend: Fast Factorial Functions

Actually, Java can deliver the same or even better performance than Rust thanks to runtime bytecode compilation; check out how the HotSpot technology works.
In practice that means you need to warm up your code a thousand times and only then start benchmarking. You might see a 1000x difference compared to the interpreted code.
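
A crude manual warm-up looks something like this (a sketch with a smaller n so the warm-up loop finishes quickly; JMH automates exactly this):

import java.math.BigInteger
import kotlin.system.measureTimeMillis

fun work(): BigInteger {
    var product = BigInteger.ONE
    for (i in 1L..10_000L) product = product.multiply(BigInteger.valueOf(i))
    return product
}

fun main() {
    repeat(1_000) { work() } // give HotSpot time to JIT-compile the hot path
    measureTimeMillis { work() }.also { println("warmed: $it ms") }
}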

2 Likes