Disclaimer: I know almost nothing about how languages implement types like BigInteger, or about optimal algorithms for factorials. I'm making some educated guesses, and I may be more or less wrong.
I doubt we can hack the time by using different concurrency models, coroutines, etc. I suspect the real difference is in the internal implementation of big-integer multiplication. There are many possible approaches to such numbers, and Java and Rust may use entirely different ones. It is even possible the Rust one isn't generally faster, just better for this specific case. For example, 1M! has almost a million trailing zero bits (999,993, by Legendre's formula). If we use an optimization where we store only the "real" bits plus a bit shift, calculating factorials should be significantly faster. But this is specific to factorials, or to cases where we only multiply; if we also add numbers, we wouldn't have these trailing zeros.
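The trailing-zeros idea can be sketched roughly like this (my own illustration of the concept, not how any real library does it): strip the factors of two from each term, accumulate them as a single shift count, and reattach them once at the end.

```kotlin
import java.math.BigInteger

// Illustration only: keep the product of the odd parts and count the stripped
// factors of two separately, then shift once at the end. Every intermediate
// value stays odd, so no multiplication carries trailing zero bits along.
fun factorialShifted(n: Int): BigInteger {
    var odd = BigInteger.ONE
    var shift = 0
    for (i in 2..n) {
        val tz = Integer.numberOfTrailingZeros(i)
        shift += tz                                   // remember the zero bits
        odd = odd.multiply(BigInteger.valueOf((i shr tz).toLong()))
    }
    return odd.shiftLeft(shift)                       // reattach them once
}
```

This alone doesn't change the multiplication order, so it only shaves off the work spent on the zero bits, but it shows why a multiplication-only workload like a factorial can benefit from representation tricks that addition-heavy workloads can't use.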
Second, my gut feeling says the performance changes significantly depending on the multiplication order. Going sequentially is probably the worst: we build a huge accumulator and multiply this huge number over and over again. It should generally be better to multiply numbers of similar sizes.
I tried using a PriorityQueue to always multiply the smallest numbers:
```kotlin
import java.math.BigInteger
import java.util.PriorityQueue

val queue = PriorityQueue<BigInteger>(1_000_000)
for (i in 1L..1_000_000L) {
    queue += BigInteger.valueOf(i)
}
// Repeatedly replace the two smallest numbers with their product.
while (queue.size > 1) {
    queue += queue.remove().parallelMultiply(queue.remove())
}
queue.remove() // the final result
```
This improved from 11s for your parallel stream implementation to 3s. However, I suspect the PriorityQueue itself adds a lot of overhead: even with O(log n) operations, we still have to compare a lot of very big numbers.
I tried a more naive implementation: simply put the most recent result at the end of a queue:
```kotlin
import java.math.BigInteger

val queue = ArrayDeque<BigInteger>(1_000_000)
for (i in 1L..1_000_000L) {
    queue += BigInteger.valueOf(i)
}
// Take two numbers from the front, push their product to the back.
while (queue.size > 1) {
    queue.addLast(queue.removeFirst().parallelMultiply(queue.removeFirst()))
}
queue.removeFirst() // the final result
```
With this naive implementation I got down to 1.8s. I suspect this could be improved much more by shuffling the initial data in a smart way. For example, the above multiplies 1 * 2 first, then 999,999 * 1,000,000.
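One "smart order" that needs no queue at all is a recursive product tree: split the range in half so every multiplication combines two products of roughly equal size. A rough sketch (using plain multiply(); the parallelMultiply() from the snippets above, available since Java 19, could be swapped in):

```kotlin
import java.math.BigInteger

// Illustration: a balanced "product tree". Splitting [lo, hi] in half means
// each multiplication combines two subproducts of similar magnitude, which is
// the ordering the deque-based approach only approximates.
fun product(lo: Long, hi: Long): BigInteger = when {
    lo > hi  -> BigInteger.ONE
    lo == hi -> BigInteger.valueOf(lo)
    else -> {
        val mid = lo + (hi - lo) / 2
        product(lo, mid).multiply(product(mid + 1, hi))
    }
}

fun factorial(n: Long): BigInteger = product(1, n)
```

I haven't benchmarked this variant against the deque version, but it avoids both the comparisons of the PriorityQueue and the deque bookkeeping, and the two halves of each split could also be computed concurrently.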
I also tried Guava's BigIntegerMath.factorial, which supposedly uses some smart and fast algorithm, but it took 3.2s. I don't know why it is slower than my naive implementation; maybe because it is single-threaded.
I suspect the above findings are also the reason why parallel implementations are more than 8 times faster than sequential ones: incidentally, they multiply in a more optimal order.
Anyway, this is not exactly an answer to your initial question. I don't see how we can improve the performance of the specific algorithm you use. By changing the approach we can get closer to the Rust result, but if you really used a similar algorithm in Rust, I wouldn't be surprised if, after reimplementing the above ideas in Rust, you got even lower than 80ms.