[QUESTION] parallel processing like LMAX disruptor

imagine that we have a data processing pipeline consisting of 3 steps:

val finalValue = fn3(fn2(fn1(originalValue)))

however, fn2 is CPU-intensive, so we want to parallelize it:

                    /-> fn2(on thread1) --\
fn1(originalValue) -+-> fn2(on thread2) --+-> fn3(...) -->
                    \-> fn2(on thread3) --/

Which approach is best?

I can think of two approaches:

One is to use two channels to fan out and fan in, but this doesn't preserve the order of the source data (originalValue), so we have to buffer, reorder, and potentially block to provide back pressure… too complex.
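For what it's worth, order can be preserved without explicit reordering by sending `Deferred` results through a single channel and awaiting them in submission order; the channel capacity also gives back pressure for free. A minimal sketch (the stage functions `fn1`/`fn2`/`fn3` and the `process` helper are hypothetical stand-ins):

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

// Hypothetical stage functions, just for illustration.
fun fn1(x: Int) = x + 1
fun fn2(x: Int) = x * 10   // pretend this is the CPU-intensive step
fun fn3(x: Int) = x - 5

// Fan fn2 out over Dispatchers.Default while keeping the source order:
// the channel carries Deferred results in submission order, so the
// consumer just awaits them one by one. The channel capacity bounds
// the number of in-flight tasks, which provides back pressure.
fun process(values: List<Int>): List<Int> = runBlocking {
    val inFlight = Channel<Deferred<Int>>(capacity = 3)
    launch {
        for (v in values) {
            val staged = fn1(v)                                       // sequential stage
            inFlight.send(async(Dispatchers.Default) { fn2(staged) }) // parallel stage
        }
        inFlight.close()
    }
    val out = mutableListOf<Int>()
    for (d in inFlight) out += fn3(d.await())                         // ordered fan-in
    out
}

fun main() {
    println(process(listOf(1, 2, 3, 4, 5))) // [15, 25, 35, 45, 55]
}
```

This keeps the fan-in trivial at the cost of head-of-line blocking: a slow fn2 task delays the results queued behind it.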

Another approach is to use the LMAX Disruptor, a famous Java library, but this adds another dependency and is not Kotlin-friendly. (Similarly, we could use Java streams, but then we have to work with the stream API's rather complex internals for managing parallelism and stopping infinite streams.)
Most importantly, the Disruptor excels at low latency and high throughput, and it adds code/config complexity to achieve that, but not every project needs it. I'm not too concerned about latency, because I may later refactor the code to distribute fn2 not only across CPU cores (threads) but also across a cluster (different machines). A custom data processing pipeline with different performance targets may not be a good fit for Spark/Flink.

I do not think anyone could give you clear advice based on the data in the question, because there are a lot of details you are not talking about, like how time-consuming your tasks are. Let me just give you several notes:

  1. There are no problems with using Java APIs from Kotlin/JVM. You can just add a few extensions and make them quite Kotlin-friendly.
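As an example of what such an extension might look like (this `use` helper is made up, named by analogy with `Closeable.use`; it is not from any library), a single extension function can make a plain Java `ExecutorService` read idiomatically from Kotlin:

```kotlin
import java.util.concurrent.Callable
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors

// Hypothetical helper: run a block against a Java ExecutorService
// and always shut the pool down afterwards, Closeable.use-style.
fun <T> ExecutorService.use(block: (ExecutorService) -> T): T =
    try {
        block(this)
    } finally {
        shutdown()
    }

fun main() {
    val answer = Executors.newFixedThreadPool(2).use { pool ->
        pool.submit(Callable { 21 * 2 }).get()
    }
    println(answer) // 42
}
```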

  2. If you have huge amounts of data and are thinking about distributed computing, you probably should go with Kotlin/kotlin-spark-api (Kotlin bindings and extensions for Apache Spark).

  3. If you are working locally, then almost all the tools are there in the coroutines-core library. Parallel flow processing is not implemented yet because it is still being designed, but it is quite easy to write your own extension that does exactly that, like it is done in kmath.
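Such an extension can be sketched in a few lines. The name `mapParallel` is made up (no such operator ships in coroutines-core); it starts up to `concurrency` transforms ahead of the collector and awaits the `Deferred`s in emission order, so results stay ordered:

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.flow.*

// Hypothetical order-preserving parallel map for Flow. `buffer` lets the
// upstream run ahead in its own coroutine, so up to `concurrency` async
// transforms are in flight while the collector awaits them in order.
fun <T, R> Flow<T>.mapParallel(
    concurrency: Int,
    transform: suspend (T) -> R,
): Flow<R> = flow {
    coroutineScope {
        this@mapParallel
            .map { value -> async(Dispatchers.Default) { transform(value) } }
            .buffer(concurrency)
            .collect { emit(it.await()) }
    }
}

fun main() = runBlocking {
    val squares = (1..5).asFlow().mapParallel(concurrency = 3) { it * it }.toList()
    println(squares) // [1, 4, 9, 16, 25]
}
```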

  4. The DataForge framework was designed for a similar yet much more complicated purpose (it propagates data alongside the analysis metadata). I do not think it is what you need, though; you should stick with coroutines.
