Coroutines for low-latency / zero-GC software

It’s been a few years since I last used coroutines. Nowadays I’m doing low-latency and zero-GC software: essentially systems where, once the application is bootstrapped and the JVM warmed up, no objects are to be created, especially on the critical path.

We (Chronicle Software) often use our own event-loop implementations, which are usually busy-spinning on a CPU.

For some applications, where interaction with an external service amounts to publishing an event and eventually observing another event to read the response, I was wondering whether I could come up with a nice coroutine usage where my event loop can suspend.

However, I have never seen any content on how memory efficient coroutines are. I know that under the hood the compiler will generate possibly the best implementation of the suspending logic (e.g. state machines), but the main question I have is: if the same coroutines keep running on the same underlying thread (I’m not creating or ending any coroutines), do they create any garbage just for their operation?

My intuition says they shouldn’t, but some rough testing shows that they actually do. I’ll experiment further, but I wonder if there are already any documents on this, or anyone who has already looked into it?

At the moment I can’t share the code, but I could put together a simple example later on.
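In the meantime, here is a minimal stdlib-only sketch (kotlin.coroutines intrinsics, no kotlinx.coroutines) of the pattern I have in mind: a suspend fun awaits a response event, and the event loop resumes it on the same thread. `EventSlot` and its members are made-up names for illustration.

```kotlin
import kotlin.coroutines.*
import kotlin.coroutines.intrinsics.*

// Illustrative one-shot "await" primitive: parks the continuation in a field
// so the (single-threaded) event loop can resume it when the response arrives.
class EventSlot<T> {
    private var cont: Continuation<T>? = null

    suspend fun await(): T = suspendCoroutineUninterceptedOrReturn { c ->
        cont = c                 // store the continuation; no dispatch, no interception
        COROUTINE_SUSPENDED      // tell the caller we actually suspended
    }

    fun publish(value: T) {
        val c = cont ?: error("nothing is awaiting")
        cont = null
        c.resume(value)          // resumes synchronously on the calling thread
    }
}

fun main() {
    val slot = EventSlot<Int>()
    var result = -1

    // Start a coroutine whose body suspends until the "response event" arrives.
    suspend { result = slot.await() + 1 }
        .startCoroutine(Continuation(EmptyCoroutineContext) { it.getOrThrow() })

    slot.publish(41)   // the event loop observes the response and resumes
    println(result)    // prints 42
}
```

Even this tiny setup is where I’d start measuring: the compiler-generated state machine behind the suspend lambda is exactly the kind of per-invocation object I’m asking about.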


if the same coroutines keep running on the same underlying thread (I’m not creating or ending any coroutines), do they create any garbage just for their operation?

Every suspend function is effectively a coroutine, and invoking it may require an allocation.
The Kotlin compiler tries to minimize allocations, but in many cases they are unavoidable.
Therefore, to me, “I’m not creating or ending any coroutines” means “we never invoke any suspend fun”.

I’ll experiment further, but I wonder if there are already any documents on this, or anyone who has already looked into it?

The machinery is described here: KEEP/ at master · Kotlin/KEEP · GitHub
However, in my experience, it isn’t possible to avoid allocation on the JVM, and GC pauses are unpredictable.
Minimizing allocation is generally a good practice, but please consider a proper GC like Shenandoah or ZGC (for heaps > 32 GB).


Are you sure? I don’t know if things have changed drastically since the beta, but a coroutine is not a suspend function. A suspend function is a normal function with a hint for the compiler that it can be suspended; hence a state machine is created to save its state at the point of suspension. A coroutine, on the other hand, is a lightweight thread that has a scope and a context. Not the same thing.

Zero-GC applications are very much possible, and we build them a lot.

I’m not 100% sure about this, but I would expect that yes, coroutines inevitably generate allocations. Internally they allocate a single continuation object per invocation of a suspend function, and this happens even if you don’t suspend. Then inside this continuation they store the current state, which may or may not require further allocations - I don’t know; I would guess it doesn’t. The machinery for suspending, scheduling and resuming is also quite a complicated beast, so it may generate allocations too.

If you use profilers and you don’t see many allocations for Continuation, be aware that each suspend function creates its own subtype. If it is possible to configure the profiler to track all subtypes of Continuation, then this is where I would start.
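To illustrate, here is a stdlib-only sketch (`continuationClass` is a made-up helper): the continuation visible inside a suspend call chain is an instance of a compiler-generated class, not of `kotlin.coroutines.Continuation` itself, which is why a profiler filter on the exact interface name can miss these allocations.

```kotlin
import kotlin.coroutines.*
import kotlin.coroutines.intrinsics.*

// Returns the runtime class name of the continuation flowing through this call.
// Because this is a tail-call suspend fun, it sees its caller's continuation,
// which here is the compiler-generated class backing the suspend lambda below.
suspend fun continuationClass(): String = suspendCoroutineUninterceptedOrReturn { c ->
    c.javaClass.name   // returns immediately; we never actually suspend
}

fun main() {
    var name = ""
    suspend { name = continuationClass() }
        .startCoroutine(Continuation(EmptyCoroutineContext) { it.getOrThrow() })
    // Prints some generated class name, not "kotlin.coroutines.Continuation".
    println(name)
}
```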

But to be honest… I don’t really get what you are trying to achieve here. The main idea behind coroutines is to start them per task or per request, start many of them, fork and join freely, use them as disposable units of work, and at the same time not take a performance hit comparable to starting that many threads. If I understood you correctly that you plan to start a limited number of coroutines to process tasks in a loop, maybe basing their number on the number of CPUs or even mapping them 1:1 to threads, then I don’t really see how you benefit from using coroutines here. They will be like threads, with additional complexity.


What I’m trying to do is make my event loops coroutine-friendly, so you could submit a suspendable function and write your logic in a simple synchronous way, rather than defining a bunch of closures for what to do on each event, which makes the code harder to follow.
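Roughly this shape, as a stdlib-only sketch (`EventLoop`, `submit` and `yieldToLoop` are hypothetical names, not an existing API): the loop accepts a suspendable task, runs it until it suspends, and later resumes it from the same loop.

```kotlin
import kotlin.coroutines.*

// Hypothetical single-threaded event loop accepting suspendable tasks.
class EventLoop {
    private val ready = ArrayDeque<() -> Unit>()

    fun submit(task: suspend () -> Unit) {
        // Runs synchronously until the task first suspends.
        task.startCoroutine(Continuation(EmptyCoroutineContext) { it.getOrThrow() })
    }

    fun schedule(work: () -> Unit) { ready.addLast(work) }

    fun runToIdle() {
        while (true) (ready.removeFirstOrNull() ?: return).invoke()
    }
}

// Suspends the current task and re-queues its resumption on the loop.
suspend fun EventLoop.yieldToLoop(): Unit = suspendCoroutine { c ->
    schedule { c.resume(Unit) }
}

fun main() {
    val loop = EventLoop()
    val log = mutableListOf<String>()
    loop.submit {
        log += "start"
        loop.yieldToLoop()
        log += "resumed"
    }
    log += "submitted"
    loop.runToIdle()
    println(log)   // [start, submitted, resumed]
}
```

The task reads top to bottom like synchronous code, but control returns to the loop at every suspension point instead of being split across callbacks.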

I just have a specific performance goal that I cannot compromise on in either case.

Why are Continuations created even if you do not suspend? And why can’t the current state be something that is allocated once per call site and reused later on? For simplicity’s sake, let’s assume all function arguments involved are primitives; that state should definitely be completely reusable.

I guess I’ll need to take a deep dive into the coroutine world.

Are you sure?

Sorry, you are right.
However, a large part of the allocation is in suspending functions; @broot already explained this and I agree with his reasoning.

Zero-GC applications are very much possible, and we build them a lot.

Congratulations! So I suppose you are the unique company using Epsilon GC in production with a small heap.

Why are Continuations created even if you do not suspend?

Invoking a suspend function may allocate a Continuation; more details are available in the KEEP.

Hmm, but how could you “submit a suspendable function” to an event loop and at the same time “not create or end any coroutines”? OK, if you mean having a fixed number of coroutine workers that are scheduled on a smaller number of threads (or a single thread), then that makes sense to me.

I believe there are several reasons for this, but it is hard to explain without going into too much detail. Continuations are used both for handling suspension and for resuming, and also to create an equivalent of the stack in case we have to suspend. Maybe the creation of continuations could be postponed a little, but I believe that would complicate the bytecode and wouldn’t change much. I’m mostly speculating here; I don’t know what the reasons were to design it like this.

How would you like to reuse it? If you invoke a function 10 times concurrently, you need 10 separate memory spaces to store the local variables. Normally we create a separate stack frame for each such invocation, but because coroutines are stackless, we need to store these frames on the heap.


In my view there are 2 benefits to coroutines:

  1. You get lightweight threads, so it’s actually OK to spam them for each request.
  2. MOST important: you get code clarity by writing synchronous code while having actual async.
     (Structured concurrency is also, IMO, part of 2.)

You are correct, it’s not the usual use case where people just spam coroutines. I’m not using them because they are lightweight and spammable; I’m using them for code clarity. The event loop only has a few jobs, but those jobs are not suspendable, which makes their code quite hard to follow. If instead I could submit a suspendable piece of code to run, then I wouldn’t have to worry about the previous context.

Regarding the 10 invocations, it’s very simple.
Let’s say my function signature is someFun(int a, long b).
I (or rather the compiler) would create an ArrayList of some temporary structure that can hold the args of that function (a, b) and any state before the suspension.

Once allocated, that ArrayList could grow up to the maximum number of concurrent suspended invocations. So in your example it’s 10; once the size is 10, everything is reused. Some of those routines will finish, but that list will still be there, with those structures in it, ready to be reused.

Reusing things is the name of the game for zero GC.
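As an illustration of what I mean (a hand-rolled sketch; `FramePool` and `Frame` are made-up names, and this is not something the Kotlin compiler emits today): a grow-only pool whose frames hold the primitive state, so steady-state operation allocates nothing.

```kotlin
// Grow-only pool of reusable "frames" for a hypothetical someFun(int a, long b):
// once the pool has grown to the peak number of concurrent suspended
// invocations, acquire/release stops allocating entirely.
class FramePool {
    class Frame {
        var a: Int = 0      // someFun's first argument
        var b: Long = 0L    // someFun's second argument
        var label: Int = 0  // state-machine position at the point of suspension
    }

    private val free = ArrayDeque<Frame>()

    // Reuse a released frame if one exists; allocate only while growing.
    fun acquire(): Frame = free.removeLastOrNull() ?: Frame()

    fun release(frame: Frame) {
        frame.label = 0
        free.addLast(frame)
    }
}
```

With a peak concurrency of 10, the pool ends up holding at most 10 frames, and every acquire after that is a reuse.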

Another point that interests me is how the synchronization happens. For example, if I have a single-threaded executor backing my coroutines vs. many threads, is the coroutine code smart enough to avoid unnecessary synchronization, or does it do it anyway, assuming any thread could be taking over the continuation?

Because synchronization will flush out my CPU caches and impact performance significantly.

OK, so you mean creating pools of continuations. Yes, this should be technically doable, but I’m pretty sure Kotlin coroutines don’t do this right now; it seems like overkill for most cases. And unfortunately this part is controlled by the compiler, so you can’t easily hack the existing behavior.

I don’t think it prefers the same thread for the same coroutine. In some cases coroutines try not to go through re-scheduling and instead invoke the code directly. For example, if you have a coroutineScope() with multiple launch()es inside it, then one of the launches will most probably not be scheduled, but invoked directly by the thread that handled the parent coroutine. Maybe something similar happens if you join() another coroutine which is using the same dispatcher.

However, if a coroutine got suspended, I believe it will resume on an arbitrary thread: whichever is first. Of course, the coroutines machinery puts memory barriers there, so we have happens-before and similar guarantees, but it will not provide optimal use of the CPU cache.

If you need guaranteed coroutine-thread affinity, then the easiest would be to create a pool of single-threaded dispatchers. Alternatively, I think it should be possible to do this with a custom dispatcher implementation. But I’m not sure that will let you use the CPU cache optimally: if the thread has already gone off to do something else, its caches may already have been replaced.
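For the custom route, a stdlib-only `ContinuationInterceptor` can pin every resumption to one dedicated thread. This is a sketch under the assumption that thread affinity is all you need (`PinnedInterceptor` is a made-up name); note that the wrapper continuation is itself an allocation, so even this scheme is not allocation-free.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import kotlin.coroutines.*

// Pins all resumptions of a coroutine to one dedicated thread.
class PinnedInterceptor(private val exec: ExecutorService) :
    AbstractCoroutineContextElement(ContinuationInterceptor), ContinuationInterceptor {

    override fun <T> interceptContinuation(continuation: Continuation<T>): Continuation<T> =
        // The wrapper is allocated once per coroutine and routes every
        // resumption onto the single-threaded executor.
        Continuation(continuation.context) { result ->
            exec.execute { continuation.resumeWith(result) }
        }
}

fun main() {
    val exec = Executors.newSingleThreadExecutor { r -> Thread(r, "pinned") }
    val done = CountDownLatch(1)
    var threadName = ""

    suspend {
        // Suspend, then get resumed from a foreign thread...
        suspendCoroutine<Unit> { c -> Thread { c.resume(Unit) }.start() }
        // ...yet the interceptor routes the body back to the pinned thread.
        threadName = Thread.currentThread().name
    }.startCoroutine(Continuation(PinnedInterceptor(exec)) { done.countDown() })

    done.await()
    println(threadName)   // "pinned"
    exec.shutdown()
}
```

This gives affinity, but as discussed below it says nothing about avoiding the memory barriers inside the executor and the coroutine machinery.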

I’m not talking about preferring the same thread. I’m saying: if a coroutine is guaranteed to always run on the same thread, then there is no need for synchronization. Will it still do it, or does it have no way of knowing that it’s running on a single-threaded executor?

Yes, I was talking about single-threaded dispatchers, but what I was asking is: will it still issue memory barriers when it’s on a single-threaded dispatcher, or does it have no clue about it?

Obviously you don’t need any memory barriers when you’re on a single-threaded dispatcher. Those membars affect not only this thread but the entire core, maybe all cores if the cache segment is dirty.

I’m not sure I got your question right. My guess would be that even if I run a single coroutine on a single-threaded dispatcher, it still puts memory barriers in when suspending. I don’t think the coroutines framework does this manually, but, for example, executor implementations in the JVM do this after processing a task.

But I don’t really know. It goes much deeper than my knowledge :slight_smile:

Yes, that’s another reason why they are trash when it comes to performance; there is no reason to have any membars when you are thread-confined, which is the case for any single-threaded executor or dispatcher.

Thanks for the input though.

A kotlinx.coroutines Job is multi-threaded on the JVM and any suspending function can switch threads; there is no guarantee of “memory barrier free” or “allocation free” code.

If you need a different implementation, you can consider implementing different coroutine machinery based on the kotlin.coroutines package (not kotlinx.coroutines); however, staying “allocation free” remains really hard.


Yeah, the coroutines machinery is one thing, but again, the additional allocations are compiled directly into the bytecode, so we would really need a modded Kotlin compiler, or maybe a compiler plugin.

suspend fun test2() {
    val a = 0
    delay(1000)
}
  public static final java.lang.Object test2(kotlin.coroutines.Continuation<? super kotlin.Unit>);
       0: aload_0
       1: instanceof    #11                 // class Test1Kt$test2$1
       4: ifeq          36
       7: aload_0
       8: checkcast     #11                 // class Test1Kt$test2$1
      11: astore_2
      12: aload_2
      13: getfield      #15                 // Field Test1Kt$test2$1.label:I
      16: ldc           #16                 // int -2147483648
      18: iand
      19: ifeq          36
      22: aload_2
      23: dup
      24: getfield      #15                 // Field Test1Kt$test2$1.label:I
      27: ldc           #16                 // int -2147483648
      29: isub
      30: putfield      #15                 // Field Test1Kt$test2$1.label:I
      33: goto          45
      36: new           #11                 // class Test1Kt$test2$1
      39: dup
      40: aload_0
      41: invokespecial #20                 // Method Test1Kt$test2$1."<init>":(Lkotlin/coroutines/Continuation;)V
      44: astore_2
      45: aload_2
      46: getfield      #24                 // Field Test1Kt$test2$1.result:Ljava/lang/Object;
      49: astore_1
      50: invokestatic  #30                 // Method kotlin/coroutines/intrinsics/IntrinsicsKt.getCOROUTINE_SUSPENDED:()Ljava/lang/Object;
      53: astore_3
      54: aload_2
      55: getfield      #15                 // Field Test1Kt$test2$1.label:I
      58: tableswitch   { // 0 to 1
                     0: 80
                     1: 103
               default: 113
          }
      80: aload_1
      81: invokestatic  #36                 // Method kotlin/ResultKt.throwOnFailure:(Ljava/lang/Object;)V
      84: ldc2_w        #37                 // long 1000l
      87: aload_2
      88: aload_2
      89: iconst_1
      90: putfield      #15                 // Field Test1Kt$test2$1.label:I
      93: invokestatic  #44                 // Method kotlinx/coroutines/DelayKt.delay:(JLkotlin/coroutines/Continuation;)Ljava/lang/Object;
      96: dup
      97: aload_3
      98: if_acmpne     108
     101: aload_3
     102: areturn
     103: aload_1
     104: invokestatic  #36                 // Method kotlin/ResultKt.throwOnFailure:(Ljava/lang/Object;)V
     107: aload_1
     108: pop
     109: getstatic     #50                 // Field kotlin/Unit.INSTANCE:Lkotlin/Unit;
     112: areturn
     113: new           #52                 // class java/lang/IllegalStateException
     116: dup
     117: ldc           #54                 // String call to \'resume\' before \'invoke\' with coroutine
     119: invokespecial #57                 // Method java/lang/IllegalStateException."<init>":(Ljava/lang/String;)V
     122: athrow

It creates its continuation at offset 36 (new #11). And yes, this is really what you get for a very simple suspend function :smiley:


BTW, did you try similar research with Project Loom? It could be much easier, as Kotlin coroutines are a kind of “hack” over what the JVM provides (or rather what it doesn’t). But I imagine for this type of software it may be pretty hard to jump to newer versions of the JVM.

No, I haven’t even tried it. With the amount of crap I’ve seen in the Java SDK, I have no hope for them to get it right, especially since they want to make it somehow work nicely with executors and old Threads…

By the way, why do you post the bytecode directly? You can decompile it to Java from there, no?

I wish I had the time to dive into this, but at first glance it seems it creates one and then reuses it from there.


As far as I understand it, when a suspend function is invoked, it receives the continuation of its caller, then it creates its own continuation and starts executing the function from the beginning. If the coroutine is resumed, the function is invoked with its own continuation, and the code uses this continuation to restore the state and jump to the proper offset. So a new allocation happens per suspend-fun invocation, not per suspend or per resume.


That’s definitely not the case: if it receives its own continuation, it does not create a new one; a new one is created only if you didn’t branch out via those if conditions. Try decompiling back to Java to see what it does; reading bytecode (even if you think you know it) is often tricky.