Monitoring metrics for coroutines in production

Since coroutines have graduated to stable, we plan to adopt them more heavily in our code. Our goals are 1) saner async code than threads + callbacks, and 2) the ability to spawn tons of coroutines cheaply where threads would be expensive.

However, we are working on something big enough that performance matters, so we want to be able to gain insight from our usage of coroutines instead of using them blindly. With threads and executors we can get, for example, the number of threads in each state (RUNNABLE, BLOCKED, etc.), the number of queued tasks, and so on. These metrics can help (and have helped) us troubleshoot performance issues.

What are the usual metrics that can be gathered from coroutine usage? How would they be used to gain insight into how the system is operating? And are there any tools available for this already?

Anyone? :slightly_smiling_face:

Since coroutines are cheap, I am not sure whether it makes sense to monitor anything about them.

Coroutines help with reducing the number of (blocked) threads. So, I guess, you could continue monitoring threads.
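As a rough sketch of what "continue monitoring threads" could look like, the snippet below counts live JVM threads per state using the standard `ThreadMXBean`. The function name `threadCountsByState` and its `namePrefix` parameter are made up for illustration. The `"DefaultDispatcher-worker"` prefix used in `main` is the name the default dispatcher's worker threads normally carry, but treat it as an assumption to verify against your own thread dumps.

```kotlin
import java.lang.management.ManagementFactory

/**
 * Counts live JVM threads per Thread.State (RUNNABLE, BLOCKED, WAITING, ...).
 * namePrefix is a hypothetical convenience filter; the empty default matches every thread.
 */
fun threadCountsByState(namePrefix: String = ""): Map<Thread.State, Int> {
    val mx = ManagementFactory.getThreadMXBean()
    return mx.getThreadInfo(mx.allThreadIds)
        .filterNotNull() // a thread may have terminated between the two calls
        .filter { it.threadName.startsWith(namePrefix) }
        .groupingBy { it.threadState }
        .eachCount()
}

fun main() {
    // All threads in the JVM.
    threadCountsByState().forEach { (state, count) -> println("$state: $count") }

    // Assumption: default dispatcher workers are named "DefaultDispatcher-worker-N".
    println(threadCountsByState("DefaultDispatcher-worker"))
}
```

This only shows what the worker threads are doing. It says nothing about how many coroutines are suspended or waiting to be dispatched.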

The monitoring that looks attractive to me would be a gauge on the sizes of the CoroutineScheduler queues (global and local). Our biggest fear is accidentally putting slow blocking work (or worse, deadlocks) in our main dispatcher. That happened to us once on a previous project that used Kotlin coroutines incorrectly, and also when using Ratpack’s coroutine-style execution. Being able to see whether work is building up over time therefore seems helpful. Would it be reasonable to expose some of these stats somewhere? As an awful hack we are considering parsing (Dispatchers.Default as ExecutorCoroutineDispatcher).executor.toString().
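For what it is worth, a minimal sketch of the parsing half of that hack might look like the following. It deliberately hard-codes no field names: it just pulls every `name = number` pair out of whatever string it is given, because the `toString()` format is internal, undocumented, and may differ between library versions. The function name `parseNumericFields` is hypothetical.

```kotlin
// Matches "some name = 123" where the name consists of letters and spaces only.
private val numericField = Regex("""([A-Za-z][A-Za-z ]*?)\s*=\s*(\d+)""")

/**
 * Extracts every "name = number" pair from a free-form diagnostic string.
 * If the same name appears more than once, the last occurrence wins.
 * Values that do not fit in a Long are skipped.
 */
fun parseNumericFields(dump: String): Map<String, Long> =
    numericField.findAll(dump)
        .mapNotNull { match ->
            val value = match.groupValues[2].toLongOrNull()
            if (value == null) null else match.groupValues[1].trim() to value
        }
        .toMap()

fun main() {
    // Stand-in input; in the hack described above this would be the dispatcher's toString().
    println(parseNumericFields("queue size = 3, workers= 2"))
}
```

The result could be fed into a gauge by looking up whichever keys your version actually prints. Any library upgrade can silently break this, so it is a stopgap rather than a metrics API.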

I would also love it if such metrics were exposed. For comparison, Erlang’s Observer tool gives visibility into the number of messages in a process’s inbox.

“Since coroutines are cheap, I am not sure whether it makes sense to monitor anything about them.”

“Do I have enough threads available in my Dispatcher to keep up with coroutine work” seems like the simplest and most obvious use case. The number of queued coroutines and the wait time before invocation seem like quite valuable information to expose.
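Until something like that is exposed, one workaround for dispatchers you create yourself is to wrap the underlying `Executor` and count tasks on the way in and out. The sketch below is not part of kotlinx.coroutines; `InstrumentedExecutor`, `queuedTasks`, and `maxWaitMillis` are made-up names, and it does not cover `Dispatchers.Default`.

```kotlin
import java.util.concurrent.CountDownLatch
import java.util.concurrent.Executor
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicInteger
import java.util.concurrent.atomic.AtomicLong

/**
 * Wraps an Executor and tracks how many tasks are submitted but not yet started,
 * plus the longest observed wait between submission and start.
 * Rejected submissions are not handled in this sketch.
 */
class InstrumentedExecutor(private val delegate: Executor) : Executor {
    private val queued = AtomicInteger()
    private val maxWaitNanos = AtomicLong()

    /** Tasks handed to the delegate that have not begun running. */
    val queuedTasks: Int get() = queued.get()

    /** Longest time any task so far has waited before starting, in milliseconds. */
    val maxWaitMillis: Long get() = maxWaitNanos.get() / 1_000_000

    override fun execute(command: Runnable) {
        val submittedAt = System.nanoTime()
        queued.incrementAndGet()
        delegate.execute {
            queued.decrementAndGet()
            val waited = System.nanoTime() - submittedAt
            maxWaitNanos.accumulateAndGet(waited) { a, b -> maxOf(a, b) }
            command.run()
        }
    }
}

fun main() {
    val pool = Executors.newSingleThreadExecutor()
    val executor = InstrumentedExecutor(pool)
    val release = CountDownLatch(1)

    executor.execute { release.await() } // occupies the only thread
    repeat(3) { executor.execute { } }   // these have to wait behind it
    println("queued while blocked: ${executor.queuedTasks}")

    release.countDown()
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
    println("queued after drain: ${executor.queuedTasks}, max wait: ${executor.maxWaitMillis} ms")
}
```

To use it with coroutines, the wrapper can be turned into a dispatcher with the `asCoroutineDispatcher()` extension from kotlinx.coroutines. Note that the count is of dispatched tasks, not coroutines: each resumption of a coroutine is submitted as a separate task, and suspended coroutines do not show up at all.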

Has this been fixed in any recent release?

It seems like quite the gap - a truly production-ready system aiming to replace Java should have metrics exposed.

Is there any update on this?