20% better than ReentrantLock? Inline classes are awesome

I was experimenting with inline value classes and decided to implement a simple lock using a backoff spin-lock idea I read somewhere. My basic testing with JMH shows crazy performance gains compared to ReentrantLock. Granted, this lock (deliberately) does not support reentrancy, but I consider acquiring the same lock from the same thread a code smell and a sign that you need to simplify your code.
Anyhow, 20% seems a bit too much (sometimes I see almost 30%). I wonder if others will see similar results? Maybe I'm missing something here or my implementation is wrong.

Here is the full implementation:

package remove_later

import jdk.internal.vm.annotation.Contended
import java.lang.RuntimeException
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicReference
import java.util.concurrent.locks.Condition
import java.util.concurrent.locks.Lock
import java.util.concurrent.locks.LockSupport

// best lock for high contention?
@JvmInline
value class BackOff(
    // Avoid false sharing; this is a very contended field.
    @Contended
    // We could probably inline the AtomicReference and just use one volatile field with a VarHandle; we only care about CAS...
    val ref: AtomicReference<Thread> = AtomicReference()
) : Lock {

    override fun lock() {
        val thread = Thread.currentThread()
        var backOffTimeNanos = 2L
        while (true) {
            if (ref.compareAndSet(null, thread)) {
                return
            } else {
                // I'm a fan of putting tons of assertions and using -ea,
                // but IMO it's fair if people want to replace this with an if (I just think it's a waste of compute).
                assert(ref.get() !== thread) { "reentrancy is not supported" }
                LockSupport.parkNanos(backOffTimeNanos)

                // We could experiment with minBackOff (2), multiplier (2), maxBackOff (512) and see if there is any improvement.
                backOffTimeNanos = minOf(backOffTimeNanos * 2, 512L)
            }
        }
    }

    override fun unlock() {
        val thread = Thread.currentThread()
        if (ref.compareAndSet(thread, null)) return
        throw RuntimeException("$thread does not hold the lock, thus cannot release it")
    }

    // Not implementing these for now, but it's trivial to do so.
    override fun lockInterruptibly() {
        throw RuntimeException("Not yet implemented")
    }

    override fun tryLock(): Boolean {
        throw RuntimeException("Not yet implemented")
    }

    override fun tryLock(time: Long, unit: TimeUnit): Boolean {
        throw RuntimeException("Not yet implemented")
    }

    override fun newCondition(): Condition {
        throw RuntimeException("Not yet implemented")
    }
}

The benefit of the inline class is that this essentially compiles down to a single AtomicReference instance, instead of a lock instance that points to an AtomicReference… so one less indirection for the runtime…
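One way to sanity-check that on a given build (a sketch; the Holder class is hypothetical, and I'm assuming the usual @JvmInline lowering where a non-nullable value class property compiles to a field of the underlying type):

class Holder(val lock: BackOff = BackOff())

fun main() {
    // If BackOff is flattened as described, this should report
    // java.util.concurrent.atomic.AtomicReference: Holder stores the
    // ref directly, with no BackOff wrapper object in between.
    println(Holder::class.java.getDeclaredField("lock").type)
}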

One question I have is: if Kotlin compiles this as described above, how are virtual threads going to figure out that this is a lock? Because they need to pin the underlying thread to this vthread…

I’ll post the JMH later, but I’d rather people approach it without bias and come up with their own results.


I'm not exactly sure what you mean. Why would virtual threads need to know this is a lock? Why do they need to pin in this case?

Also, did you consider not parking the thread? Lock-free algorithms often skip waiting entirely, especially if the wait is as small as 2 nanos.

Not parking the thread is good if you actually want to busy spin, in cases where latency is all you care about; here we care about how much stuff we can get done as fast as possible. When you don't park and keep spinning, all the time the scheduler gifted to your unlucky thread gets burned/wasted on the spin, waiting for the lock; you essentially starve the thread that holds the lock of CPU time. I think of this as work stealing but for locks: if some thread got the lock, all other threads actively gift their time to that thread so the work gets done quickly. By the way, it's almost never 2 nanos; it's basically a way of saying "dear scheduler, come back to me as soon as you get the chance", but you are by no means guaranteed 2 nanos, it's much more in practice. I can probably measure it pretty accurately.

Busy spinning has its place for sure, though: it's weaker in performance/throughput and better in latency; I don't know by how much. Also, you often don't use a lock in such places and just let some volatile member do what you need.
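For comparison, a minimal busy-spin variant would look something like this sketch (names are mine; Thread.onSpinWait() is the JDK 9+ spin hint):

import java.util.concurrent.atomic.AtomicReference

// Pure busy spin: best latency, but every waiter burns its scheduler
// quantum spinning instead of gifting it to the lock holder.
class BusySpinLock {
    private val ref = AtomicReference<Thread>()

    fun lock() {
        val thread = Thread.currentThread()
        while (!ref.compareAndSet(null, thread)) {
            Thread.onSpinWait() // CPU hint: we are in a spin-wait loop
        }
    }

    fun unlock() {
        ref.set(null) // assumes the caller actually holds the lock
    }
}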

Regarding virtual threads, what I remember is that virtual threads work correctly with locks, meaning that if a virtual thread acquires a lock, the lock is actually acquired by the underlying platform thread; if your virtual thread then moves on to execute somewhere else, you will have concurrency issues. Because of that, they pin the virtual thread to that particular platform thread until the lock is released, which made sense to me when I heard about their solution. But now, if it's just an atomic ref doing some CAS, they have no way of knowing…


I understand what spinlocks are. I just feel waiting 2 nanos is like getting the worst of both worlds: we still do context switches, as with regular locks, and we still consume 100% of resources, as with spinlocks.

In this case, did you consider Thread.yield()?

I'm not an expert here, but I assume locks work with virtual threads, not platform threads. For this reason they work correctly without any pinning. Your solution should also be fine.

The documentation says the problem is specific to synchronized blocks. It suggests using… locks. To avoid pinning.

I don't say I'm correct, but I suggest not overthinking this, as your solution is probably fine.

The reason I don't use yield here is that it's not time-based: with parkNanos I can progressively back off for longer and longer. The more threads you have competing for this lock, the better this becomes. Yield will simply say "hey scheduler, I have nothing to do, give the time to someone else", but if you have like 20 threads constantly saying "not me", it's still a waste of context switches and CPU time.

This basically adapts to how briefly your lock is needed: if you are constantly being refused the lock, chances are high that the owner needs it not for a few micros but for longer…

It basically outperforms everything in almost all scenarios other than latency; if you really want the best latency, then busy spinning is by definition the way.

Try it out, benchmark it.

OK, I understand all this, but what's the point of waiting for 2 ns if context switching itself takes 1000 times longer than that? Of course, when using virtual threads we don't necessarily do a full context switch, so these 1200 ns may be irrelevant, but still, tens of nanos sounds like very little, and VT switching also takes some time.

In your benchmarks, do you really see any meaningful difference depending on whether you use 2 ns, 4 ns or 64 ns?

No, the initial 2 ns is irrelevant in this case; the only things that make a difference are the upper limit on how big the backoff can get, and how many threads are actively competing for it.

Honestly, I don't remember why I made it 2 ns and not 1; maybe so that it gets to the upper limit faster… ignore the 2 ns, it's just something that stuck around.
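For what it's worth, the real cost of parkNanos(2) is easy to eyeball with a quick, unscientific sketch like the one below; on typical systems the observed wall time per call is in the microsecond range, nowhere near 2 ns:

import java.util.concurrent.locks.LockSupport

fun main() {
    repeat(10_000) { LockSupport.parkNanos(2) } // warm-up
    val iterations = 10_000
    val start = System.nanoTime()
    repeat(iterations) { LockSupport.parkNanos(2) }
    val avg = (System.nanoTime() - start) / iterations
    // The average wall time per call is the real lower bound on the
    // backoff, not the 2 ns we asked for.
    println("parkNanos(2) took $avg ns on average")
}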


I don’t understand how

assert(ref.get() === thread) { "reentrancy is not supported" }

works.

If

if (ref.compareAndSet(null, thread)) { 

goes to the else branch then ref.get() may return

  • either another Thread reference (which currently holds the lock), so the assert() will fail
  • or it may even return null (if the lock was unlocked between the if and the ref.get() calls), so the assert() will also fail

// I'm a fan of putting tons of assertions and using -ea,
// but IMO it's fair if people want to replace this with an if (I just think it's a waste of compute).

I would not use assert in this case, if only to keep the code deterministic.
Instead:

check(ref.get() !== thread) { "reentrancy is not supported" }

You enter the else branch if and only if the ref was not null and you failed to CAS the current thread in as the owner (i.e. to acquire the lock)…

So in the else branch you check whether the holder is you (which would mean whoever is using the lock is treating it as a reentrant lock), in which case the assert fails…

Yes, it won't be deterministic, but many things will fail early if you enable -ea and have a bunch of checks; that's the whole purpose. You put in a bunch of checks that are usually redundant but make sure everything is in order, and only in prod do you leave out the -ea. Unless people think -ea is a bad thing, no one expects it to be deterministic.

The real question is whether it's correct to throw the exception when I specifically do not support reentrancy. It's a deliberate choice. Just like with any non-reentrant lock, you will get consistently deadlocked if you try to reacquire your own lock. Do those locks throw an exception? Nope… and you can easily find such locks in the JDK itself.

Another argument: what if I have 2 virtual threads or coroutines running on one underlying thread and they both try to acquire this lock? Then my lock will behave correctly; in fact it shouldn't throw an exception, and it would be a bug if it did. If anything, the assert should just log a warning…

you enter the else branch if and only if the ref was not null
so in the else branch you check whether the holder is you

Between the two points you are referring to:

if (ref.compareAndSet(null, thread)) { // <--- (1)
     return
} else {
    assert(ref.get() === thread) // <--- (2)

anything could happen. Maybe your thread is preempted, and the thread currently holding the lock is scheduled for execution and performs unlock(). It could even happen that after this unlocking - and before your thread is scheduled for execution again - yet another thread performs lock() successfully.
Maybe I'm missing something, but I don't see the concurrency guarantee you are talking about when you assume anything about the contents of ref at (2).
I think at (2) the value of ref can be either the current thread, any other thread, or even null.

Maybe a more correct implementation would be to check for reentrancy at the start of the method?

    override fun lock() {
        val thread = Thread.currentThread()
        check(ref.get() !== thread) // reentrant
        while (true) {
            if (ref.compareAndSet(null, thread)) {
                return
            } else {
                ... // wait or yield or something
            }
        }
    }

First of all, I assume there was a bug in the code and @vach actually meant assert(ref.get() !== thread), not ===. This is only to check that we don’t reenter the lock. Now:

Yes, all these things may happen, but they don't matter. The only thing we need to make sure of is that we are not currently the holder. Other threads may lock or unlock it in the meantime.

Your alternative solution should also work, and it's potentially better, as we don't have to check the thread on each iteration of the loop.
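Putting the two together, a combined sketch could hoist the reentrancy check out of the loop and keep the original backoff (just one way to assemble it, using check() as suggested above):

override fun lock() {
    val thread = Thread.currentThread()
    // Fail fast on attempted reentrancy, once, before the loop.
    check(ref.get() !== thread) { "reentrancy is not supported" }
    var backOffTimeNanos = 2L
    while (true) {
        if (ref.compareAndSet(null, thread)) return
        LockSupport.parkNanos(backOffTimeNanos)
        backOffTimeNanos = minOf(backOffTimeNanos * 2, 512L)
    }
}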


I assume there was a bug in the code

OK, it changes everything if he really meant !== instead of ===.
(I tried to highlight the issue in my comment above.)

but they don’t matter

Yes, provided we apply the === → !== operator change.

Although I still think that if the implementation does not support reentrant operation then the reentrancy check should not depend on the runtime settings of the JVM (-ea).

maybe I'm missing something here or my implementation is wrong

Otherwise, the idea is interesting.
I would be curious what other concurrency experts think about the “final” implementation.
Because it seems simpler than the lock implementations I have seen, and may be more efficient as well (based on the OP's measurements; I haven't measured it myself).

One drawback I can think of is that the implementation above is always "unfair".

This isn't really anything new; lock-free algorithms have existed since the beginning of concurrency. They are usually better than blocking the thread if there is high contention and the operation is pretty quick. Again, I'm not entirely sure about this specific implementation with the waiting. My intuition says it makes little sense and is generally worse than the alternatives, but I didn't do any measurements myself.


The code is correct, try running it and it will make sense.

If you fail to CAS, it means the ref wasn't null: someone holds the lock. So in the else branch you check whether that someone is you, and the assert will throw if that resolves to true…

Arguments against putting an if:

  1. You introduce a branch, which can be very expensive if not predictable, so I generally avoid it.
  2. An if would throw an exception, which is not something non-reentrant locks do; see the other JDK locks that do not extend ReentrantLock… it's the user's fault for misusing it.
  3. -ea is deterministic: it will fail fast when you enable it and will let you fail later if you don't, which is the purpose of -ea and assertions. You don't want them executing in prod, but you want them to catch things during dev (see the quick demo after this list).
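A quick way to see the -ea behavior for yourself (a tiny demo; as far as I know, kotlin.assert honors the JVM's -ea flag just like Java's assert statement):

fun main() {
    try {
        assert(false) { "assertions are enabled" }
        // Without -ea, kotlin.assert is a no-op and we land here.
        println("assert did nothing (run with -ea to enable it)")
    } catch (e: AssertionError) {
        println("assert fired: ${e.message}")
    }
}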

But you are entirely free to put an if before the loop and benchmark it; I'm sure it will still be 20% faster than ReentrantLock if you have more than 4 threads actively competing for the lock, and it will be much better when you have more threads.

I'll post the JMH later on; I'm pretty sure this was better than yield. The more threads you throw at this lock, the better it does against the competition.

The tradeoff is reentrancy, which I do not care for. Reentrancy is usually a guardrail against bad code… I cannot remember a legitimate use case where I needed to acquire my own lock multiple times; it's usually there to guard against someone marking everything synchronized.

Assert throws if we provide false, not true.


You're right, I fixed the original code. I had my own check()-type methods with an identical signature, but those fail if the condition resolves to true…
I'll post the JMH later; try it out and let's see if you see similar gains.

Ahh, one additional note: making an inline class implement a common interface isn't as useful as it seems. At first sight it looks good: we implemented a standard Lock, so we can use it with many standard tools for locks, and we inlined it, so it is simply an atomic ref. The problem is that we can't have both of these features at the same time. If we pass it anywhere a Lock is required, it will be automatically boxed. Inlining works only if we store the lock as the BackOff type specifically. And if we have to represent it everywhere as our own type anyway, then we could skip the standard Lock part altogether, because we don't really use it.

It is still convenient to at least be able to switch between these two worlds, depending on specific needs. So it is still useful, just not as useful as it may seem.
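To make the boxing point concrete, a sketch (the variable names are mine, just for illustration):

import java.util.concurrent.locks.Lock

fun main() {
    val raw = BackOff()   // static type BackOff: stays a bare AtomicReference
    val boxed: Lock = raw // upcast to the interface: a wrapper object is allocated here
    boxed.lock()          // standard Lock tooling works, but through the boxed form
    boxed.unlock()
    raw.lock()            // resolves against the value class directly, no boxing
    raw.unlock()
}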

Good to know, that explains a lot of the bytecode I was going through. I don't care about the Lock interface at all; I just have one convenience method like this:

// used to have this
inline fun <reified R> using(lock: Lock, function: () -> R): R {
    try {
        lock.lock()
        return function.invoke()
    } finally {
        lock.unlock()
    }
}

// will this actually always inline it?
inline fun <reified R> using(lock: BackOffSpinLock2, function: () -> R): R {
    try {
        lock.lock()
        return function.invoke()
    } finally {
        lock.unlock()
    }
}

Do you think the second version will always let it inline the lock? Also, will it do so even though I implement the Lock interface, or should I drop the interface too? I'll experiment.
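For reference, a usage sketch of the second overload (I'm assuming BackOffSpinLock2 is the BackOff class above under its in-repo name, and that the statically typed parameter is what keeps it unboxed, per the boxing discussion earlier):

val stateLock = BackOffSpinLock2()
var state = 0

// Both the function and the lambda are inlined at the call site, so the
// lock should never need to be boxed into its Lock interface form.
fun bump(): Int = using(stateLock) { state++ }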