Sequences and Inlined Lambdas

Suppose you want to perform the very common operation of applying a filter and a mapping to a list, and you want the code to be very efficient. The most obvious solution is to write:

val processedItemNames = items
    .filter { it.isProcessed }
    .map { it.name }

This is the most concise code, but there's a catch: both filter and map create new lists. After inlining, the code becomes:

    val processedItems = ArrayList<Item>()
    for (element in items) if (element.isProcessed) processedItems.add(element)

    val processedItemNames = ArrayList<String>(processedItems.collectionSizeOrDefault(10))
    for (item in processedItems)
        processedItemNames.add(item.name)

However, what if we used sequences instead?

val processedItemNames = items
    .asSequence()
    .filter { it.isProcessed }
    .map { it.name }
    .toList()

This will no longer create the intermediate list! However, the price is that the operations are no longer inlined, because the Sequence object needs to be prepared to execute the lambdas at arbitrary times. Under the hood, this is what the operations actually look like:

    val itemsSequence = Sequence { items.iterator() }

    val processedItemsSequence = FilteringSequence(itemsSequence, true) { it.isProcessed }

    val processedItemsNamesSequence = TransformingSequence(processedItemsSequence) { it.name }

    val processedItemsNames = ArrayList<String>()

    for (item in processedItemsNamesSequence) {
        processedItemsNames.add(item)
    }
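To make that laziness concrete, here is a small runnable sketch (using a hypothetical `Item` class, just for this demo) that logs the order of evaluation; each element flows through the entire chain before the next element is touched:

```kotlin
// Hypothetical Item class, only for this demonstration.
data class Item(val isProcessed: Boolean, val name: String)

val evaluationLog = mutableListOf<String>()

fun demoLaziness(): List<String> {
    val items = listOf(Item(true, "a"), Item(false, "b"), Item(true, "c"))
    return items.asSequence()
        .filter { evaluationLog.add("filter ${it.name}"); it.isProcessed }
        .map { evaluationLog.add("map ${it.name}"); it.name }
        .toList()
}
```

The log comes out element-by-element (filter a, map a, filter b, filter c, map c) rather than stage-by-stage, which is exactly why the lambdas must survive as objects that can be invoked at arbitrary times.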

Neither of these solutions reaches what we would likely consider the most performant approach:

    val processedItemNames = ArrayList<String>(items.size)
    for (item in items) {
        if (item.isProcessed) {
            processedItemNames.add(item.name)
        }
    }

Here, we create exactly one additional list, no lambda objects, and only a single Iterator. However, the code doesn’t look nearly as nice to read or write, and its correctness is not as immediately obvious.

And yes, we could use mapNotNull:

    val processedItemNames = items
        .mapNotNull { item -> item.takeIf { it.isProcessed }?.name }

However, that applies only to the specific case of a filter followed by a map when the values are otherwise non-null, and not to the general case of sequence and list operations, of which there are many.
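For instance, if the mapped field is itself nullable, the mapNotNull trick silently conflates “filtered out” with “genuinely null”. A sketch with a hypothetical `Record` class:

```kotlin
// Hypothetical Record with a nullable field.
data class Record(val keep: Boolean, val tag: String?)

// filter + map preserves legitimate nulls in the result.
fun tagsOfKept(records: List<Record>): List<String?> =
    records.filter { it.keep }.map { it.tag }

// The mapNotNull rewrite drops them along with the filtered-out records.
fun tagsViaMapNotNull(records: List<Record>): List<String?> =
    records.mapNotNull { rec -> rec.takeIf { it.keep }?.tag }
```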

I think that one of the goals of the language should be to reduce the conflict between writing performant code and writing understandable code as much as possible. So, how could we solve this problem in Kotlin? It may require some new constructs.

Suppose we had the following:

inline interface Stream<T> {
    fun forEach(operation: (T) -> Unit)
}

inline class IterableStream<T>(private val iterable: Iterable<T>): Stream<T> {
    override fun forEach(operation: (T) -> Unit) {
        iterable.forEach(operation)
    }

}

inline class FilteringStream<T>(private val sourceStream: Stream<T>, private inline val predicate: (T) -> Boolean): Stream<T> {
    override fun forEach(operation: (T) -> Unit) {
        sourceStream.forEach { item: T ->
            if (predicate(item)) {
                operation(item)
            }
        }
    }
}

inline class TransformingStream<T, R>(private val sourceStream: Stream<T>, private inline val transformer: (T) -> R): Stream<R> {
    override fun forEach(operation: (R) -> Unit) {
        sourceStream.forEach { item: T ->
            operation(transformer(item))
        }
    }
}

inline fun <T> Iterable<T>.asStream() = IterableStream(this)
inline fun <T> Stream<T>.filter(predicate: (T) -> Boolean) = FilteringStream(this, predicate)
inline fun <T, R> Stream<T>.map(transformer: (T) -> R) = TransformingStream(this, transformer)
inline fun <T> Stream<T>.toList(): List<T> {
    val list = ArrayList<T>()
    forEach { item -> list.add(item) }
    return list
}

So, it’s a complicated proposal, but what’s the result? Take the implementation using Streams:

    val processedItemNames = items
        .asStream()
        .filter { it.isProcessed }
        .map { it.name }
        .toList()

The first thing we do is inline asStream, filter, and map:

     val processedItemNames = (TransformingStream(FilteringStream(IterableStream(items)) { it.isProcessed }) { it.name }).toList()   

Then toList:

    val processedItemNames = ArrayList<String>()
    (TransformingStream(FilteringStream(IterableStream(items)) { it.isProcessed }) { it.name }).forEach { item -> processedItemNames.add(item) }

Then TransformingStream:

    val processedItemNames = ArrayList<String>()
    (FilteringStream(IterableStream(items)) { it.isProcessed }).forEach { item -> processedItemNames.add(item.name) }

Then FilteringStream:

    val processedItemNames = ArrayList<String>()
    (IterableStream(items)).forEach { item -> if (item.isProcessed) processedItemNames.add(item.name) }

Then finally IterableStream:

    val processedItemNames = ArrayList<String>()
    items.forEach { item -> if (item.isProcessed) processedItemNames.add(item.name) }

We now arrive at close to the optimal code! (The size information could be improved by adding a sizeRange field, but that would complicate the example.) However, this did require inlining more than what is currently possible in Kotlin, so we would need some more language features.

The inline modifier on an interface indicates that all implementations must be inline classes, and the modifier is also used on all members that deal with inline classes containing multiple values. Because there would be no standard way to pass such objects around as objects, they don’t exist as objects at all: the FilteringStream and TransformingStream constructs exist only in Kotlin source, not in bytecode. This entire concept only works if all information about each Stream is known at compile time, so one may not pass any of these Streams as arguments or write extension methods on them unless those are inline as well. Many of the names could be improved, but I think the general concept would be a major improvement over how we currently deal with collection and sequence operators, combining lazy evaluation with inlining.
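For comparison, the two-operation case is already expressible in today’s Kotlin as a fused operator (the name `filterMap` is hypothetical, not a stdlib function); it compiles down to the optimal single loop, but it gives up the fluent chained style and doesn’t generalize to arbitrary chains the way the Stream proposal would:

```kotlin
// A fused filter+map: one loop, one result list, no intermediate
// collections, and both lambdas fully inlined at the call site.
inline fun <T, R> Iterable<T>.filterMap(
    predicate: (T) -> Boolean,
    transform: (T) -> R
): List<R> {
    val result = ArrayList<R>()
    for (element in this) {
        if (predicate(element)) result.add(transform(element))
    }
    return result
}
```

Usage would look like `items.filterMap({ it.isProcessed }, { it.name })`.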

Have you done any benchmarking on the simple-minded sequence example?

Remember that the JVM does its own dynamic optimisation, including inlining; what it ends up executing may be very far from the bytecode.

Instead of worrying, I would just perform some simple benchmarks :wink:

Besides, sure, functional programming itself requires some additional allocation, but the point is that you don’t have to worry about the inner workings of each operation and can instead focus on the clarity of your code.

I have not. However, three things to keep in mind:

  • The Sequence classes involved are going to be used hundreds if not thousands of times throughout most codebases, on just as many lambdas, which makes them significantly more difficult to inline, particularly on mobile, which has fewer resources for these optimizations. It’s also not going to help at all in optimizing app startup.
  • Even when the JIT perfectly inlines every lambda in the Sequences, the resulting code is still less efficient than the most optimal solution due to its greater reliance on iterators, creating one Iterator per segment of the sequence chain and having to handle a hasNext and next call for every part. Not the most terrible drawback, but still not ideal.
  • The JIT is too late for eliminating the lambdas from the bytecode. This is much worse on Android, where APK size is incredibly important and lambdas are desugared into full classes, so any space savings made by Java 8 (if they exist) don’t apply at all.

That said, I will try out some benchmarks, perhaps tomorrow night, and let you know how they go.

Your suggestion would make it illegal to return a Stream from non-inline functions or to store it outside of a function’s scope, since it would be impossible to represent an inline function (or multi-value inline objects, for that matter) at runtime, which I think would be a serious deal breaker.

As others have said, a benchmark seems in order to prove that this is an actual performance issue. My gut feeling says that the time spent allocating and deallocating the extra Sequences and functions pales in comparison to the actual operation it will perform on the (generally) many objects.

Yes, for such things one would have to use existing Sequences. The inline classes here exist only to couple an object with an inlined lambda and attach behavior to it; I can think of no better representation for that than the existing class syntax. Other use cases would require using Sequence or storing the Stream into a collection.

As promised, I completed my benchmarks, using the Java Microbenchmark Harness and mostly following the instructions here.

The very dumb data class:

    data class Item(val onSale: Boolean, val price: Optional<Int>)

The setup:

        val random = Random()
        val DATA_FOR_TESTING: MutableList<Item> = ArrayList()
        for (i in 0 until N) {
            val onSale = random.nextBoolean()
            val hasPrice = random.nextBoolean()
            val price = if (hasPrice) Optional.of(random.nextInt()) else Optional.empty()
            DATA_FOR_TESTING.add(Item(onSale, price))
        }

The benchmarks:

    fun Blackhole.consumeResult(test: () -> Any) = consume(test.invoke())

    @Benchmark
    fun loopOptimalOnce(bh: Blackhole) = bh.consumeResult {
        val resultList = mutableListOf<Optional<Int>>()
        DATA_FOR_TESTING.forEach { item ->
            if (item.onSale) {
                val price = item.price
                resultList.add(price)
            }
        }
        resultList
    }

    @Benchmark
    fun loopOptimalTwice(bh: Blackhole) = bh.consumeResult {
        val resultList = mutableListOf<Int>()
        DATA_FOR_TESTING.forEach { item ->
            if (item.onSale) {
                val price = item.price
                if (price.isPresent) {
                    resultList.add(price.get())
                }
            }
        }
        resultList
    }

    @Benchmark
    fun loopListsOnce(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .filter { it.onSale }
            .map { it.price }
    }

    @Benchmark
    fun loopListsTwice(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .filter { it.onSale }
            .map { it.price }
            .filter { it.isPresent }
            .map { it.get() }
    }

    @Benchmark
    fun loopSequencesOnce(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .asSequence()
            .filter { it.onSale }
            .map { it.price }
            .toList()
    }

    @Benchmark
    fun loopSequencesTwice(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .asSequence()
            .filter { it.onSale }
            .map { it.price }
            .filter { it.isPresent }
            .map { it.get() }
            .toList()
    }

And the results:

Benchmark                                  (N)  Mode  Cnt  Score    Error  Units
FilterMapLoopHelper.loopListsOnce         1000  avgt    5  0.006 ±  0.001  ms/op
FilterMapLoopHelper.loopListsOnce        10000  avgt    5  0.105 ±  0.004  ms/op
FilterMapLoopHelper.loopListsOnce       100000  avgt    5  1.673 ±  0.373  ms/op
FilterMapLoopHelper.loopListsTwice        1000  avgt    5  0.012 ±  0.001  ms/op
FilterMapLoopHelper.loopListsTwice       10000  avgt    5  0.191 ±  0.005  ms/op
FilterMapLoopHelper.loopListsTwice      100000  avgt    5  2.370 ±  0.336  ms/op
FilterMapLoopHelper.loopOptimalOnce       1000  avgt    5  0.003 ±  0.001  ms/op
FilterMapLoopHelper.loopOptimalOnce      10000  avgt    5  0.078 ±  0.004  ms/op
FilterMapLoopHelper.loopOptimalOnce     100000  avgt    5  0.980 ±  0.082  ms/op
FilterMapLoopHelper.loopOptimalTwice      1000  avgt    5  0.005 ±  0.001  ms/op
FilterMapLoopHelper.loopOptimalTwice     10000  avgt    5  0.074 ±  0.001  ms/op
FilterMapLoopHelper.loopOptimalTwice    100000  avgt    5  1.044 ±  0.091  ms/op
FilterMapLoopHelper.loopSequencesOnce     1000  avgt    5  0.006 ±  0.002  ms/op
FilterMapLoopHelper.loopSequencesOnce    10000  avgt    5  0.123 ±  0.002  ms/op
FilterMapLoopHelper.loopSequencesOnce   100000  avgt    5  1.513 ±  0.116  ms/op
FilterMapLoopHelper.loopSequencesTwice    1000  avgt    5  0.012 ±  0.001  ms/op
FilterMapLoopHelper.loopSequencesTwice   10000  avgt    5  0.219 ±  0.007  ms/op
FilterMapLoopHelper.loopSequencesTwice  100000  avgt    5  2.241 ±  0.319  ms/op

For loopOptimal, the Once and Twice values weren’t far apart, and at N = 10000, Twice was actually faster. However, both loopLists and loopSequences took 1.5-2x longer on Twice. Comparing across approaches, loopOptimal was by far the fastest; even on Once, only loopSequences at N = 1000 came close, with barely touching error bars. Beyond that, loopOptimal was clearly significantly faster than both of the others, by over 2x in the Twice, N = 10000 case.

This is, of course, a very simple example compared to what you might find in filter and map, but it’s not too out of the ordinary, and even complicated examples would benefit from having less overhead from how the filter and map are structured.

The JVM (at least the Java 8 JVM) will likely not inline this kind of Sequence code in the machine code it compiles. The JVM can only inline a call when the call site targets just one or two methods (as measured during the JVM’s profiling); the call site is the JVM bytecode that performs the call. Each filter/map/etc. transform will likely have only a single bytecode call site, but in a real application that uses sequences, each transform will be used with more than three distinct lambdas. See my SO post here for more details.

The exception is small benchmarks such as the one in the previous post, where the JVM will most likely see only the few lambdas being benchmarked during its profiling. To get a more realistic number, you need to make sure that the transforms are first called with at least three different lambdas a few thousand times, so that the JVM profiler gets to see them, before running the real benchmarks.
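In plain Kotlin, such a warm-up could look like the following sketch (names hypothetical): run several distinct lambda classes through the same transform, so the predicate-invocation site inside the sequence machinery sees more than two receiver types before measurement begins:

```kotlin
// Feed several distinct lambdas through Sequence.filter so the JIT's
// profile of the predicate call site becomes megamorphic.
fun polluteFilterProfile(data: List<Int>, rounds: Int): Int {
    var sink = 0  // accumulate counts so nothing is dead-code eliminated
    repeat(rounds) {
        sink += data.asSequence().filter { it % 2 == 0 }.count()
        sink += data.asSequence().filter { it > 10 }.count()
        sink += data.asSequence().filter { it < 100 }.count()
    }
    return sink
}
```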


I agree, several things could be done to improve on this example. (If anyone has suggestions for more realistic work to be done in each filter and map step, I’d be happy to benchmark those as well, but these simple field accesses are surprisingly common in the code I’ve seen. Note again that regardless of performance, and of how large the collections are or how often they’re used, the bytecode tax is there no matter what.) The reason I didn’t continue benchmarking with more lambdas is that, to my surprise, even with only a single lambda each for filter and map, loopOptimal was significantly faster than loopSequences, so I already had the case I was looking for in loopSequences’s most favorable scenario. I am interested in how more lambdas would impact loopSequences’s behavior, so I’ll try adding those tonight.

Out of curiosity, what happens if you write the same benchmarks in Java 8? Java compiles lambdas differently so I am curious if that would make a difference.

I added some setup to make inlining the sequence lambdas more difficult:

        val random = Random()
        generateSequence { random.nextInt() }
                .take(5_000)
                .filter { it != 0 }
                .map { it * 2 }
                .filter { it > Integer.MIN_VALUE }
                .map { it xor random.nextInt() }
                .count { it % 2 == 0 }

The results:

Benchmark                                      (N)  Mode  Cnt  Score   Error  Units
FilterMapLoopHelper.loopListsOnce           100000  avgt   10  1.677 ± 0.055  ms/op
FilterMapLoopHelper.loopListsTwice          100000  avgt   10  2.592 ± 0.022  ms/op
FilterMapLoopHelper.loopOptimalOnce         100000  avgt   10  1.012 ± 0.048  ms/op
FilterMapLoopHelper.loopOptimalTwice        100000  avgt   10  1.105 ± 0.058  ms/op
FilterMapLoopHelper.loopSequencesOnce       100000  avgt   10  2.339 ± 0.164  ms/op
FilterMapLoopHelper.loopSequencesOnceJava   100000  avgt   10  2.681 ± 0.215  ms/op
FilterMapLoopHelper.loopSequencesTwice      100000  avgt   10  3.394 ± 0.225  ms/op
FilterMapLoopHelper.loopSequencesTwiceJava  100000  avgt   10  3.396 ± 0.282  ms/op

That should settle things. loopOptimalOnce was over twice as fast as loopSequencesOnce, and loopOptimalTwice was over three times as fast as loopSequencesTwice. More interestingly, loopSequences was significantly less performant than loopLists in both cases, despite the fact that the IDE will suggest replacing chained list operations with sequence operations.
(Writing the sequence code directly in Java didn’t appear to have an impact.)

Per a suggestion from the Slack channel, I also ran performance tests on code using flatMap and mapNotNull:

    @Benchmark
    fun loopFlatMapOnce(): List<Optional<Int>> =
            DATA_FOR_TESTING
                    .flatMap { listOfNotNull(it.takeIf { it.onSale }?.price) }

    @Benchmark
    fun loopFlatMapTwice(): List<Int> =
            DATA_FOR_TESTING
                    .flatMap { listOfNotNull(it.takeIf { it.onSale }?.price?.takeIf { it.isPresent }?.get()) }

    @Benchmark
    fun loopMapNotNullOnce(): List<Optional<Int>> =
            DATA_FOR_TESTING
                    .mapNotNull { it.takeIf { it.onSale }?.price }

    @Benchmark
    fun loopMapNotNullTwice(): List<Int> =
            DATA_FOR_TESTING
                    .mapNotNull { it.takeIf { it.onSale }?.price?.takeIf { it.isPresent }?.get() }

The results:

Benchmark                                   (N)  Mode  Cnt  Score   Error  Units
FilterMapLoopHelper.loopFlatMapOnce      100000  avgt   10  8.329 ± 0.132  ms/op
FilterMapLoopHelper.loopFlatMapTwice     100000  avgt   10  7.414 ± 0.079  ms/op
FilterMapLoopHelper.loopMapNotNullOnce   100000  avgt   10  1.140 ± 0.030  ms/op
FilterMapLoopHelper.loopMapNotNullTwice  100000  avgt   10  1.204 ± 0.067  ms/op
FilterMapLoopHelper.loopOptimalOnce      100000  avgt   10  1.087 ± 0.060  ms/op
FilterMapLoopHelper.loopOptimalTwice     100000  avgt   10  1.073 ± 0.017  ms/op

We can see that creating N/2 or N/4 additional List and Iterator objects is terrible for performance, while the null checks introduced by using mapNotNull have a measurable, but far from terrible, impact. Regardless, if one wanted to do filter { ... }.take(50).map { ... }, mapNotNull is suddenly no longer a reasonable option.
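That short-circuiting is easy to observe with a small sketch (hypothetical counter name): a sequence stops pulling from the source as soon as take has produced enough elements, which no mapNotNull rewrite over the whole list can replicate:

```kotlin
// Count how many source elements the filter examines when only
// `wanted` results are requested downstream of it.
fun sourceElementsExamined(data: List<Int>, wanted: Int): Int {
    var examined = 0
    data.asSequence()
        .filter { examined++; it % 2 == 0 }
        .take(wanted)
        .map { it * 2 }
        .toList()
    return examined
}
```

For `(1..100).toList()` and `wanted = 3`, only the first six source elements are examined.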

I have a similar idea, which I named “inline yield”. Notice that your FilteringStream example

inline class FilteringStream<T>(private val sourceStream: Stream<T>, private inline val predicate: (T) -> Boolean): Stream<T> {
    override fun forEach(operation: (T) -> Unit) {
        sourceStream.forEach { item: T ->
            if (predicate(item)) {
                operation(item)
            }
        }
    }
}

can be written with yield as

inline fun <T> Sequence<T>.filter(crossinline predicate: (T) -> Boolean): Sequence<T> = sequence {
    this@filter.forEach { item: T ->
        if (predicate(item)) {
            yield(item)
        }
    }
}

My idea is that the compiler could perform an optimization similar to the one you describe when this function is marked inline.
I have no idea whether this optimization can be generalized to all coroutines.
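For reference, the yield-based shape already runs today via the sequence builder, just without the proposed inlining (`filterYield` is a hypothetical name, chosen to avoid clashing with the stdlib filter):

```kotlin
// A filter built on the sequence builder and yield: lazy like the stdlib
// version, but the predicate remains a lambda object rather than inlined.
fun <T> Sequence<T>.filterYield(predicate: (T) -> Boolean): Sequence<T> = sequence {
    for (item in this@filterYield) {
        if (predicate(item)) yield(item)
    }
}
```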


I agree: if we could inline yield through some perhaps difficult, but entirely possible, inlining steps, that would work as well. It would also make it possible to zip two of these inlined sequences together; otherwise at least one of them would have to be fully iterator-based, even though we know the optimal zip of two sequences should only require the original iterators of each sequence’s source list.

Another update: I ran a benchmark using a modified version of the Renaissance benchmark suite’s Scrabble. I converted it into two Kotlin versions, one using typical functional sequences and the other manually inlining all sequence operations. I figured that, as an existing benchmark, this would give a better idea of the real-world impact of sequences.

The results:

Benchmark                                   Mode  Cnt    Score    Error  Units
ScrabbleBenchmark.kotlinScrabbleFunctional  avgt   20  627.553 ± 10.382  ms/op
ScrabbleBenchmark.kotlinScrabbleImperative  avgt   20  544.754 ± 13.472  ms/op

That’s about a 13% improvement just from manually inlining a few sequences.