Sequences and Inlined Lambdas

Yes, for such things one would have to use the existing Sequence type. The inline classes here exist only to couple an object with an inlined lambda and attach behavior to it, and I can think of no better representation for that than the existing class. Other use cases would require using Sequence or storing the Stream into a collection.

As promised, I completed my benchmarks, using the Java Microbenchmark Harness and mostly following the instructions here.

The very dumb data class:
data class Item(val onSale: Boolean, val price: Optional<Int>)
The setup:

        val random = Random()
        val DATA_FOR_TESTING: MutableList<Item> = ArrayList()
        for (i in 0 until N) {
            val onSale = random.nextBoolean()
            val hasPrice = random.nextBoolean()
            val price = if (hasPrice) Optional.of(random.nextInt()) else Optional.empty()
            DATA_FOR_TESTING.add(Item(onSale, price))
        }

The benchmarks:

    fun Blackhole.consumeResult(test: () -> Any) = consume(test.invoke())

    @Benchmark
    fun loopOptimalOnce(bh: Blackhole) = bh.consumeResult {
        val resultList = mutableListOf<Optional<Int>>()
        DATA_FOR_TESTING.forEach { item ->
            if (item.onSale) {
                val price = item.price
                resultList.add(price)
            }
        }
    }

    @Benchmark
    fun loopOptimalTwice(bh: Blackhole) = bh.consumeResult {
        val resultList = mutableListOf<Int>()
        DATA_FOR_TESTING.forEach { item ->
            if (item.onSale) {
                val price = item.price
                if (price.isPresent) {
                    resultList.add(price.get())
                }
            }
        }
    }

    @Benchmark
    fun loopListsOnce(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .filter { it.onSale }
            .map { it.price }
    }

    @Benchmark
    fun loopListsTwice(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .filter { it.onSale }
            .map { it.price }
            .filter { it.isPresent }
            .map { it.get() }
    }

    @Benchmark
    fun loopSequencesOnce(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .asSequence()
            .filter { it.onSale }
            .map { it.price }
            .toList()
    }

    @Benchmark
    fun loopSequencesTwice(bh: Blackhole) = bh.consumeResult {
        DATA_FOR_TESTING
            .asSequence()
            .filter { it.onSale }
            .map { it.price }
            .filter { it.isPresent }
            .map { it.get() }
            .toList()
    }

And the results:

Benchmark                                  (N)  Mode  Cnt  Score    Error  Units
FilterMapLoopHelper.loopListsOnce         1000  avgt    5  0.006 ±  0.001  ms/op
FilterMapLoopHelper.loopListsOnce        10000  avgt    5  0.105 ±  0.004  ms/op
FilterMapLoopHelper.loopListsOnce       100000  avgt    5  1.673 ±  0.373  ms/op
FilterMapLoopHelper.loopListsTwice        1000  avgt    5  0.012 ±  0.001  ms/op
FilterMapLoopHelper.loopListsTwice       10000  avgt    5  0.191 ±  0.005  ms/op
FilterMapLoopHelper.loopListsTwice      100000  avgt    5  2.370 ±  0.336  ms/op
FilterMapLoopHelper.loopOptimalOnce       1000  avgt    5  0.003 ±  0.001  ms/op
FilterMapLoopHelper.loopOptimalOnce      10000  avgt    5  0.078 ±  0.004  ms/op
FilterMapLoopHelper.loopOptimalOnce     100000  avgt    5  0.980 ±  0.082  ms/op
FilterMapLoopHelper.loopOptimalTwice      1000  avgt    5  0.005 ±  0.001  ms/op
FilterMapLoopHelper.loopOptimalTwice     10000  avgt    5  0.074 ±  0.001  ms/op
FilterMapLoopHelper.loopOptimalTwice    100000  avgt    5  1.044 ±  0.091  ms/op
FilterMapLoopHelper.loopSequencesOnce     1000  avgt    5  0.006 ±  0.002  ms/op
FilterMapLoopHelper.loopSequencesOnce    10000  avgt    5  0.123 ±  0.002  ms/op
FilterMapLoopHelper.loopSequencesOnce   100000  avgt    5  1.513 ±  0.116  ms/op
FilterMapLoopHelper.loopSequencesTwice    1000  avgt    5  0.012 ±  0.001  ms/op
FilterMapLoopHelper.loopSequencesTwice   10000  avgt    5  0.219 ±  0.007  ms/op
FilterMapLoopHelper.loopSequencesTwice  100000  avgt    5  2.241 ±  0.319  ms/op

For loopOptimal, the Once and Twice results weren’t too far apart, and for N = 10000, Twice was actually faster. However, both loopLists and loopSequences took 1.5-2x longer on Twice. Comparing across approaches, loopOptimal was by far the fastest, even on Once, where only loopSequences came close enough for its error bars to touch, at N = 1000. Beyond that, loopOptimal was clearly and significantly faster than both of the others, by over 2x in the Twice, N = 10000 case.

This is, of course, a very simple example compared to what you might find in filter and map, but it’s not too out of the ordinary, and even complicated examples would benefit from having less overhead from how the filter and map are structured.

The JVM (at least the Java 8 JVM) will likely not inline this type of Sequence code in the machine code it compiles to. The JVM can only inline a call when the call site can reach just one or two target methods (as measured during the JVM’s profiling); the call site is the JVM bytecode instruction that performs the call. The filter/map/etc. transforms will likely each have only a single bytecode call site, but in a real application that uses sequences, each transform will be used with three or more distinct lambdas. See my SO post here for more details.

The exception is small benchmarks such as the one in the previous post, where the JVM will most likely only see the few lambdas being benchmarked during its profiling. To get a more realistic number, you need to make sure the transforms are first called with at least three different lambdas a few thousand times, so that the JVM profiler gets to see them, before running the real benchmarks.
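A hedged sketch of such a warm-up (the function name, lambdas, and iteration counts are illustrative, not from the original benchmark): call the same transforms with several distinct lambdas so the JVM profiler sees the call sites as megamorphic before measurement starts.

```kotlin
// Warm-up sketch: exercise filter/map call sites with three distinct lambda
// classes each, so the JVM's profile no longer sees a mono/bimorphic site.
fun polluteSequenceCallSites(data: List<Int>, rounds: Int = 10_000): Long {
    var sink = 0L
    repeat(rounds) {
        sink += data.asSequence().filter { it > 0 }.map { it + 1 }.count()
        sink += data.asSequence().filter { it % 2 == 0 }.map { it * 2 }.count()
        sink += data.asSequence().filter { it < 100 }.map { it - 1 }.count()
    }
    return sink // return something so the work cannot be elided
}
```

Run once in setup, before the measured iterations, so the measured numbers reflect the megamorphic (non-inlined) dispatch a real application would see.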


I agree, several things could be done to improve on this example. (If anyone has suggestions for more realistic work to be done in each filter and map step, I’d be happy to benchmark those as well, but these simple field accesses are surprisingly common in the code I’ve seen. Note again that regardless of performance, or even how large the collections are and how often they’re used, the bytecode tax is going to be there no matter what.) The reason I didn’t continue benchmarking with more lambdas is that, to my surprise, even with only a single lambda each for filter and map, loopOptimal was significantly faster than loopSequences, so I already had the case I was looking for even in loopSequences’s most favorable scenario. I am interested in how more lambdas would impact loopSequences’s behavior, so I’ll try adding those tonight.

Out of curiosity, what happens if you write the same benchmarks in Java 8? Java compiles lambdas differently so I am curious if that would make a difference.

I added some setup to make inlining the sequence lambdas more difficult:

        val random = Random()
        generateSequence { random.nextInt() }
                .take(5_000)
                .filter { it != 0 }
                .map { it * 2 }
                .filter { it > Integer.MIN_VALUE }
                .map { it xor random.nextInt() }
                .count { it % 2 == 0 }

The results:

Benchmark                                      (N)  Mode  Cnt  Score   Error  Units
FilterMapLoopHelper.loopListsOnce           100000  avgt   10  1.677 ± 0.055  ms/op
FilterMapLoopHelper.loopListsTwice          100000  avgt   10  2.592 ± 0.022  ms/op
FilterMapLoopHelper.loopOptimalOnce         100000  avgt   10  1.012 ± 0.048  ms/op
FilterMapLoopHelper.loopOptimalTwice        100000  avgt   10  1.105 ± 0.058  ms/op
FilterMapLoopHelper.loopSequencesOnce       100000  avgt   10  2.339 ± 0.164  ms/op
FilterMapLoopHelper.loopSequencesOnceJava   100000  avgt   10  2.681 ± 0.215  ms/op
FilterMapLoopHelper.loopSequencesTwice      100000  avgt   10  3.394 ± 0.225  ms/op
FilterMapLoopHelper.loopSequencesTwiceJava  100000  avgt   10  3.396 ± 0.282  ms/op

That should settle things. loopOptimalOnce was over twice as fast as loopSequencesOnce, and loopOptimalTwice was over three times as fast as loopSequencesTwice. More interestingly, loopLists is significantly less performant than loopSequences in both cases, despite the fact that the IDE will suggest replacing inlined list operations with sequence operations.
(Writing the sequence code directly in Java didn’t appear to have an impact.)

Per a suggestion from the Slack channel, I also ran performance test on code using flatMap and mapNotNull:

    @Benchmark
    fun loopFlatMapOnce(): List<Optional<Int>> =
        DATA_FOR_TESTING
            .flatMap { listOfNotNull(it.takeIf { it.onSale }?.price) }

    @Benchmark
    fun loopFlatMapTwice(): List<Int> =
        DATA_FOR_TESTING
            .flatMap { listOfNotNull(it.takeIf { it.onSale }?.price?.takeIf { it.isPresent }?.get()) }

    @Benchmark
    fun loopMapNotNullOnce(): List<Optional<Int>> =
        DATA_FOR_TESTING
            .mapNotNull { it.takeIf { it.onSale }?.price }

    @Benchmark
    fun loopMapNotNullTwice(): List<Int> =
        DATA_FOR_TESTING
            .mapNotNull { it.takeIf { it.onSale }?.price?.takeIf { it.isPresent }?.get() }

The results:

Benchmark                                   (N)  Mode  Cnt  Score   Error  Units
FilterMapLoopHelper.loopFlatMapOnce      100000  avgt   10  8.329 ± 0.132  ms/op
FilterMapLoopHelper.loopFlatMapTwice     100000  avgt   10  7.414 ± 0.079  ms/op
FilterMapLoopHelper.loopMapNotNullOnce   100000  avgt   10  1.140 ± 0.030  ms/op
FilterMapLoopHelper.loopMapNotNullTwice  100000  avgt   10  1.204 ± 0.067  ms/op
FilterMapLoopHelper.loopOptimalOnce      100000  avgt   10  1.087 ± 0.060  ms/op
FilterMapLoopHelper.loopOptimalTwice     100000  avgt   10  1.073 ± 0.017  ms/op

We can see that creating N/2 or N/4 additional List and Iterator objects is terrible for performance, and that the null checks introduced by using mapNotNull have a measurable, but not absolutely terrible, impact. Regardless, if one wanted to do filter { ... }.take(50).map { ... }, mapNotNull is suddenly no longer a reasonable option.
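A quick sketch of why take changes the picture (the function name and counts below assume the inputs shown here): a sequence stops pulling from the source once take(n) is satisfied, while mapNotNull on a List always visits every element.

```kotlin
// Count how many source elements the filter actually sees when we only
// want the first three even numbers out of a million candidates.
fun visitedForFirstThreeEven(): Pair<List<Int>, Int> {
    var visited = 0
    val firstThree = (1..1_000_000).asSequence()
        .filter { visited++; it % 2 == 0 }
        .take(3)                 // short-circuits: stop after 3 survivors
        .map { it * 10 }
        .toList()
    return firstThree to visited // visited is 6, not 1_000_000
}
```

An eager mapNotNull over the same range would have to evaluate its lambda a million times before any take-like truncation could happen.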

I have a similar idea, which I named “inline yield”. Notice that your FilteringStream example

inline class FilteringStream<T>(private val sourceStream: Stream<T>, private inline val predicate: (T) -> Boolean): Stream<T> {
    override fun forEach(operation: (T) -> Unit) {
        sourceStream.forEach { item: T ->
            if (predicate(item)) {
                operation(item)
            }
        }
    }
}

can be written with yield as

inline fun <T> Sequence<T>.filter(predicate: (T) -> Boolean): Sequence<T> = sequence {
    forEach { item: T ->
        if (predicate(item)) {
            yield(item)
        }
    }
}

My idea is that marking this function inline could enable an optimization similar to the one you describe.
I have no idea whether this optimization can be generalized to all coroutines.
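For reference, a version of this idea that compiles today is the plain (non-inline) sequence-builder form; the inline modifier has to be dropped, since yield inside a non-suspend inline lambda is not currently allowed. filterYield is a hypothetical name used to avoid clashing with the stdlib’s filter.

```kotlin
// Non-inline, compilable sketch of the yield-based filter proposal.
fun <T> Sequence<T>.filterYield(predicate: (T) -> Boolean): Sequence<T> = sequence {
    for (item in this@filterYield) {
        if (predicate(item)) yield(item) // lazily emit only matching items
    }
}
```

For example, `sequenceOf(1, 2, 3, 4).filterYield { it % 2 == 0 }.toList()` yields `[2, 4]`; the proposal is about making this form inlinable so the lambda and yield machinery disappear at compile time.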


I agree: if we could inline yield with some perhaps difficult, but very possible, inlining steps, that would work as well. It would also make it possible to zip two of these inlined sequences together; otherwise at least one of them would have to be fully iterator-based, even though we know the optimal zip of two sequences should only require the original iterators of each sequence’s source list.
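To make the “optimal zip” concrete, here is a sketch of the hand-inlined form (zipLoop is a hypothetical name): one pass, one iterator per source, no intermediate sequence or wrapper objects.

```kotlin
// Hand-fused zip: iterate both sources directly and stop at the shorter one.
fun <A, B, R> zipLoop(a: List<A>, b: List<B>, transform: (A, B) -> R): List<R> {
    val itA = a.iterator()
    val itB = b.iterator()
    val result = ArrayList<R>(minOf(a.size, b.size)) // exact capacity up front
    while (itA.hasNext() && itB.hasNext()) {
        result.add(transform(itA.next(), itB.next()))
    }
    return result
}
```

This is what an ideal compiler transformation could reduce a zip of two inlined sequences to, instead of materializing iterator chains for each side.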

Another update: I ran a benchmark using a modified version of the Renaissance Benchmark’s Scrabble. I converted it into two Kotlin versions, one using typical functional sequences and the other manually inlining all sequence operations. I figured that, as an existing benchmark, this would give a better idea of the impact of sequences.

The results:

Benchmark                                   Mode  Cnt    Score    Error  Units
ScrabbleBenchmark.kotlinScrabbleFunctional  avgt   20  627.553 ± 10.382  ms/op
ScrabbleBenchmark.kotlinScrabbleImperative  avgt   20  544.754 ± 13.472  ms/op

That’s about a 13% improvement just from manually inlining a few sequences.

@gidds, @madmax1028, @chipays – What are your thoughts on the 13% performance improvement Derek mentioned here? I’m also wondering if this is something that could be added.

Interesting read!

So, to conclude: should we be looking to leverage asSequence() where possible, even for simple operations / small lists, to avoid the intermediate list allocations, or not?

If it is not always a clear win in terms of memory / performance, then it sounds like I would be better off not making it a concern when writing new code, and leaving it to the rare occasion when a bottleneck is identified and one revisits the code to make it more performant.

It depends on your use case. On average, Kotlin compiles a chain of map/filter/… to a simple loop, so there are no intermediate allocations.
Sequences might help in the case of very long (even infinite) containers of elements, or when what you do with each element is very heavyweight.

That’s the tricky part, asSequence() avoids the intermediate list allocations, but at the cost of lambda overhead. For tiny operations, either will probably be fine, but for larger collections you’d have to run performance tests to find out which drawback hurts the least, which is why I advocate for enhancing the language’s inlining features so that developers don’t have to make the choice between optimal readability and optimal performance.


Why do you say that Kotlin would optimize a map/filter chain to use no allocations? As I’ve shown, filter followed by map results in an unnecessary allocation of an intermediate list, and a direct optimization would be impossible, because the author may have intended all of the filter operations to take place before the map operations. If there were such an optimization, the manual optimization wouldn’t have shown its significant speed advantage.
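To make the intermediate allocation concrete, here is a sketch on plain Ints (function names are illustrative, and twoPass is a simplification of what the stdlib’s eager filter and map each do, not the stdlib’s exact code):

```kotlin
// Roughly what data.filter { it > 0 }.map { it * 2 } costs on a List:
fun twoPass(data: List<Int>): List<Int> {
    val intermediate = ArrayList<Int>()            // allocated by filter
    for (x in data) if (x > 0) intermediate.add(x)
    val result = ArrayList<Int>(intermediate.size) // allocated by map
    for (x in intermediate) result.add(x * 2)
    return result
}

// The hand-fused equivalent: one pass, one result list, no intermediate.
fun onePass(data: List<Int>): List<Int> {
    val result = ArrayList<Int>()
    for (x in data) if (x > 0) result.add(x * 2)
    return result
}
```

Both produce identical output; the difference is purely the extra list and extra traversal in the two-pass form.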

Ohh, this is an interesting read.

I was thinking about the same problem and if / how it could be improved. My gut feeling was always that the performance hit is relatively small, so it is not worth working on. But I was too lazy to verify this. 13% difference is definitely not worth added complexity, but 1.5-2.5x difference - well, maybe.

Anyway, regardless of the above, it is a very interesting topic and good stretch for our brains :slight_smile:

Did you publish your benchmarks anywhere? Looking at the very first benchmarks you posted here, you don’t consume the results in the optimal case. Because of that, the JVM could remove all of your code entirely and in fact benchmark an empty function. The performance results suggest this is not happening, but I still think it should be fixed.


The 13% performance improvement might not always be worth individual developers replacing Kotlin’s convenience functions with properly optimized loops. But enabling developers to get that benefit conveniently, by doing more work in the compiler so that all developers get a 13% performance improvement for little additional effort, would be phenomenal. Also, for the benchmarks, I called Blackhole.consumeResult, which consumes the result of the method so that it doesn’t get optimized away, though I later learned that just returning the result from the benchmark method is sufficient.

Yes, you used the blackhole, but you didn’t return resultList, so it was unused. You might have made the same mistake in other benchmarks as well, because the classic loop is the only case where we create the result collection manually and have to return it at the end.

Ah, I see, I was mistakenly consuming a Unit instead of the list I was creating! I’ll try to revive my benchmarking setup sometime to see if anything changes.
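The bug in miniature (consume here is a stand-in for the consumeResult helper; demo is an illustrative name): a lambda’s value is its last expression, and forEach returns Unit, so without a trailing resultList, the helper consumes Unit instead of the list.

```kotlin
// Generic pass-through, like the consumeResult helper minus the Blackhole.
fun <T> consume(block: () -> T): T = block()

fun demo(): Pair<Any, Any> {
    val wrong = consume {
        val resultList = mutableListOf<Int>()
        listOf(1, 2, 3).forEach { resultList.add(it) }
    } // wrong == Unit: forEach is the last expression and returns Unit
    val right = consume {
        val resultList = mutableListOf<Int>()
        listOf(1, 2, 3).forEach { resultList.add(it) }
        resultList // the list is now the lambda's result
    }
    return wrong to right
}
```

So the fix to the original benchmarks is a one-line change: end each lambda with resultList so the blackhole receives the built list.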

Why has the developer not yet given an answer on how this works? I think this is a good place to discuss it with them :thinking: