Yes, for such things one would have to use existing `Sequence`s. The inline classes here exist only for the purpose of coupling an object with an inlined lambda and attaching behavior to it; I can think of no better representation for that than the existing `class` construct. Other use cases would require using `Sequence` or storing the `Stream` into a collection.
As promised, I completed my benchmarks, using the Java Microbenchmark Harness and mostly following the instructions here.
The very dumb data class:

```kotlin
data class Item(val onSale: Boolean, val price: Optional<Int>)
```
The setup:

```kotlin
val random = Random()
val DATA_FOR_TESTING: MutableList<Item> = ArrayList()
for (i in 0 until N) {
    val onSale = random.nextBoolean()
    val hasPrice = random.nextBoolean()
    val price = if (hasPrice) Optional.of(random.nextInt()) else Optional.empty()
    DATA_FOR_TESTING.add(Item(onSale, price))
}
```
The benchmarks:

```kotlin
fun Blackhole.consumeResult(test: () -> Any) = consume(test.invoke())

@Benchmark
fun loopOptimalOnce(bh: Blackhole) = bh.consumeResult {
    val resultList = mutableListOf<Optional<Int>>()
    DATA_FOR_TESTING.forEach { item ->
        if (item.onSale) {
            val price = item.price
            resultList.add(price)
        }
    }
}

@Benchmark
fun loopOptimalTwice(bh: Blackhole) = bh.consumeResult {
    val resultList = mutableListOf<Int>()
    DATA_FOR_TESTING.forEach { item ->
        if (item.onSale) {
            val price = item.price
            if (price.isPresent) {
                resultList.add(price.get())
            }
        }
    }
}

@Benchmark
fun loopListsOnce(bh: Blackhole) = bh.consumeResult {
    DATA_FOR_TESTING
        .filter { it.onSale }
        .map { it.price }
}

@Benchmark
fun loopListsTwice(bh: Blackhole) = bh.consumeResult {
    DATA_FOR_TESTING
        .filter { it.onSale }
        .map { it.price }
        .filter { it.isPresent }
        .map { it.get() }
}

@Benchmark
fun loopSequencesOnce(bh: Blackhole) = bh.consumeResult {
    DATA_FOR_TESTING
        .asSequence()
        .filter { it.onSale }
        .map { it.price }
        .toList()
}

@Benchmark
fun loopSequencesTwice(bh: Blackhole) = bh.consumeResult {
    DATA_FOR_TESTING
        .asSequence()
        .filter { it.onSale }
        .map { it.price }
        .filter { it.isPresent }
        .map { it.get() }
        .toList()
}
```
And the results:

```text
Benchmark                               (N)     Mode  Cnt  Score   Error  Units
FilterMapLoopHelper.loopListsOnce       1000    avgt    5  0.006 ± 0.001  ms/op
FilterMapLoopHelper.loopListsOnce       10000   avgt    5  0.105 ± 0.004  ms/op
FilterMapLoopHelper.loopListsOnce       100000  avgt    5  1.673 ± 0.373  ms/op
FilterMapLoopHelper.loopListsTwice      1000    avgt    5  0.012 ± 0.001  ms/op
FilterMapLoopHelper.loopListsTwice      10000   avgt    5  0.191 ± 0.005  ms/op
FilterMapLoopHelper.loopListsTwice      100000  avgt    5  2.370 ± 0.336  ms/op
FilterMapLoopHelper.loopOptimalOnce     1000    avgt    5  0.003 ± 0.001  ms/op
FilterMapLoopHelper.loopOptimalOnce     10000   avgt    5  0.078 ± 0.004  ms/op
FilterMapLoopHelper.loopOptimalOnce     100000  avgt    5  0.980 ± 0.082  ms/op
FilterMapLoopHelper.loopOptimalTwice    1000    avgt    5  0.005 ± 0.001  ms/op
FilterMapLoopHelper.loopOptimalTwice    10000   avgt    5  0.074 ± 0.001  ms/op
FilterMapLoopHelper.loopOptimalTwice    100000  avgt    5  1.044 ± 0.091  ms/op
FilterMapLoopHelper.loopSequencesOnce   1000    avgt    5  0.006 ± 0.002  ms/op
FilterMapLoopHelper.loopSequencesOnce   10000   avgt    5  0.123 ± 0.002  ms/op
FilterMapLoopHelper.loopSequencesOnce   100000  avgt    5  1.513 ± 0.116  ms/op
FilterMapLoopHelper.loopSequencesTwice  1000    avgt    5  0.012 ± 0.001  ms/op
FilterMapLoopHelper.loopSequencesTwice  10000   avgt    5  0.219 ± 0.007  ms/op
FilterMapLoopHelper.loopSequencesTwice  100000  avgt    5  2.241 ± 0.319  ms/op
```
For `loopOptimal`, the values between `Once` and `Twice` weren't far apart, and for N = 10000, `Twice` was actually faster. However, both `loopLists` and `loopSequences` took 1.5-2x longer on `Twice`. Comparing across implementations, `loopOptimal` was by far the fastest, even on `Once`, where only `loopSequences` came close enough for its error bars to touch, at N = 1000. Beyond that, `loopOptimal` was clearly significantly faster than both of the others; for the `Twice`, N = 10000 case, by over 2x.
This is, of course, a very simple example compared to what you might find in real `filter` and `map` calls, but it's not too out of the ordinary, and even complicated examples would benefit from having less overhead in how the `filter` and `map` are structured.
The JVM (at least the Java 8 JVM) will likely not inline this type of `Sequence` code in the machine code it compiles. The JVM can only inline a call when the call site can only reach one or two target methods (as measured when the JVM does its profiling); the call site is the JVM bytecode that performs the call. The `filter`/`map`/etc. transforms will likely each have only a single bytecode call site, but in a real application that uses sequences, each transform will be used with more than three distinct lambdas. See my SO post here for more details.
The exception is small benchmarks such as the one in the previous post, where the JVM will most likely only see the few lambdas being benchmarked during its profiling. To get a more realistic number, you need to make sure that the transforms are first called with at least three different lambdas a few thousand times, so that the JVM profiler gets to see them, before running the real benchmarks.
I agree, several things could be done to improve on this example. (If anyone has suggestions for more realistic work to be done in each `filter` and `map` step, I'd be happy to benchmark those as well, but these simple field accesses are surprisingly common in the code I've seen. Note again that regardless of performance, or even how large the collections are and how often they're used, the bytecode tax is going to be there no matter what.) The reason I didn't continue benchmarking with more lambdas is that, to my surprise, even with only a single lambda each for `filter` and `map`, `loopOptimal` was significantly faster than `loopSequences`, so I already had the case I was looking for even in `loopSequences`'s most optimal scenario. I am interested in how more lambdas would impact `loopSequences`'s behavior, so I'll try adding those tonight.
Out of curiosity, what happens if you write the same benchmarks in Java 8? Java compiles lambdas differently, so I am curious whether that would make a difference.
I added some setup to make inlining the sequence lambdas more difficult:

```kotlin
val random = Random()
generateSequence { random.nextInt() }
    .take(5_000)
    .filter { it != 0 }
    .map { it * 2 }
    .filter { it > Integer.MIN_VALUE }
    .map { it xor random.nextInt() }
    .count { it % 2 == 0 }
```
The results:

```text
Benchmark                                   (N)     Mode  Cnt  Score   Error  Units
FilterMapLoopHelper.loopListsOnce           100000  avgt   10  1.677 ± 0.055  ms/op
FilterMapLoopHelper.loopListsTwice          100000  avgt   10  2.592 ± 0.022  ms/op
FilterMapLoopHelper.loopOptimalOnce         100000  avgt   10  1.012 ± 0.048  ms/op
FilterMapLoopHelper.loopOptimalTwice        100000  avgt   10  1.105 ± 0.058  ms/op
FilterMapLoopHelper.loopSequencesOnce       100000  avgt   10  2.339 ± 0.164  ms/op
FilterMapLoopHelper.loopSequencesOnceJava   100000  avgt   10  2.681 ± 0.215  ms/op
FilterMapLoopHelper.loopSequencesTwice      100000  avgt   10  3.394 ± 0.225  ms/op
FilterMapLoopHelper.loopSequencesTwiceJava  100000  avgt   10  3.396 ± 0.282  ms/op
```
That should settle things. `loopOptimalOnce` was over twice as fast as `loopSequencesOnce`, and `loopOptimalTwice` was over three times as fast as `loopSequencesTwice`. More interestingly, `loopSequences` is significantly less performant than `loopLists` in both cases, despite the fact that the IDE will suggest replacing chained list operations with sequence operations.
(Writing the sequence code directly in Java didn’t appear to have an impact.)
Per a suggestion from the Slack channel, I also ran performance tests on code using `flatMap` and `mapNotNull`:
```kotlin
@Benchmark
fun loopFlatMapOnce(): List<Optional<Int>> =
    DATA_FOR_TESTING
        .flatMap { listOfNotNull(it.takeIf { it.onSale }?.price) }

@Benchmark
fun loopFlatMapTwice(): List<Int> =
    DATA_FOR_TESTING
        .flatMap { listOfNotNull(it.takeIf { it.onSale }?.price?.takeIf { it.isPresent }?.get()) }

@Benchmark
fun loopMapNotNullOnce(): List<Optional<Int>> =
    DATA_FOR_TESTING
        .mapNotNull { it.takeIf { it.onSale }?.price }

@Benchmark
fun loopMapNotNullTwice(): List<Int> =
    DATA_FOR_TESTING
        .mapNotNull { it.takeIf { it.onSale }?.price?.takeIf { it.isPresent }?.get() }
```
The results:

```text
Benchmark                                (N)     Mode  Cnt  Score   Error  Units
FilterMapLoopHelper.loopFlatMapOnce      100000  avgt   10  8.329 ± 0.132  ms/op
FilterMapLoopHelper.loopFlatMapTwice     100000  avgt   10  7.414 ± 0.079  ms/op
FilterMapLoopHelper.loopMapNotNullOnce   100000  avgt   10  1.140 ± 0.030  ms/op
FilterMapLoopHelper.loopMapNotNullTwice  100000  avgt   10  1.204 ± 0.067  ms/op
FilterMapLoopHelper.loopOptimalOnce      100000  avgt   10  1.087 ± 0.060  ms/op
FilterMapLoopHelper.loopOptimalTwice     100000  avgt   10  1.073 ± 0.017  ms/op
```
We can see that creating N/2 or N/4 additional `List` and `Iterator` objects is terrible for performance, and the null checks introduced by using `mapNotNull` do have a measurable, but not absolutely terrible, impact. Regardless, if one wanted to do `filter { ... }.take(50).map { ... }`, `mapNotNull` is suddenly no longer a reasonable option.
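To illustrate that last point, here is a minimal sketch, assuming a simplified `Item` with a nullable price instead of `Optional` (this `Item` variant and the `firstPrices` helper are hypothetical): `mapNotNull` fuses the filter and map steps into one lambda, so there is nowhere to put `take(50)` between them, while the separate-steps sequence form expresses it directly:

```kotlin
// Simplified stand-in for the benchmark's Item (nullable price instead of Optional).
data class Item(val onSale: Boolean, val price: Int?)

// Separate steps: take(50) bounds how many on-sale items are ever mapped.
// A fused mapNotNull has no seam where this limit could be inserted.
fun firstPrices(items: List<Item>): List<Int?> =
    items.asSequence()
        .filter { it.onSale }
        .take(50)
        .map { it.price }
        .toList()
```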
I have a similar idea, which I named "inline yield". Notice that your `FilteringStream` example

```kotlin
inline class FilteringStream<T>(private val sourceStream: Stream<T>, private inline val predicate: (T) -> Boolean) : Stream<T> {
    override fun forEach(operation: (T) -> Unit) {
        sourceStream.forEach { item: T ->
            if (predicate(item)) {
                operation(item)
            }
        }
    }
}
```
can be written with `yield` as

```kotlin
inline fun <T> Sequence<T>.filter(predicate: (T) -> Boolean): Sequence<T> = sequence {
    forEach { item: T ->
        if (predicate(item)) {
            yield(item)
        }
    }
}
```
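As a side note, this shape almost compiles today without the proposed `inline` marker; the one catch is that `yield` is a suspending function and can't be called from `forEach`'s non-suspending lambda, so a runnable sketch uses a plain `for` loop instead (renamed `filterYield` here to avoid shadowing the stdlib `filter`):

```kotlin
// Runnable today, minus the proposed inlining: yield is suspending, so
// a plain for loop replaces forEach inside the sequence builder.
fun <T> Sequence<T>.filterYield(predicate: (T) -> Boolean): Sequence<T> = sequence {
    for (item in this@filterYield) {
        if (predicate(item)) {
            yield(item)
        }
    }
}
```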
My idea is that this optimization, similar to your description, could be performed by marking this function `inline`. I have no idea whether this optimization can be generalized to all coroutines.
I agree; if we could inline `yield`, with some perhaps difficult but very possible inlining steps, that would work as well. It would also make it possible to `zip` two of these inlined sequences together; otherwise at least one of them would have to be fully iterator-based, even though we know that the optimal zip of two sequences should only require the original iterators of each sequence's source list.
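For reference, the "optimal zip" described above can be sketched directly against the two source lists' iterators (the `zipWith` helper name is hypothetical):

```kotlin
// A sketch of the optimal zip: walk both sources with their own iterators,
// with no intermediate sequence machinery between them.
fun <A, B, R> zipWith(left: List<A>, right: List<B>, combine: (A, B) -> R): List<R> {
    val result = ArrayList<R>(minOf(left.size, right.size))
    val li = left.iterator()
    val ri = right.iterator()
    while (li.hasNext() && ri.hasNext()) {
        result.add(combine(li.next(), ri.next()))
    }
    return result
}
```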
Another update: I ran a benchmark using a modified version of the Renaissance Benchmarks' Scrabble. I converted it into Kotlin twice: once using typical functional sequences, and once manually inlining all sequence operations. I figured that, as an existing benchmark, this would give a better idea of the impact of sequences.
The results:

```text
Benchmark                                   Mode  Cnt  Score     Error   Units
ScrabbleBenchmark.kotlinScrabbleFunctional  avgt   20  627.553 ± 10.382  ms/op
ScrabbleBenchmark.kotlinScrabbleImperative  avgt   20  544.754 ± 13.472  ms/op
```
That’s about a 13% improvement just from manually inlining a few sequences.
@gidds, @madmax1028, @chipays – What are your thoughts on the 13% performance improvement Derek mentioned here? I'm also wondering if this is something that could be added.
Interesting read!
So, to conclude: should we be looking to leverage `asSequence()` where possible, even for simple operations on small lists, to avoid the intermediate list allocations, or not?
If it is not always a clear win in terms of memory and performance, then it sounds like I would be better off not making it a concern when writing new code, and leaving it to the rare occasion when a bottleneck is identified and one revisits the code to make it more performant.
It depends on your use case. On average, Kotlin compiles a chain of `map`/`filter`/… into a simple loop, so there are no intermediate allocations.
Sequences might help in the case of very long (even infinite) containers of elements, or when what you do with each element is somehow very heavyweight.
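A minimal sketch of that infinite case, where only a lazy sequence terminates (the value name is illustrative):

```kotlin
// generateSequence produces elements lazily, so take(3) stops the
// otherwise infinite stream; an eager List chain could never finish.
val firstThreeEvenSquares: List<Int> =
    generateSequence(1) { it + 1 }   // 1, 2, 3, ...
        .map { it * it }             // 1, 4, 9, 16, 25, 36, ...
        .filter { it % 2 == 0 }      // 4, 16, 36, ...
        .take(3)
        .toList()
```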
That’s the tricky part, asSequence()
avoids the intermediate list allocations, but at the cost of lambda overhead. For tiny operations, either will probably be fine, but for larger collections you’d have to run performance tests to find out which drawback hurts the least, which is why I advocate for enhancing the language’s inlining features so that developers don’t have to make the choice between optimal readability and optimal performance.
Why do you say that Kotlin would optimize a map
/filter
chain to use no allocations? As I’ve shown, filter
followed by map
will result in an unnecessary allocation of an intermediate list, and an direct optimization would be impossible because the author may have intended all of the filter
operations to take place before the map
operations. If there was an optimization, the manual optimization wouldn’t have shown its significant speed advantage.
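To make the allocation point concrete, here is a small sketch of the three shapes under discussion (the helper names are hypothetical; all three compute the same result):

```kotlin
// Chained list operations: filter allocates an intermediate List,
// which map then copies into the final List.
fun viaLists(xs: List<Int>): List<Int> =
    xs.filter { it % 2 == 0 }.map { it * 10 }

// Sequence version: no intermediate List, but per-element lambda and
// iterator overhead instead.
fun viaSequences(xs: List<Int>): List<Int> =
    xs.asSequence().filter { it % 2 == 0 }.map { it * 10 }.toList()

// Manually fused loop: one pass, one result list, no non-inlined lambdas.
fun viaLoop(xs: List<Int>): List<Int> {
    val result = ArrayList<Int>()
    for (x in xs) {
        if (x % 2 == 0) result.add(x * 10)
    }
    return result
}
```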
Ooh, this is an interesting read.
I was thinking about the same problem and whether and how it could be improved. My gut feeling was always that the performance hit is relatively small, so it is not worth working on, but I was too lazy to verify this. A 13% difference is definitely not worth the added complexity, but a 1.5-2.5x difference, well, maybe.
Anyway, regardless of the above, it is a very interesting topic and a good stretch for our brains.
Did you publish your benchmarks anywhere? Looking at the very first benchmarks you posted here, you don't consume the results in the `optimal` case. Because of that, the JVM could entirely remove all your code and in fact benchmark an empty function. The performance results suggest this is not happening, but I still think it should be fixed.
The 13% performance improvement might not always be worth individual developers replacing their use of Kotlin's convenience functions with hand-optimized loops, but doing more work in the compiler so that all developers get that 13% improvement for little additional effort would be phenomenal. Also, for the benchmarks, I called `Blackhole.consumeResult`, which consumes the result of the method so that it doesn't get optimized away, though I later learned that just returning the result from the benchmark method is sufficient.
Yes, you used the blackhole, but you didn't return `resultList`, so it was unused. You might have made the same mistake in other benchmarks as well, though the classic loop is the only case where we create the result collection manually and have to return it at the end.
Ah, I see: I was mistakenly consuming a `Unit` instead of the list I was creating! I'll try to revive my benchmarking setup sometime to see if anything changes.
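The mistake is easy to reproduce in isolation (the `buggy`/`fixed` names are hypothetical): a Kotlin lambda evaluates to its last expression, and `forEach` returns `Unit`, so the blackhole consumed `Unit` rather than the list being built:

```kotlin
// The last expression is forEach, which returns Unit, so the lambda's
// result (what the Blackhole would consume) is Unit, not the list.
val buggy: () -> Any = {
    val resultList = mutableListOf<Int>()
    listOf(1, 2, 3).forEach { if (it > 1) resultList.add(it) }
}

// Ending with resultList makes the list the lambda's result, keeping the
// work that built it observable so the JIT cannot eliminate it.
val fixed: () -> Any = {
    val resultList = mutableListOf<Int>()
    listOf(1, 2, 3).forEach { if (it > 1) resultList.add(it) }
    resultList
}
```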
Why has the developer not yet given an answer about how this works? I think this is a good place to discuss it with them.