asSequence().filter{} vs .filter{}

Recently, and by pure accident I figured out that collection.filter{}(or any other processing functions) creates list object. concatenating .filter{}.map{} will create multiple lists which is not efficient.

to do it efficiently I need to call .asSequence() and then collect to something like .toList().

Why was it designed this way?

First of all it is not trivial and actually requires digging into the documentation, more over I have to remember it every time. It is actually much simpler in java as I always have to invoke “.stream”.

It can produce funky mistakes. for example

val x = list.asSequence().distinct() // vs list.distinct()
x.contains("Hello World")

x will get Sequence type which has contains function, which actually does linear search. If asSequence is omitted I will get hash set and code will do O(1).

Intellij does not always help with highlighting a warning, so I am getting a code base where every filter/map creates new collection.

I think blank map/filter/ect. functions are confusing and can be a source of bad performance errors.

Am I missing the beauty of that design?

What exactly is your preferred approach? We can have only lazy transformations, only eager transformations or both of them. Kotlin have both which seems to be the most flexible, because the developer can choose depending on the case.

After using Kotlin for some time, I had to jump back to Java and do some collection transformations. Even though I have plenty of past Java experience, I was surprised when I couldn’t find simple filter() or map() for collections. I totally forgot how inconvenient and cumbersome Java is :-/

Regarding your example: we can say the same about e.g. List.contains() method in Java. We should remove it, because otherwise someone who doesn’t understand data structures will use it repeatedly with O(n) complexity instead of converting to set to get O(1). As a developer we have to understand data structures we use and choose them intentionally. There is no way around it.

2 Likes

I prefer less confusion and things to remember.
Kotlin’s way of working with streams is primarily focused on removing unneeded code.
So rather having functions that return actual collections, I would prefer

list.map{}.toList() //or any other collector type ​
map/filter and other similar functions should only operate on sequences . In this case I still get smaller code but with single behavior. I don’t have to remember that .filter in one contexts will give me actual object and sequence in another.

My example with contains was more about the confusion that can happen when
list.map{}.distinct() replaced by list.asSequence().map{}.distinct(). As kotlin does not require types in variable definition, simple addition of asSequence can change performance characteristics of code without developer even noticing it.

I understand your point. I think I just have an apposite opinion on what is confusing.

One thing that I like in Kotlin and its stdlib is that it tries to support same kinds of transformations for many different types of data and keep its API consistent. Let’s get filter() as an example. We have filter() function/extension for: iterables, maps (it’s not an iterable), arrays, primitive arrays, sequences, strings, char sequences, flows and ranges (actually, ranges are iterables). forEach() probably has even more uses, e.g. it works for iterators (and iterables!).

Each above case is a separate implementation of filter(), in some cases it works in a very similar way (e.g. list vs array) and sometimes it does a much different thing (list vs sequence vs flow). But conceptually they’re all almost the same, they have the same or similar API and for these reasons in most cases they do exactly what the developer expects. So for me it’s actually less confusing and there’s less information to remember.

Yes and again: the same can be said about Java when switching between lists and sets, because they’re both collections. Switching between linked lists and array lists is even worse, because it doesn’t at all affect types. I don’t really think the language can protect developers from using data structures carelessly.

Whenever I need to perform some data transformations I intentionally choose between going eager or lazy. Usually eager by default, lazy if needed. Functions returning sequences are rather rare and in most cases they either have “sequence” in their name, they’re part of some bigger library that works mostly on sequences or they’re used in other scenario where they could be expected, e.g. when reading files. So for me it seems hard to confuse collections with sequences.

For this specific case, if the resulting values are not nullable, there’s a more efficient alternative: call mapNotNull{ }, and have the lambda return null for values to be filtered out.

(Of course, this doesn’t address the more general point, but it’s always worth being aware of options.)

As explained in documentation:

So, the sequences let you avoid building results of intermediate steps, therefore improving the performance of the whole collection processing chain. However, the lazy nature of sequences adds some overhead which may be significant when processing smaller collections or doing simpler computations. Hence, you should consider both Sequence and Iterable and decide which one is better for your case.

And just to give an example of “simpler computation”, where using sequence is actually less performant, consider a situation when you need only a single operation: collection.map{} or collection.asSequence().map{}.toList().

In the collection.map{} case you benefit from the fact that the size of the original collection is known - the results list is expected to have same size, so backing array of this size is allocated once and then filled with mapped values.

In the collection.asSequence().map{}.toList() case the size is unknown. Depending on the size of original collection, significant time may be spent on resizing the backing array and copying elements.

2 Likes

I think the performance impact is much harder to evaluate. My gut feeling (never tried to run benchmarks) is:

Benefits of eager processing:

  • Implementation is simpler and simpler often means - faster. There is no overhead of on-demand processing.
  • Sequences may be harder to optimize by the compiler/JIT. Collections are inlined, so there are no function calls in the bytecode and the compiler/JIT has very easy job understanding what is going on. Sequences are a chain of data transformers that call each other and I’m not sure how good is JIT at optimizing this.

Benefits of lazy processing:

  • It is a clear winner if at some point we only have to process e.g. first X elements.
  • It can be more memory efficient in some cases, e.g. if we read data while processing it (reading files line by line, etc.). My guess would be that it is not more memory-efficient in general, but I may be wrong.
  • It is potentially better for CPU cache.

I’m interested in any real benchmarks to see how these two compare. As said earlier, I personally use eager processing by default and switch to sequences in specific cases, when I believe it will be more performant.

.distinct on list.distinct() vs. list.asSequence.distinct() has very different effect. First returns set data structure, second sequence with unique elements.

Sure language can’t protect from careless developer, but it can minimize chance of error. For example, I think it is better to keep distinct function name for sequence manipulations only, so non-sequence related code needs to call toSet e.g. list.filter{}.toSet(). In this case when I transform data structures, I can clearly demonstrate new data structure type to the reader. If I work with sequences, I can use more generic terms like distinct, unique etc.

This function name overloading harms the language rather helping it.

I’m sorry, but I don’t understand what you mean.

list.distinct() “returns a list containing only distinct elements from the given collection.”

list.asSequence().distinct() “returns a sequence containing only distinct elements from the given sequence.”

Which is exactly as I would expect. The only difference is that collection operations are eager and sequence operations are lazy, which I guess is the whole purpose of the two types of data sources.

And what does Set have to do with it? Or am I missing something?

2 Likes

this is actually good point, I missed that distinct in first case returns list, not set. Cool story.