[Feature request] Regex findAll with overlap

I ran into an annoyance today where findAll did not match overlapping parts. This is sane, to be fair. But having alternative behavior would be nice.

Can we get a findAllWithOverlap on Regex, defined as follows?

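fun findAllWithOverlap(input: CharSequence, startIndex: Int = 0): Sequence<MatchResult> {
    if (startIndex < 0 || startIndex > input.length) {
        throw IndexOutOfBoundsException("Start index out of bounds: $startIndex, input length: ${input.length}")
    }
    return generateSequence({ find(input, startIndex) }) { previous ->
        // Restart one character after the previous match's start, so overlapping matches are found.
        val next = previous.range.first + 1
        // Guard against an empty match at the end of the input, where find() would throw.
        if (next <= input.length) find(input, next) else null
    }
}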

Making a pull request seems like a pain, since I would probably need to write the JS implementation as well (I don't know how the Kotlin repo works), so I thought I'd just do most of the work here and let the core team decide.
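For anyone who wants to try the idea without touching the stdlib, here is a minimal sketch of the same logic as an extension function. The name findAllWithOverlap is the hypothetical one from this proposal, not an existing Kotlin API, and the guard against running past the end of the input is my own addition.

```kotlin
// Hypothetical extension version of the proposed function; not part of the Kotlin stdlib.
fun Regex.findAllWithOverlap(input: CharSequence, startIndex: Int = 0): Sequence<MatchResult> {
    if (startIndex < 0 || startIndex > input.length) {
        throw IndexOutOfBoundsException("Start index out of bounds: $startIndex, input length: ${input.length}")
    }
    return generateSequence({ find(input, startIndex) }) { previous ->
        // Search again from one character after the start of the previous match.
        val next = previous.range.first + 1
        // Stop instead of calling find() with an index past the end of the input.
        if (next <= input.length) find(input, next) else null
    }
}

fun main() {
    val input = "1234 5678"
    // Standard behaviour: matches do not overlap.
    println(Regex("\\d\\d\\d").findAll(input).map { it.value }.joinToString())
    // prints: 123, 567

    // Sketch from this post: one search per start position, so matches may overlap.
    println(Regex("\\d\\d\\d").findAllWithOverlap(input).map { it.value }.joinToString())
    // prints: 123, 234, 567, 678
}
```

Each search starts strictly after the start of the previous match, so the sequence yields at most one match per start position and always terminates.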


I suppose you've been working on this year's Advent of Code, then?


I believe this function is not necessary, because we can easily “enable overlapping” by using a lookahead in the regex.

Example

Suppose we have the haystack string: 1234 5678

Using the simple regex \d\d\d on the haystack gives us two matches: 123 and 567.

If we wrap the regex in a positive lookahead with a capturing group inside it, i.e.
(?=(\d\d\d))

the match itself becomes zero-width, so the engine retries at every position instead of skipping past the matched digits. This returns 4 matches, and group 1 of each one contains the digits:
123, 234, 567 and 678

Example in Kotlin

Code (playground):

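fun main() {
    // Plain pattern: each match consumes its characters, so matches cannot overlap.
    println(
        Regex("\\d\\d\\d")
            .findAll("1234 5678")
            .map { it.groupValues[0] }
            .joinToString()
    )

    // Lookahead pattern: each match is zero-width; the digits are captured in group 1.
    println(
        Regex("(?=(\\d\\d\\d))")
            .findAll("1234 5678")
            .map { it.groupValues[1] }
            .joinToString()
    )
}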

Execution result:

123, 567
123, 234, 567, 678
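
If the wrapping gets repetitive, it could be hidden behind a small helper. This is only a sketch: findAllOverlapping is a hypothetical name, not a stdlib function, and it assumes the original pattern contains no numbered backreferences, since the extra capturing group shifts every existing group number by one.

```kotlin
// Hypothetical helper, not part of the Kotlin stdlib.
// Wraps the receiver's pattern in a lookahead plus a capturing group and
// returns the text captured by that group for every match.
fun Regex.findAllOverlapping(input: CharSequence): Sequence<String> =
    Regex("(?=(${pattern}))", options)
        .findAll(input)
        .map { it.groupValues[1] }

fun main() {
    println(Regex("\\d\\d\\d").findAllOverlapping("1234 5678").joinToString())
    // prints: 123, 234, 567, 678

    println(Regex("one|eight").findAllOverlapping("oneight").joinToString())
    // prints: one, eight
}
```

Note that the helper returns plain strings rather than MatchResult objects, because the range of each lookahead match is empty and the groups of the original pattern are renumbered.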