Kotlin regex force start at index

I am aware of find (Kotlin Programming Language documentation),
which says the regex must start at OR AFTER the start index.

I want a way to say: regex MUST start at PRECISELY start_index.

I have tried the following:

fun main(args: Array<String>) {
    val s = "1a2"
    val r1 = Regex("""\d+""")
    println( r1.find(s, 0) )  // match ("1")
    println( r1.find(s, 1) )  // match ("2", found further along at index 2)
    println( r1.find(s, 2) )  // match ("2")

    println("=====")
    val r2 = Regex("""^\d+""")
    println( r2.find(s, 0) )  // match ("1")
    println( r2.find(s, 1) )  // null
    println( r2.find(s, 2) )  // null
}

which returns: match, match, match (for r1)
and match, null, null (for r2)

I want something that returns (match, null, match)

I want to say: match \d+, but you must start PRECISELY at start_index.

This would mean 1a2 → matches when index = 0 or 2; fails when index = 1 (since the character at the start is ‘a’).

Is this possible? To say: search for regex starting PRECISELY at start_index

I tried “^”, hoping it would match at start_index, but it looks like it is hardcoded to be start-of-line.

Thanks!

    println( r2.find(s.substring(0)) )
    println( r2.find(s.substring(1)) )
    println( r2.find(s.substring(2)) )

Is this OK? (r2 is the anchored regex ^\d+ from above.)

Or compile your regex with the offset built in:

    println( Regex("""^.{0}\d+""").find(s) )
    println( Regex("""^.{1}\d+""").find(s) )
    println( Regex("""^.{2}\d+""").find(s) )

I failed to state: I am using regex to tokenize a very long string. Neither of the above works for the following reasons:

substring:
may copy the remainder of the string once per token

recompiling regex:
creates a new Regex once per token

If your use case is tokenization, maybe split - Kotlin Programming Language is what you are looking for?

Split would be great if my tokenization was context-free. Unfortunately, my tokenization is context-sensitive. The tokenization is something like:

  1. there is a current “state”
  2. this state defines a list of valid regexes to try
  3. depending on which regex we match on, we go to a new “state”
  4. … and so forth …

split would require that there be no state, and that all regexes be valid to use at all times
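The steps above can be sketched roughly as follows. This is only an illustration, not the poster's actual tokenizer: the State values, the Rule and Token classes, and the patterns are all made up for the example. It also assumes dropping down to java.util.regex is acceptable, using Matcher.region plus lookingAt so that each pattern is compiled once and must match exactly at the current position, with no substring copies.

```kotlin
import java.util.regex.Pattern

// Hypothetical states and rules, invented for this sketch.
enum class State { CODE, STRING }

data class Rule(val pattern: Pattern, val next: State)
data class Token(val text: String, val start: Int, val state: State)

// Each state lists the regexes that are valid in it, compiled once up front.
val rules: Map<State, List<Rule>> = mapOf(
    State.CODE to listOf(
        Rule(Pattern.compile("""\d+"""), State.CODE),
        Rule(Pattern.compile("""[a-z]+"""), State.CODE),
        Rule(Pattern.compile("\""), State.STRING),   // opening quote switches state
        Rule(Pattern.compile("""\s+"""), State.CODE)
    ),
    State.STRING to listOf(
        Rule(Pattern.compile("[^\"]+"), State.STRING),
        Rule(Pattern.compile("\""), State.CODE)      // closing quote switches back
    )
)

fun tokenize(input: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var state = State.CODE
    var pos = 0
    while (pos < input.length) {
        var matched = false
        for (rule in rules.getValue(state)) {
            // region() limits matching to [pos, end); lookingAt() anchors at the region start.
            val m = rule.pattern.matcher(input).region(pos, input.length)
            if (m.lookingAt() && m.end() > pos) {    // skip empty matches to guarantee progress
                tokens += Token(m.group(), pos, state)
                pos = m.end()                        // absolute index into the original input
                state = rule.next
                matched = true
                break
            }
        }
        check(matched) { "No rule matches at index $pos in state $state" }
    }
    return tokens
}

fun main() {
    tokenize("12 ab\"x 1\"3").forEach(::println)
}
```

Each Token records the state in which it was matched, so the same character sequence can be tokenized differently depending on context.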

Not sure if you really need it… still:

^ means start of input (or start of line in multiline mode)…
Maybe use a negative lookbehind on \d: (?<!\d)\d+

Java's java.util.regex.Matcher (created from a java.util.regex.Pattern) seems to have a bit more utility than Kotlin's Regex. Link: Matcher (Java Platform SE 8)
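A minimal sketch of what that could look like for the original 1a2 example. The helper name matchExactlyAt is made up here; the calls it relies on are Matcher.region (restrict matching to a sub-range of the input) and Matcher.lookingAt (match must begin at the start of that range).

```kotlin
import java.util.regex.Pattern

// Hypothetical helper: returns the text matched by `pattern` starting
// exactly at `start`, or null if the pattern does not match there.
fun matchExactlyAt(pattern: Pattern, input: CharSequence, start: Int): String? {
    val m = pattern.matcher(input)
    m.region(start, input.length)   // no copy of the input is made
    return if (m.lookingAt()) m.group() else null
}

fun main() {
    val digits = Pattern.compile("""\d+""")  // compiled once, reused for every index
    val s = "1a2"
    println(matchExactlyAt(digits, s, 0))  // 1
    println(matchExactlyAt(digits, s, 1))  // null
    println(matchExactlyAt(digits, s, 2))  // 2
}
```

The positions reported by m.start() and m.end() refer to the original input, so no offset correction is needed afterwards.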


@Varia: I will look into Java/Matcher. Thanks!

I agree with many here. It’s not nice to be forced to use Java/Matcher just for this. Using substring combined with “^” in the regex is just a cheap trick (with needless string copying), and you still have to fix up the positions in MatchResult.range.
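For what it's worth, newer versions of the Kotlin standard library have Regex.matchAt(input, index), which appears to do exactly what was asked for here. A small sketch, assuming a Kotlin version where matchAt is available (it did not exist in older releases):

```kotlin
fun main() {
    val s = "1a2"
    val digits = Regex("""\d+""")

    // matchAt succeeds only if a match begins exactly at the given index.
    println(digits.matchAt(s, 0)?.value)  // 1
    println(digits.matchAt(s, 1)?.value)  // null
    println(digits.matchAt(s, 2)?.value)  // 2

    // The range refers to the original string, so no offset fix-up is needed.
    println(digits.matchAt(s, 2)?.range)  // 2..2
}
```

This keeps a single compiled Regex and avoids both the substring copy and the position correction.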