Kotlin regex force start at index


#1

I am aware of https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/-regex/find.html
which says: regex must start at OR AFTER the start index

I want a way to say: regex MUST start at PRECISELY start_index.

I have tried the following:

fun main (args: Array<String>) {
    val s = "1a2"
    val r1 = Regex("""\d+""")
    println( r1.find(s, 0) )
    println( r1.find(s, 1) )
    println( r1.find(s, 2) )

    println("=====")
    val r2 = Regex("""^\d+""")
    println( r2.find(s, 0) )
    println( r2.find(s, 1) )
    println( r2.find(s, 2) )

}

which returns: match, match, match (for r1)
and match, null, null (for r2)

I want something that returns (match, null, match)

I want to say: match \d+, but you must start PRECISELY at start_index.

This would mean: 1a2 --> matches when index = 0 or 2; fails when index = 1 (since the character at index 1 is ‘a’)

Is this possible? To say: search for regex starting PRECISELY at start_index

I tried “^”, hoping it would match at start_index, but it looks like it is hardcoded to mean the start of the input.

Thanks!


#2

    // r2 is the anchored regex ^\d+ from the question
    println( r2.find(s.substring(0)) )
    println( r2.find(s.substring(1)) )
    println( r2.find(s.substring(2)) )

Is this OK?


#3

Or compile your regex with the offset built in:

    println( Regex("""^.{0}\d+""").find(s) )
    println( Regex("""^.{1}\d+""").find(s) )
    println( Regex("""^.{2}\d+""").find(s) )

#4

I failed to state: I am using regex to tokenize a very long string. Neither of the above works for the following reasons:

substring:
may copy the rest of the string once per token

recompiling regex:
creates a new regex once per token
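
For reference, here is a minimal sketch that avoids both problems, assuming Kotlin 1.7 or newer, where Regex.matchAt is (to my knowledge) a stable standard-library function. The helper name matchExactlyAt is made up for illustration. The regex is compiled once and the input is never copied:

```kotlin
// Sketch only: assumes Kotlin 1.7+ so that Regex.matchAt is available.
// matchExactlyAt is a hypothetical helper name, not a stdlib function.
fun matchExactlyAt(regex: Regex, input: CharSequence, index: Int): String? =
    regex.matchAt(input, index)?.value   // null unless a match starts exactly at index

fun main() {
    val s = "1a2"
    val digits = Regex("""\d+""")          // compiled once, reused for every index
    println(matchExactlyAt(digits, s, 0))  // 1
    println(matchExactlyAt(digits, s, 1))  // null
    println(matchExactlyAt(digits, s, 2))  // 2
}
```

On older Kotlin versions this call may be missing or marked experimental, in which case java.util.regex.Matcher is the fallback.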


#5

If your use case is tokenization, maybe https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.text/split.html is what you are looking for?


#6

Split would be great if my tokenization were context-free. Unfortunately, my tokenization is context-sensitive. The tokenization is something like:

  1. there is a current “state”
  2. this state defines a list of valid regexes to try
  3. depending on which regex we match on, we go to a new “state”
  4. … and so forth …

split would require that there be no state, and that all regexes be valid to use at all times
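
The loop described above could be sketched roughly as follows. The states, rules, and token kinds are invented purely for illustration, and the sketch assumes java.util.regex (Matcher.region plus lookingAt) is used to try each regex at exactly the current position:

```kotlin
import java.util.regex.Pattern

// Hypothetical states and rules, for illustration only.
enum class State { CODE, STRING }

data class Rule(val pattern: Pattern, val kind: String, val next: State)
data class Token(val kind: String, val text: String)

// Each state lists the regexes that are valid in it; all are compiled once.
val rules: Map<State, List<Rule>> = mapOf(
    State.CODE to listOf(
        Rule(Pattern.compile("""\d+"""), "NUMBER", State.CODE),
        Rule(Pattern.compile("""[a-z]+"""), "IDENT", State.CODE),
        Rule(Pattern.compile("\""), "QUOTE", State.STRING),
        Rule(Pattern.compile("""\s+"""), "SPACE", State.CODE)
    ),
    State.STRING to listOf(
        Rule(Pattern.compile("[^\"]+"), "TEXT", State.STRING),
        Rule(Pattern.compile("\""), "QUOTE", State.CODE)
    )
)

fun tokenize(input: String): List<Token> {
    val tokens = mutableListOf<Token>()
    var state = State.CODE
    var pos = 0
    while (pos < input.length) {
        var matched = false
        for (rule in rules.getValue(state)) {
            // region + lookingAt: the match must begin exactly at pos
            val m = rule.pattern.matcher(input).region(pos, input.length)
            if (m.lookingAt() && m.end() > pos) {
                tokens += Token(rule.kind, m.group())
                pos = m.end()
                state = rule.next      // the matching rule decides the next state
                matched = true
                break
            }
        }
        require(matched) { "No rule matches at index $pos in state $state" }
    }
    return tokens
}

fun main() {
    println(tokenize("""ab 12 "x 9" 7""").map { "${it.kind}:${it.text}" })
}
```

Note that the 9 inside the quotes comes out as part of a TEXT token rather than a NUMBER, because the NUMBER regex is not in the rule list for the STRING state.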


#7

Not sure if you really need it… still:

^ means start of input (or start of line in multiline mode)…
Maybe use a negative lookbehind on \d: (?<!\d)\d+


#8

Java's java.util.regex.Matcher (created from a java.util.regex.Pattern) seems to have a bit more utility than Kotlin's Regex. Link: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html
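
A minimal sketch of what that could look like from Kotlin, using Matcher.region to set where matching begins and Matcher.lookingAt to require the match to start at the beginning of that region. The helper name lookingAtIndex is made up for this example:

```kotlin
import java.util.regex.Pattern

// Returns the text matched by pattern starting exactly at start, or null.
// lookingAtIndex is a hypothetical helper name, not a library function.
fun lookingAtIndex(pattern: Pattern, input: CharSequence, start: Int): String? {
    val m = pattern.matcher(input)
    m.region(start, input.length)                  // restrict matching to [start, end)
    return if (m.lookingAt()) m.group() else null  // lookingAt anchors at the region start
}

fun main() {
    val digits = Pattern.compile("""\d+""")   // compiled once
    val s = "1a2"
    println(lookingAtIndex(digits, s, 0))  // 1
    println(lookingAtIndex(digits, s, 1))  // null
    println(lookingAtIndex(digits, s, 2))  // 2
}
```

This creates a new Matcher per call for simplicity; for a long input, a single Matcher per pattern could probably be kept and re-pointed with region on each step.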


#9

@Varia : I will look into Java/Matcher. Thanks!