Should Kotlin support strings as Unicode sequences instead of UTF-16?


#1

Hi, while I appreciate that Kotlin's Java heritage will make it hard to really get away from char/UTF-16, I was wondering if it would be a good idea to have Kotlin support and encourage more correct use of Unicode.

I know it’s much easier to just support Strings in the same way that Java does, but the whole UTF-16/surrogate pair thing is just pain waiting to happen.
If strings are going to be directly iterable, can you please consider iterating code points (int) rather than chars?

Ideally, the string type should just be seen as code points normally (however it’s stored underneath), and while there should be ways to see it as UTF-16 (and maybe even UTF-8), it’d be really nice if people just got used to it as a “sequence of Unicode code points” for all normal usage.
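To make the difference concrete, here is a minimal sketch of iterating by code point on the JVM as it stands today. It relies only on the existing codePointAt and Character.charCount methods; the helper name codePointsOf is made up for this example and is not part of any library.

```kotlin
// Hypothetical helper: walks a String by code point instead of by char.
fun codePointsOf(s: String): List<Int> {
    val result = ArrayList<Int>()
    var i = 0
    while (i < s.length) {
        val cp = s.codePointAt(i)       // reads a surrogate pair as one value
        result.add(cp)
        i += Character.charCount(cp)    // 1 char for BMP, 2 for supplementary
    }
    return result
}

fun main() {
    val s = "abc\uD950\uDF21efg"
    println(s.length)               // UTF-16 chars: 8
    println(codePointsOf(s).size)   // code points: 7
}
```

Iterating chars directly would hand back the two surrogates \uD950 and \uDF21 separately, which is exactly the trap I'd like the default iteration to avoid.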

Just a thought,
  David


#2

Thinking about it a bit more I'm wondering if strings should be treated sufficiently abstractly to permit multiple underlying encodings. Having some sort of UnicodeString interface that can sit over the top of Java strings (UTF-16) and byte sequences (UTF-8) might be a nice approach. To support random access uniformly, you'd want to do something interesting with indices, but something like this might work:

var str = UnicodeString("abc\uD950\uDF21efg")
// str.length() == 7 (there are 7 code points, not 8)

var idx = str.indexOf('e')
// idx.size() == 4 - it represents 4 code points from the start of the string
// idx is not an int, it is something like a 'CodePointIndex' - you can do some limited index arithmetic with it, but not things like * or /.

var newStr = str.substring(0, idx)
// newStr = "abc\uD950\uDF21", splitting at char index 5 to give the 4 code points.
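A rough sketch of what the UTF-16-backed version might look like follows. Everything here (UnicodeString, CodePointIndex, their methods) is hypothetical and only illustrates the idea. It leans on the existing JVM methods codePointAt, codePointCount and offsetByCodePoints. For simplicity, substring takes a CodePointIndex for both ends rather than a plain 0, and indexOf takes the code point as an Int.

```kotlin
// Hypothetical types, for illustration only.

// An index counted in code points; deliberately not an Int,
// so only limited arithmetic (+ and -) is available.
data class CodePointIndex(val count: Int) {
    operator fun plus(n: Int) = CodePointIndex(count + n)
    operator fun minus(n: Int) = CodePointIndex(count - n)
    fun size(): Int = count
}

class UnicodeString(private val chars: String) {
    // Length in code points, not UTF-16 chars.
    fun length(): Int = chars.codePointCount(0, chars.length)

    // Index of the first occurrence of codePoint, or null if absent.
    fun indexOf(codePoint: Int): CodePointIndex? {
        var charIdx = 0
        var cpIdx = 0
        while (charIdx < chars.length) {
            val cp = chars.codePointAt(charIdx)
            if (cp == codePoint) return CodePointIndex(cpIdx)
            charIdx += Character.charCount(cp)
            cpIdx++
        }
        return null
    }

    // Translates code point indices to char offsets before slicing.
    fun substring(start: CodePointIndex, end: CodePointIndex): UnicodeString {
        val from = chars.offsetByCodePoints(0, start.count)
        val to = chars.offsetByCodePoints(from, end.count - start.count)
        return UnicodeString(chars.substring(from, to))
    }

    override fun toString(): String = chars
}

fun main() {
    val str = UnicodeString("abc\uD950\uDF21efg")
    val idx = str.indexOf('e'.code)!!
    println(str.length())                                  // 7
    println(idx.size())                                    // 4
    println(str.substring(CodePointIndex(0), idx).toString().length) // 5 chars
}
```

Note that indexOf and substring are linear scans in this sketch; a real implementation would presumably want the index object to cache the underlying char offset.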

Then you could have a UTF-8 encoded string based on an underlying byte[] representing the same code-point sequence.

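// 10 bytes: 'a', 'b', 'c', the 4-byte sequence F1 A4 8C A1, then 'e', 'f', 'g'
val bytes = byteArrayOf('a'.code.toByte(), 'b'.code.toByte(), 'c'.code.toByte(), 0xF1.toByte(), 0xA4.toByte(), 0x8C.toByte(), 0xA1.toByte(), 'e'.code.toByte(), 'f'.code.toByte(), 'g'.code.toByte())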

var str = UnicodeString(bytes)

var idx = str.indexOf('e')
var newStr = str.substring(0, idx)
// newStr splits at byte index 7 to give the same 4 code points as above.

Now you get to implicitly turn UTF-8 data into strings without having to inflate it into chars.
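For the UTF-8 side, the core operation is turning a code point index into a byte offset. Here is a minimal sketch, assuming the input is well-formed UTF-8 (no validation is done); the function name byteOffsetOfCodePoint is made up for this example. It steps over one lead byte per code point and then skips any continuation bytes, which always have the bit pattern 10xxxxxx.

```kotlin
// Hypothetical helper: byte offset of the n-th code point in well-formed UTF-8.
fun byteOffsetOfCodePoint(utf8: ByteArray, n: Int): Int {
    var offset = 0
    var remaining = n
    while (remaining > 0 && offset < utf8.size) {
        offset++ // step over the lead byte
        // skip continuation bytes (bit pattern 10xxxxxx)
        while (offset < utf8.size && (utf8[offset].toInt() and 0xC0) == 0x80) {
            offset++
        }
        remaining--
    }
    return offset
}

fun main() {
    val bytes = byteArrayOf(
        'a'.code.toByte(), 'b'.code.toByte(), 'c'.code.toByte(),
        0xF1.toByte(), 0xA4.toByte(), 0x8C.toByte(), 0xA1.toByte(),
        'e'.code.toByte(), 'f'.code.toByte(), 'g'.code.toByte()
    )
    val cut = byteOffsetOfCodePoint(bytes, 4)
    println(cut) // 7
    // Decoding only to compare against the UTF-16 example above.
    println(String(bytes, 0, cut, Charsets.UTF_8) == "abc\uD950\uDF21") // true
}
```

The substring itself would then just be a view over bytes 0 until 7, with no char inflation needed until something actually asks for UTF-16.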

It’s certainly not trivial and you still have the issue that there are plenty of existing Java char functions that don’t respect code-points correctly, but I think it could be made to work.


#3

If we make Kotlin strings differ from Java strings, all the interoperability will go to hell.

It looks like this could be an opt-in thing: a bunch of library classes