Should Kotlin support strings as Unicode sequences instead of UTF-16?

Hi,

While I appreciate that Kotlin's Java heritage will make it hard to really get away from char/UTF-16, I was wondering if it would be a good idea to have Kotlin support and encourage more correct use of Unicode.

I know it’s much easier to just support Strings in the same way that Java does, but the whole UTF-16/surrogate pair thing is just pain waiting to happen.
If strings are going to be directly iterable, can you please consider iterating code points (int) rather than chars?

Ideally, the string type should normally be seen as a sequence of code points (however it's stored underneath). There should still be ways to see it as UTF-16 (and maybe even UTF-8), but it’d be really nice if people just got used to it as a “sequence of Unicode code points” for all normal usage.
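To make the difference concrete, here is a minimal sketch of code-point iteration over an ordinary JVM string. `codePointsOf` is a hypothetical helper name, not something in the standard library; it only relies on `codePointAt` and `Character.charCount`, which are available on the JVM.

```kotlin
// Hypothetical helper: walks a UTF-16 string and returns its code points.
// A surrogate pair is decoded into one Int instead of two chars.
fun codePointsOf(s: String): List<Int> {
    val result = mutableListOf<Int>()
    var i = 0
    while (i < s.length) {
        val cp = s.codePointAt(i)      // decodes a surrogate pair if one starts at i
        result.add(cp)
        i += Character.charCount(cp)   // advance by 1 or 2 chars
    }
    return result
}
```

For a string such as `"abc\uD950\uDF21efg"`, `length` counts 8 chars while `codePointsOf` returns 7 entries, because the surrogate pair is a single code point.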

Just a thought,
  David

Thinking about it a bit more I'm wondering if strings should be treated sufficiently abstractly to permit multiple underlying encodings. Having some sort of UnicodeString interface that can sit over the top of Java strings (UTF-16) and byte sequences (UTF-8) might be a nice approach. To support random access uniformly, you'd want to do something interesting with indices, but something like this might work:

var str = UnicodeString("abc\uD950\uDF21efg")
// str.length() == 7 (there are 7 code points, not 8)

var idx = str.indexOf('e')
// idx.size() == 4 - it represents 4 code points from the start of the string
// idx is not an int, it is something like a 'CodePointIndex' - you can do some limited index arithmetic with it, but not things like * or /.

var newStr = str.substring(0, idx)
// newStr = "abc\uD950\uDF21", splitting at char index 5 to give the 4 code points.
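A rough sketch of how the UTF-16-backed version might look on the JVM. `UnicodeString` and `CodePointIndex` are the hypothetical names from the example above, not existing Kotlin types. Passing the code point as an `Int`, taking a `CodePointIndex` for both ends of `substring`, and returning `null` when nothing is found are my own assumptions.

```kotlin
// Hypothetical index type: a distance in code points from the start of the string.
// It is deliberately not an Int, so only + and - are offered (no * or /).
data class CodePointIndex(val size: Int) {
    operator fun plus(n: Int) = CodePointIndex(size + n)
    operator fun minus(n: Int) = CodePointIndex(size - n)
}

// Hypothetical wrapper that presents a UTF-16 String as a sequence of code points.
class UnicodeString(private val utf16: String) {
    // Number of code points, not UTF-16 code units.
    fun length(): Int = utf16.codePointCount(0, utf16.length)

    // Position of the first occurrence of codePoint, or null if it is absent.
    fun indexOf(codePoint: Int): CodePointIndex? {
        var charIndex = 0
        var seen = 0
        while (charIndex < utf16.length) {
            val cp = utf16.codePointAt(charIndex)
            if (cp == codePoint) return CodePointIndex(seen)
            charIndex += Character.charCount(cp)
            seen++
        }
        return null
    }

    // Translates code-point positions into char offsets before slicing.
    fun substring(start: CodePointIndex, end: CodePointIndex): UnicodeString {
        val from = utf16.offsetByCodePoints(0, start.size)
        val to = utf16.offsetByCodePoints(from, end.size - start.size)
        return UnicodeString(utf16.substring(from, to))
    }

    override fun toString(): String = utf16
}
```

Each lookup here is a linear scan, so this is random access in name only; a real implementation would probably want to cache offsets.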

Then you could have a UTF-8 encoded string based on an underlying byte[] representing the same code-point sequence.

val bytes = array<Byte>(10, 'a', 'b', 'c', 0xF1, 0xA4, 0x8C, 0xA1, 'e', 'f', 'g')

var str = UnicodeString(bytes)

var idx = str.indexOf('e')
var newStr = str.substring(0, idx)
// newStr splits at byte index 7 to give the same 4 code points as above.

Now you get to implicitly turn UTF-8 data into strings without having to inflate it into chars.
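For the UTF-8 side, the index translation could be sketched like this. The function names are hypothetical, and the code assumes the byte array is already well-formed UTF-8 (it does no validation). It relies only on the fact that continuation bytes have the bit pattern `10xxxxxx`.

```kotlin
// A byte starts a new code point unless it is a continuation byte (10xxxxxx).
fun isLeadByte(b: Byte): Boolean = (b.toInt() and 0xC0) != 0x80

// Number of code points in well-formed UTF-8 data.
fun utf8CodePointCount(bytes: ByteArray): Int = bytes.count { isLeadByte(it) }

// Byte offset at which the code point with the given index starts.
// An index equal to the code point count maps to bytes.size (the end).
fun utf8ByteOffset(bytes: ByteArray, codePointIndex: Int): Int {
    var seen = 0
    for (i in bytes.indices) {
        if (isLeadByte(bytes[i])) {
            if (seen == codePointIndex) return i
            seen++
        }
    }
    require(seen == codePointIndex) { "code point index out of range" }
    return bytes.size
}
```

With the ten bytes from the example, code point index 4 maps to byte offset 7, so the split happens without ever decoding the data into chars.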

It’s certainly not trivial and you still have the issue that there are plenty of existing Java char functions that don’t respect code-points correctly, but I think it could be made to work.

If we make Kotlin strings differ from Java strings, all the interoperability will go to hell.

It looks like this could be an opt-in thing: a bunch of library classes


In the iOS world, Swift's String works differently from NSString, and interoperability is not needed. But they did that from the beginning, starting with Swift 1. Now it seems to be too late for Kotlin :confused: I have a lot of trouble iterating over glyphs in Kotlin, like the family icon :family_man_woman_girl_girl:. In Swift it’s so easy.

Yeah, I ran into a similar problem a while ago. The problem is that many emoji aren’t just a single Unicode code point but actually multiple ones. That family icon is something like man + woman + girl + girl.
I ended up using ANTLR with this grammar file. It’s probably not the best solution but it works in my situation (small hobby project).
I was thinking about maybe creating a proper library that adds full Unicode support to Kotlin, but I’m not sure I have the time and I also don’t really know where to start. Maybe in a year or so, who knows.
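To illustrate the point about the family icon: as far as I know it is encoded as four emoji joined by zero-width joiners (U+200D), so even code-point iteration sees seven items rather than one. A small sketch, with `family` and `describe` as made-up names:

```kotlin
// man + ZWJ + woman + ZWJ + girl + ZWJ + girl, written as UTF-16 escapes
val family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC67"

// Reports how the same string looks at the two lower levels of abstraction.
fun describe(s: String): String =
    "${s.length} UTF-16 code units, ${s.codePointCount(0, s.length)} code points"
```

`describe(family)` reports 11 code units and 7 code points, while a reader sees a single glyph.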

Unicode is quite complicated, and while a string made up of code points instead of UTF-16 code units would make certain operations simpler, it still wouldn’t help with combining characters or any of the other issues that most programmers don’t really know about.

In the particular case of emojis with combining characters, the solution is to use BreakIterator.getCharacterInstance() to get an iterator that returns a sequence of grapheme clusters, which is what most people are looking for when they are trying to iterate over characters.
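A minimal JVM-only sketch of that approach. `graphemeClusters` is a hypothetical helper name; `BreakIterator` is the standard `java.text` class. How emoji sequences are segmented depends on the JDK version and the Unicode data it ships, so treat results for emoji as something to verify on your own runtime.

```kotlin
import java.text.BreakIterator

// Splits a string into the units BreakIterator treats as user-perceived characters.
fun graphemeClusters(s: String): List<String> {
    val boundaries = BreakIterator.getCharacterInstance()
    boundaries.setText(s)
    val result = mutableListOf<String>()
    var start = boundaries.first()
    var end = boundaries.next()
    while (end != BreakIterator.DONE) {
        result.add(s.substring(start, end))   // one cluster per boundary pair
        start = end
        end = boundaries.next()
    }
    return result
}
```

For example, `"e\u0301x"` (an `e` followed by a combining acute accent, then `x`) comes back as two clusters, even though it is three chars and three code points.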

The problem for me was that this class is part of Java and my project is multiplatform. I had to link ICU in order to get this functionality in native code. A Unicode library as part of the Kotlin standard library would be useful.
