Working with Unicode characters outside the BMP

Two related but different questions:

  1. Is there a standard way to represent a Unicode character (code point) in Kotlin? The type Char does not represent a Unicode character, but rather a UTF-16 code unit, which can be either a Unicode character from the BMP (which is only a subset of all Unicode characters) or one half of a UTF-16 surrogate pair (which is not a Unicode character at all).

  2. Is there a way to query the Unicode properties (such as category) of arbitrary Unicode characters? The extension property Char.category from kotlin.text, for example, only works for Chars, so it does not apply to characters outside the BMP.

Not really. If you’re running on the JVM, strings are UTF-16 encoded and you can use String.codePointAt() to get the full code point as an Int, but that function isn’t available on other platforms.
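
For the JVM case, a minimal sketch might look like the following. It walks a string with String.codePointAt() and, as an addition not mentioned above, uses java.lang.Character.getType(Int) to look up the general category of each code point. The helper names codePointsOf and categoryOf are made up for illustration, and the whole thing only compiles for Kotlin/JVM.

```kotlin
// JVM-only sketch: relies on java.lang.Character and String.codePointAt().

// Hypothetical helper: collects every code point of a string as an Int.
fun codePointsOf(text: String): List<Int> {
    val result = mutableListOf<Int>()
    var i = 0
    while (i < text.length) {
        val cp = text.codePointAt(i)
        result.add(cp)
        // 1 for BMP code points, 2 for supplementary ones (a surrogate pair).
        i += Character.charCount(cp)
    }
    return result
}

// Hypothetical helper: general category as the Int constant used by
// java.lang.Character (for example Character.UPPERCASE_LETTER).
fun categoryOf(codePoint: Int): Int = Character.getType(codePoint)

fun main() {
    // "A" followed by U+1D400 (MATHEMATICAL BOLD CAPITAL A), written as a surrogate pair.
    val text = "A\uD835\uDC00"
    for (cp in codePointsOf(text)) {
        println("U+" + cp.toString(16).uppercase() + " isLetter=" + Character.isLetter(cp))
    }
}
```

The Int overloads of the Character methods (isLetter, getType, and so on) accept supplementary code points, which is what the Char-based extensions in kotlin.text cannot do.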

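Char.isSurrogate() works on all platforms, but isn’t very helpful: it only tells you that a Char is one half of a surrogate pair, so you still have to combine the two halves yourself.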
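
If you need code points on every platform, one option is to combine surrogate pairs by hand. The sketch below assumes Kotlin 1.5 or later (for Char.code) and uses Char.isHighSurrogate() and Char.isLowSurrogate(); the function name decodeCodePoints is made up for illustration. It only decodes; it does not give you Unicode properties such as the category.

```kotlin
// Hypothetical helper: decodes a UTF-16 string into code points using only
// Char-level functions. Unpaired surrogates are passed through as their raw
// code unit value, which is a choice of this sketch.
fun decodeCodePoints(text: String): List<Int> {
    val result = mutableListOf<Int>()
    var i = 0
    while (i < text.length) {
        val c = text[i]
        if (c.isHighSurrogate() && i + 1 < text.length && text[i + 1].isLowSurrogate()) {
            // Standard UTF-16 decoding: 10 bits from each half, offset by 0x10000.
            val high = c.code - 0xD800
            val low = text[i + 1].code - 0xDC00
            result.add(0x10000 + (high shl 10) + low)
            i += 2
        } else {
            result.add(c.code)
            i += 1
        }
    }
    return result
}

fun main() {
    // "a" followed by U+1F600, written as the surrogate pair D83D DE00.
    println(decodeCodePoints("a\uD83D\uDE00").map { it.toString(16) })
}
```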