String.padEnd(length: Int) and multi code point characters

Wasabi375 · January 30, 2020, 11:52pm

I’m guessing you’re on the JVM. This is a problem with the Java String implementation. While it has no trouble storing multi code point characters, many of its utility functions don’t handle them correctly.
I’m not sure how complicated it would be to provide an alternative implementation.
The problem is that the Java String class stores text as a sequence of UTF-16 code units. Any Unicode code point that doesn’t fit in 16 bits (anything above U+FFFF) is stored as 2 separate Char values, a surrogate pair. Many of the functions on String ignore this. For example, String.length does not return the number of Unicode characters, it returns the number of 16-bit Chars in the String, so some emoji count as 2. If your string also contains modifier characters, like the Unicode characters responsible for emoji skin color, String.length might report a single displayed character as having a length of 4, or even more for some other combinations.
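To make the counting concrete, here is a small sketch for Kotlin on the JVM. The helper name `codePoints` is just something I made up for illustration, and the emoji are written as escapes so the surrogate pairs are visible.

```kotlin
// Hypothetical helper: number of Unicode code points in a string (JVM only).
fun codePoints(s: String): Int = s.codePointCount(0, s.length)

fun main() {
    val plain = "abc"
    // U+1F44D (thumbs up): one code point, stored as a surrogate pair.
    val thumbsUp = "\uD83D\uDC4D"
    // U+1F44D followed by U+1F3FD (skin tone modifier): two code points, four Chars.
    val thumbsUpTone = "\uD83D\uDC4D\uD83C\uDFFD"

    println(plain.length)              // 3
    println(thumbsUp.length)           // 2
    println(codePoints(thumbsUp))      // 1
    println(thumbsUpTone.length)       // 4
    println(codePoints(thumbsUpTone))  // 2

    // padEnd pads to 4 Chars, so only 2 padding characters are added.
    println(thumbsUp.padEnd(4, '.'))
}
```

Note that even the code point count is 2 for the emoji with a skin tone, although it is displayed as a single character.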
There was an iPhone bug with a similar cause in 2015 that led to your phone crashing when you received a particular Arabic text message.

I don’t know of any general workaround. A proper fix would probably require a complete reimplementation of String with a different in-memory representation. Chars aren’t well suited to manipulating strings that go beyond plain text. Maybe avoid emoji for now.
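That said, for padding specifically, a partial workaround might be to count code points instead of Chars. This is only a rough sketch, assuming the JVM target: `padEndByCodePoints` is a hypothetical extension I’m naming here, not part of the standard library.

```kotlin
// Hypothetical extension: pads so that the result has at least `length`
// Unicode code points, instead of `length` UTF-16 Chars like padEnd does.
fun String.padEndByCodePoints(length: Int, padChar: Char = ' '): String {
    require(length >= 0) { "Desired length $length is less than zero." }
    val count = codePointCount(0, this.length)
    if (length <= count) return this
    return this + padChar.toString().repeat(length - count)
}

fun main() {
    val thumbsUp = "\uD83D\uDC4D" // U+1F44D, one code point, two Chars
    println(thumbsUp.padEnd(3, '.'))             // one dot: padEnd counts 2 Chars
    println(thumbsUp.padEndByCodePoints(3, '.')) // two dots: counts 1 code point
}
```

This still miscounts anything made of several code points, such as an emoji with a skin tone modifier, so it only helps with characters outside the BMP and not with the general problem.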
