Convert a String into right charset or detect String's charset

Psijic · December 10, 2020, 8:20am

Hello. I have input string parameters in ISO-8859-1 charset like "ÏÀÎ Ñáåðáàíê ã. Ìîñêâà" while the right value should be Cyrillic "ПАО Сбербанк г. Москва". Currently I can convert it this way:

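import java.nio.charset.Charset

val codeWin = "ÏÀÎ Ñáåðáàíê ã. Ìîñêâà" // the garbled input from above
val w1251: Charset = charset("Windows-1251")
val csISOLatin1: Charset = charset("ISO-8859-1")
println(codeWin.toByteArray(csISOLatin1).toString(w1251))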

But the encoding can be any other charset, so I need to find a method to make my conversion correct for any case. Currently I’m trying to play with this code:

Charset.availableCharsets().forEach {
    try {
        println("${it.key} = " + String(codeWin.toByteArray(it.value), w1251))
    } catch (e: Exception) {
        println("Error: $e")
    }
}

But I have no criterion for deciding which result is correct here (maybe I could check whether the result contains only the symbols expected for the target charset).

Also is it possible to get a String’s charset? I found some libraries here but don’t want to include external solutions.

gidds · December 10, 2020, 9:28am

In general, I don’t think this is possible. Almost any sequence of bytes is valid in ISO-8859-1, Windows-1251, Windows-1252, or any other 8-bit encoding (some of them leave only a handful of byte values unassigned), so there’s no way for the computer to tell.
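To illustrate the point, here is a small sketch that decodes the same bytes with two 8-bit charsets. It assumes the JVM in use ships the Windows-1251 charset, as the code in the question already does. Both decodings succeed without any error; only a human reader can tell which one is meaningful.

```kotlin
fun main() {
    // The bytes a Windows-1251 producer would write for "Москва"
    val bytes = "Москва".toByteArray(charset("Windows-1251"))

    // Both calls succeed; neither charset can reject these bytes
    println(String(bytes, charset("Windows-1251"))) // Москва
    println(String(bytes, Charsets.ISO_8859_1))     // Ìîñêâà
}
```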

(This is different from UTF-8, in which many sequences are invalid; to the point where if a sequence of bytes would be valid UTF-8, then that’s almost certainly what was intended.)

As Joel Spolsky says: It does not make sense to have a string without knowing what encoding it uses.

Psijic · December 10, 2020, 10:54am

This code worked well in Swift for this case. Is the same possible in Kotlin?

if let data = str.data(using: .windowsCP1251){
    print(String(data:data, encoding: .windowsCP1251))
}

if let data = str.data(using: .isoLatin1){
    print(String(data:data, encoding: .windowsCP1251))
}
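A rough Kotlin counterpart of the Swift snippet might look like the sketch below. The helper name reinterpret is made up for illustration. It assumes that returning null when the source charset cannot encode the text is an acceptable stand-in for Swift's data(using:) returning nil.

```kotlin
import java.nio.charset.Charset
import java.nio.charset.StandardCharsets

// Hypothetical helper: encode `text` with `from`, then decode the bytes as `to`.
// Returns null when `text` contains characters that `from` cannot encode.
fun reinterpret(text: String, from: Charset, to: Charset): String? =
    if (from.newEncoder().canEncode(text)) String(text.toByteArray(from), to) else null

fun main() {
    val w1251 = charset("Windows-1251")
    val garbled = "ÏÀÎ Ñáåðáàíê ã. Ìîñêâà"

    println(reinterpret(garbled, w1251, w1251))                       // null
    println(reinterpret(garbled, StandardCharsets.ISO_8859_1, w1251)) // ПАО Сбербанк г. Москва
}
```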

Wasabi375 · December 10, 2020, 10:56am

There are some ways to guess the encoding if you have some knowledge about the language the text is in.
You can try out different encodings and then do a frequency analysis. This means that you count how often each character occurs in the text. You can then compare this with a typical frequency distribution for your target language, e.g. for English you would expect to see the letter “e” more often than the letter “z”. Letter frequency - Wikipedia
At the end you can use the encoding that best matches the letter frequencies of your language. You will never get a perfect match, but it should be close.
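A very reduced sketch of that idea for Russian text follows. The names commonRussianLetters, russianScore and guessCharset are invented for this example, and the letter set is an assumption based on commonly cited Russian letter frequencies. A real detector would compare full frequency tables and needs much more text than one short phrase. The sketch also assumes the JVM provides Windows-1251 and KOI8-R.

```kotlin
import java.nio.charset.Charset

// Assumption: roughly the ten most frequent letters in Russian text.
val commonRussianLetters = "оеаинтсрвл".toSet()

// Share of characters in `text` that are common Russian letters (0.0..1.0).
fun russianScore(text: String): Double =
    if (text.isEmpty()) 0.0
    else text.count { it.lowercaseChar() in commonRussianLetters }.toDouble() / text.length

// Decodes `bytes` with every candidate and returns the best-scoring charset.
fun guessCharset(bytes: ByteArray, candidates: List<Charset>): Charset =
    candidates.maxByOrNull { russianScore(String(bytes, it)) }
        ?: error("no candidate charsets given")

fun main() {
    val bytes = "ПАО Сбербанк г. Москва".toByteArray(charset("Windows-1251"))
    val candidates = listOf(Charsets.ISO_8859_1, charset("KOI8-R"), charset("Windows-1251"))
    println(guessCharset(bytes, candidates) == charset("Windows-1251")) // true
}
```

Note that this works on the original bytes, not on an already decoded String, and that it can still guess wrong for short or unusual input.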

Unless you need to interface with old hardware (or have some other strange restrictions), you should look into using Unicode for any kind of international text, which means using the UTF-8, UTF-16 or UTF-32 encodings. UTF-8 is pretty much the standard for saved files, while Kotlin strings use UTF-16 internally.

Psijic · December 10, 2020, 1:23pm

It’s a shame there is no way to get the right charset. I made my own converter:

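import java.nio.charset.StandardCharsets

// True for U+0020..U+00AE (ASCII plus some Latin-1 symbols) and U+0400..U+04FE (Cyrillic).
// Char.code needs Kotlin 1.5+; older versions use toInt().
private val cyrillicRange: (Char) -> Boolean = { char ->
    char.code in 0x0020 until 0x00AF || char.code in 0x0400 until 0x04FF
}

// Returns the value unchanged if every character is in the accepted ranges, otherwise tries to repair it.
fun checkEncoding(value: String): String =
    if (!value.all(cyrillicRange)) checkCyrillicEncoding(value) else value

private fun checkCyrillicEncoding(value: String): String {
    val charsets = listOf(
        StandardCharsets.UTF_8,
        StandardCharsets.ISO_8859_1,
        StandardCharsets.US_ASCII,
        StandardCharsets.UTF_16,
        StandardCharsets.UTF_16BE,
        StandardCharsets.UTF_16LE
    )
    val w1251 = charset("Windows-1251")

    // Re-encode with each candidate, decode as Windows-1251, keep the first result that passes the check.
    charsets.forEach { candidate ->
        val decoded = String(value.toByteArray(candidate), w1251)
        if (decoded.all(cyrillicRange)) return decoded
    }
    return value
}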

Any suggestions on how it could be improved? I think at least cyrillicRange could be optimized.

jstuyts · December 10, 2020, 1:57pm

A String is always in the Unicode “character set”.

On the JVM it is stored in memory using the UTF-16 encoding. (I assume, but don’t know, it is stored like this on all platforms supported by Kotlin, because this would make it easier to implement the Kotlin String API there.) Because Unicode had fewer than 65,536 characters when Java was designed, the API is a bit awkward now: you can get the units of the UTF-16 encoding (Char) or the actual Unicode code points (Int). Just like you had multi-byte character sets (MBCS) in the old days, a String on the JVM is now stored in memory as a sequence of code points that may each take more than one UTF-16 unit.
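The difference between UTF-16 units and code points can be seen in a short sketch (Kotlin/JVM only, since codePointCount and codePointAt are JVM extensions). The sample string is an arbitrary choice: the letter a followed by U+1F600, which needs two Char values.

```kotlin
fun main() {
    // 'a' followed by U+1F600, written explicitly as a surrogate pair
    val text = "a\uD83D\uDE00"

    println(text.length)                         // 3 (UTF-16 units, i.e. Char values)
    println(text.codePointCount(0, text.length)) // 2 (Unicode code points)
    println(text.codePointAt(1) == 0x1F600)      // true
    println(text[1].isHighSurrogate())           // true (first half of the pair)
}
```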

The first thing you should do is stop converting byte arrays to Strings and then trying to correct the encoding. Depending on the settings used when decoding the byte array, you may:

lose unrecognized code points, i.e. the string misses characters that were in the original text,
have unrecognized code points replaced with another string, i.e. the string has placeholders for some of the characters in the original text, or
get an error.
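These three outcomes correspond to the three java.nio.charset.CodingErrorAction values. The sketch below uses a made-up helper, decodeUtf8, and a deliberately invalid three-byte input to show each of them.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction
import java.nio.charset.StandardCharsets

// Hypothetical helper: decodes `bytes` as UTF-8, handling invalid input according to `action`.
fun decodeUtf8(bytes: ByteArray, action: CodingErrorAction): String =
    StandardCharsets.UTF_8.newDecoder()
        .onMalformedInput(action)
        .onUnmappableCharacter(action)
        .decode(ByteBuffer.wrap(bytes))
        .toString()

fun main() {
    // 'A', then 0xFF (never valid in UTF-8), then 'B'
    val bytes = byteArrayOf(0x41, 0xFF.toByte(), 0x42)

    println(decodeUtf8(bytes, CodingErrorAction.IGNORE))  // AB (the bad byte is silently dropped)
    println(decodeUtf8(bytes, CodingErrorAction.REPLACE)) // A, then U+FFFD as a placeholder, then B
    try {
        decodeUtf8(bytes, CodingErrorAction.REPORT)
    } catch (e: CharacterCodingException) {
        println("error: $e")                              // decoding fails with an exception
    }
}
```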

The correct way to convert byte arrays to strings is to:

Using the byte array, determine the character set that was used to encode the original string. If possible, this character set should be known (e.g. JSON) or explicit (e.g. XML), so you do not have to guess. If the character set used to encode the original string is not specified, you can use heuristics to guess 1 or more character sets.
Use the character set found in the previous step, to decode the byte array.
If a failure occurs (make sure you use the correct java.nio.charset.CodingErrorAction), and you have multiple candidate character sets, try the next one.
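The steps above can be sketched as follows. The function name decodeWithFirstMatch and the candidate order are assumptions made for illustration; where the candidates come from (a declared charset or a heuristic) is left open. Strict candidates such as UTF-8 should come first, because an 8-bit charset will accept almost anything.

```kotlin
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

// Tries each candidate in order with strict error reporting and returns the
// first successful decoding, or null if every candidate fails.
fun decodeWithFirstMatch(bytes: ByteArray, candidates: List<Charset>): Pair<Charset, String>? {
    for (candidate in candidates) {
        val decoder = candidate.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
        try {
            return candidate to decoder.decode(ByteBuffer.wrap(bytes)).toString()
        } catch (e: CharacterCodingException) {
            // Not valid in this charset; continue with the next candidate.
        }
    }
    return null
}

fun main() {
    val bytes = "Москва".toByteArray(charset("Windows-1251"))
    val result = decodeWithFirstMatch(bytes, listOf(Charsets.UTF_8, charset("Windows-1251")))
    println(result?.second) // Москва (UTF-8 rejected the bytes, Windows-1251 accepted them)
}
```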

Note that the above is for an ideal world where everybody correctly converts between byte arrays and strings. Unfortunately this is not always the case.

gidds · December 10, 2020, 11:22pm

What jstuyts said!

With the proviso that if your program has to use heuristics (i.e. guess), sooner or later it’ll guess wrong. And the effects of that could be subtle enough that you don’t notice for a while.

Far better to know (hence the Joel Spolsky reference). That usually means that you must either have pre-arranged the character encoding, or have a way to indicate it separately from the bytes themselves.

(As I said, the only common case where you can detect the encoding with reasonable confidence is UTF-8. But even that’s not 100%.)

vhodek · December 15, 2020, 10:07am

I would definitely recommend always using UTF-8 or Unicode or something similar that can also handle characters missing from other alphabets, because sooner or later you will run into “mostly Latin but with a short Arabic sentence” or similar.

However, the world is not perfect and I was already in a similar situation:

Apache Tika can do a lot of things: CharsetDetector (Apache Tika 1.3 API)
I would recommend, based on my own experience: Google Code Archive - Long-term storage for Google Code Project Hosting.

jstuyts · December 15, 2020, 10:48am

Do not treat UTF-8 and Unicode as the same thing. They are different:

Unicode is a specification that maps characters (plus emojis, special instructions, etc.) to numbers: Unicode code points. But it does not specify how those code points are encoded as bytes.
UTF-8 specifies how the Unicode code points are encoded as bytes. UTF-8 can encode all Unicode code points, i.e. no loss when converting to and then from bytes.
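As a small illustration of that distinction, the sketch below takes one code point and prints the bytes that different encodings produce for it. The hex helper is made up for this example, and Windows-1251 is assumed to be available on the JVM in use.

```kotlin
// Hypothetical helper: formats bytes as space-separated uppercase hex.
fun hex(bytes: ByteArray): String = bytes.joinToString(" ") { "%02X".format(it) }

fun main() {
    val text = "П" // a single Unicode code point, U+041F

    println(hex(text.toByteArray(Charsets.UTF_8)))          // D0 9F
    println(hex(text.toByteArray(Charsets.UTF_16BE)))       // 04 1F
    println(hex(text.toByteArray(charset("Windows-1251")))) // CF

    // Encoding to UTF-8 and decoding again returns the original text
    println(String(text.toByteArray(Charsets.UTF_8), Charsets.UTF_8) == text) // true
}
```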

It is preferable to use an encoding that can encode all Unicode code points, and UTF-8 is the de facto standard.

Some languages store Unicode strings as UTF-8 in memory. Others (like Java) use UTF-16. You should get to know how the language you are using represents code points in memory. Depending on the use case you may have to handle bytes (very unlikely), UTF-8 values (very unlikely), UTF-16 values (sometimes you have to in Java because Unicode grew after the decision to use 16 bits per code point was made, and you may have to handle code points greater than 65,535), or code points.

If you don’t do any text encoding, decoding or processing, but just want to output some text, you usually do not have to worry about it. You can simply use them as Unicode strings.

vhodek · December 15, 2020, 11:06am

You’re completely right! I wanted to write UTF-16 instead of Unicode, of course.
