Most efficient data structure for serialisation to disk?

Hi

First day with Kotlin, and I’m hoping you kind folks can save me a few hours of research…

I need to serialise large hashmaps of trading bars to disk - many millions of records - for reading into a backtesting engine. The maps will be keyed with a Long timestamp. It’s a write-once, read-often operation.

I’ll be running some benchmarks, but the FastBinaryEncoding library looks promising for the serialization, unless anyone has any better suggestions?

https://chronoxor.github.io/FastBinaryEncoding/

The records contain a number of basic data types for prices, volume etc. Ideally also enums, if the cost is not too high, but I could easily use a numeric index instead.

The records are immutable and dumb data containers. I don’t need any of the facilities of the Kotlin Data Class, and fear it is too heavy for this use-case.

In a systems language I would be using a bare tuple or struct for something like this, but as a newbie I’m not at all clear how I should tackle this in Kotlin from the point of view of maximising read performance.
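For concreteness, the kind of flat record I have in mind looks something like this (purely illustrative, not my actual schema - and I’ve written it as a data class only for clarity; whether that’s the right representation is exactly my question):

// Illustrative only: one bar per record, keyed externally by its Long timestamp.
data class Bar(
    val timestamp: Long,
    val open: Double,
    val high: Double,
    val low: Double,
    val close: Double,
    val volume: Long,
)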

Any advice that would short-cut the learning process would be much appreciated!

Do you plan to read the whole database at once and then use the data from memory? Or do you need a way to read the records directly from the file, seeking through it?

I would be cautious when reading the performance comparison on the FBE website. It is hard to believe that it is 10 times faster than Protocol Buffers. They should provide separate benchmarks for different languages, as it is possible it is faster in C but has the same performance in Kotlin/Java.

Also, this isn’t a very popular project: 500 stars on GitHub compared to 50k stars for protobuf. I’m not saying stars are a direct indication of quality, but I think this is important.

When searching for a solution, please remember that you don’t need to search specifically for Kotlin. It doesn’t look like you need Kotlin-specific support, and you are much more likely to find something useful when searching for Java.


Thanks for the response.

I have the luxury of being able to read all the data into memory, so don’t need to worry about streaming it in.

I’ve used Protobuf in the past and was planning to benchmark it, and as you say, I don’t have to restrict myself to Kotlin-specific libraries.

But to be honest, my main issue isn’t the serialisation - that’s a track I’ve been around often enough before. It’s the lack of any obvious lightweight data structure like a struct or tuple that has me a bit baffled.

If anyone can suggest the most economical way to represent dumb data for serialisation, I’d sure appreciate the help!

I never really struggled with such a problem, so I don’t have any experience with it. But as you said, I believe Java does not provide any memory-efficient way to store tens or hundreds of millions of objects. At least not until value objects (JEP 169) are completed.

The only possible trick I see here is to create a big primitive array and store all records in it (so a single array, not an array per record). If you can keep the records a fixed length, it should be pretty easy to implement. Still, mapping them by timestamp seems tricky. If they are ordered by timestamp, then maybe a binary search would be sufficient instead of a hash map?
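A minimal sketch of that idea, assuming a hypothetical bar with just a couple of primitive fields, stored column-wise (one primitive array per field) and looked up by binary search over the sorted timestamps:

import java.util.Arrays

// Column-oriented storage: one primitive array per field instead of one object per bar.
// Assumes bars are added in ascending timestamp order.
class BarColumns(capacity: Int) {
    val timestamps = LongArray(capacity)
    val closes = DoubleArray(capacity)
    val volumes = LongArray(capacity)
    var size = 0
        private set

    fun add(timestamp: Long, close: Double, volume: Long) {
        timestamps[size] = timestamp
        closes[size] = close
        volumes[size] = volume
        size++
    }

    // Binary search over the sorted timestamp column; returns the row index, or -1 if absent.
    fun indexOf(timestamp: Long): Int {
        val i = Arrays.binarySearch(timestamps, 0, size, timestamp)
        return if (i >= 0) i else -1
    }
}

Reading a bar is then just indexing into each column at the returned row.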

Also, did you consider using embeddable databases like H2, SQLite, HSQLDB or Apache Derby? Your problem seems like something that fits well into their tabular, relational nature, and they’re pretty efficient CPU- and memory-wise. Maybe some of them even support working in-memory while opening from and persisting to a file.
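As a very rough sketch of what that could look like with H2 over plain JDBC (assuming the com.h2database:h2 driver is on the classpath; the table and column names are purely illustrative):

import java.sql.DriverManager

fun main() {
    // File-backed H2 database; the driver registers itself, so plain JDBC is enough.
    DriverManager.getConnection("jdbc:h2:./bars").use { conn ->
        conn.createStatement().use { st ->
            st.execute("CREATE TABLE IF NOT EXISTS bar(ts BIGINT PRIMARY KEY, close DOUBLE, volume BIGINT)")
        }
        // Write-once: insert a bar keyed by its timestamp.
        conn.prepareStatement("INSERT INTO bar(ts, close, volume) VALUES (?, ?, ?)").use { ps ->
            ps.setLong(1, 1_600_000_000_000L)
            ps.setDouble(2, 101.25)
            ps.setLong(3, 5_000L)
            ps.executeUpdate()
        }
        // Read-often: look a bar up by timestamp.
        conn.prepareStatement("SELECT close, volume FROM bar WHERE ts = ?").use { ps ->
            ps.setLong(1, 1_600_000_000_000L)
            ps.executeQuery().use { rs ->
                if (rs.next()) println("close=${rs.getDouble(1)} volume=${rs.getLong(2)}")
            }
        }
    }
}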

Unfortunately I need to use a number of different primitive types, so any solution with arrays is going to be too complicated for my taste.

I’ve not made a trading engine before, but the consensus among people who have tried is that simple serialised files are the best solution for data management. Most of the low-end commercial solutions use files as well. The high-end institutions use databases, but the kind that cost north of $100k a year, so that’s a different world.

Seems that I’ll just have to suck it and see with the Data Class. I just wanted to make sure I wasn’t overlooking something obvious. A lot of people make performant trading software with Java, so maybe I’m over-estimating the performance penalty of using a full-blown class for this. Some benchmarking will tell me soon enough, I guess.

It really depends on whether you mean a CPU penalty or a memory penalty. The CPU penalty of using objects is… well, none? If you need to iterate over such a data structure, then I guess an array of primitives would utilize the CPU cache better than an array of objects. But other than that, I don’t see why objects would decrease performance.

Memory is another thing, and in such use cases it may be a problem for two reasons. First, Java has quite a big memory overhead for a large number of small objects; usually the overhead is 16 bytes per object. Second, if you allocate and deallocate objects very frequently, the garbage collector will often freeze your application (though this may have improved, as recent versions of OpenJDK often mention GC improvements). But since you mostly read the data, I guess it should not be a problem.

I’m more concerned about hoisting data off the disk and getting it into memory. Seems to me that a full-featured object is going to take up more disk space, which makes it slower to read, and also take more cycles to de-serialize. I’m going to be doing that for millions of bars per run, so it’s a potential bottleneck.

The JVM memory penalty is not ideal, but I could live with that if things are running quickly. Developing for the JVM is going to take much less time than using a gnarly systems language.

But as I say, I guess the only way I’ll truly find out how it works is to give it a try if I can find time tomorrow. I just wanted to make sure that using a Data Object was the best way to go before I made a prototype.

I’m not sure why serializing an object would take more disk space than e.g. a tuple. You serialize the data of the object, not the object itself.

I think you should first measure the performance of simple solutions and optimize only if needed. I just created a simple benchmark: a data class with an Int and a Long field, with 10,000,000 such objects serialized to disk, once with a naive DataOutputStream implementation and once with kotlinx.serialization’s protobuf format. The first took 480ms, the second 1980ms. The disk may be a bottleneck here, as the files written were ~120MB, which is what we’d expect: (4B + 8B) * 10_000_000 = 120MB. The code is below if you’re interested:

@file:OptIn(ExperimentalSerializationApi::class)

import java.io.DataOutputStream
import java.io.File
import kotlin.system.measureTimeMillis
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.Serializable
import kotlinx.serialization.builtins.ListSerializer
import kotlinx.serialization.protobuf.ProtoBuf

@Serializable
data class Item(
    val index: Int,
    val value: Long,
)

fun main() {
    val items = (0 until 10_000_000)
        .map { Item(it, it.toLong() * 2) }

    val time1 = measureTimeMillis {
        File("raw.bin")
            .outputStream()
            .buffered()
            .let { DataOutputStream(it) }
            .use { out ->
                items.forEach {
                    out.writeInt(it.index)
                    out.writeLong(it.value)
                }
            }
    }
    println(time1)

    val time2 = measureTimeMillis {
        val data = ProtoBuf.encodeToByteArray(ListSerializer(Item.serializer()), items)
        File("protobuf.bin").writeBytes(data)
    }
    println(time2)
}

Brilliant - thank you! That will speed up my prototype.

Actually, as I’ve said, it’s read time that’s the issue - write time is once only and a side-concern.

But your timings are very promising. As I’ve been saying, often with these issues the only way to resolve them is to try.

Was just watching a series on language speed by a retired senior Microsoft developer. Someone who started in the days of Assembly and really knows his low-level stuff. He tried an optimisation with C++ that “shouldn’t” work. Huge speed up. If even guys like that can be surprised - us humble souls just have to run the benchmarks :upside_down_face:

PS - one trick I learned from another community is that zipping large binary files before saving, rather than streaming them to disk uncompressed, can lead to surprising improvements in read performance.


Deserializing takes just a little longer: 1500ms for DataInputStream and 2300ms for protobuf. If I run it twice in a single run, DataInputStream decreases to 880ms. Anyway, results in production could be even better, as the JVM optimizes the code at runtime.

Of course, timings will be higher for your real case, as there will be more fields/properties, but they should increase more or less linearly.

val (items3, time3) = measureTimedValue {
    File("raw.bin")
        .inputStream()
        .buffered()
        .let { DataInputStream(it) }
        .use { input ->
            (0 until 10_000_000)
                .map { Item(input.readInt(), input.readLong()) }
        }
}
println("items3[5000]: ${items3[5000]}")
println(time3)

val (items4, time4) = measureTimedValue {
    val data = File("protobuf.bin").readBytes()
    ProtoBuf.decodeFromByteArray(ListSerializer(Item.serializer()), data)
}
println("items4[5000]: ${items4[5000]}")
println(time4)

@Scotpip please consider using JMH for your tests.
I also suggest you run a benchmark with a compressed file: you can try GZIPInputStream and remove the buffered() call (GZIPInputStream is already buffered).
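Something like this, roughly, reusing the Int + Long layout from the earlier benchmark (just a sketch of the wiring, not a measured result):

import java.io.DataInputStream
import java.io.DataOutputStream
import java.io.File
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream
import kotlin.system.measureTimeMillis

fun main() {
    // Write the same Int + Long pairs as before, but through a gzip stream.
    // A buffer sits between DataOutputStream and the gzip stream so the many small writes are batched.
    DataOutputStream(GZIPOutputStream(File("raw.bin.gz").outputStream()).buffered()).use { out ->
        for (i in 0 until 10_000_000) {
            out.writeInt(i)
            out.writeLong(i.toLong() * 2)
        }
    }

    // Read it back without an extra buffered() wrapper, as suggested above.
    val readTime = measureTimeMillis {
        DataInputStream(GZIPInputStream(File("raw.bin.gz").inputStream())).use { input ->
            var checksum = 0L
            repeat(10_000_000) {
                checksum += input.readInt()
                checksum += input.readLong()
            }
            println(checksum)
        }
    }
    println("gzip read: ${readTime}ms")
}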


Also check out the MapDB project on GitHub. It’s written in Kotlin, too.
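If it helps, a minimal MapDB sketch along those lines (assuming the org.mapdb:mapdb dependency; the map name and value layout are just placeholders, storing only a close price per timestamp):

import org.mapdb.DBMaker
import org.mapdb.Serializer

fun main() {
    // Open (or create) a file-backed database; memory mapping can help read speed on 64-bit JVMs.
    val db = DBMaker.fileDB("bars.db").fileMmapEnable().make()

    // A persistent map keyed by Long timestamp, with a Double close price as the value.
    val closes = db.hashMap("closes", Serializer.LONG, Serializer.DOUBLE).createOrOpen()

    closes[1_600_000_000_000L] = 101.25
    println(closes[1_600_000_000_000L])

    db.close()
}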