tl;dr: What is the best practice for a very large Map<String, Int> of counts so I don’t run out of memory?
I’m counting a large number of English n-grams (30 GB of CSV files) from the Google Books project, and thanks to Kotlin it is very easy to read the files, split them into fields, count up the n-grams, and get the top 5000 by count. Yay!
However, when I try to run this on ALL the files, it runs out of memory because the counting map gets huge – too many keys. So I was wondering: is there either a better data structure for this that takes a fraction of the memory, or a better way to trim out the low counts every few hundred thousand map updates? Even with 20 GB of memory this quickly starts to get cranky, and I feel like it shouldn’t…
My initial thought was to override the map’s put method so it self-trims every so often, but that seemed… hacky.
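Something like this self-trimming wrapper is roughly what I had in mind (the thresholds are just placeholders, and the result is of course approximate, since a trimmed n-gram starts counting again from zero):

```kotlin
class TrimmingCounter(
    private val trimEvery: Int = 500_000,  // trim after this many updates
    private val minCount: Int = 2          // counts below this get dropped
) {
    private val counts = HashMap<String, Int>()
    private var updates = 0L

    fun add(ngram: String) {
        counts.merge(ngram, 1, Int::plus)
        if (++updates % trimEvery == 0L) trim()
    }

    private fun trim() {
        // removeIf on the values view mutates the map in place; this mostly
        // throws away hapax legomena that will never reach the top 5000.
        counts.values.removeIf { it < minCount }
    }

    fun top(n: Int): Map<String, Int> =
        counts.entries
            .sortedWith(compareBy({ -it.value }, { it.key }))
            .take(n)
            .associate { it.key to it.value }
}
```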
It is hard to tell what actually happens inside those chained calls, but I think the problem is that the Sequence interface you are using is not guaranteed to be truly lazy: it could accumulate intermediate results somewhere and not free them. I would try doing the same thing with a Java 8 Stream, which has very similar syntax. Streams are guaranteed to be lazy and will therefore probably free memory in between (I am not sure, though).
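Very roughly, the counting part could look like this; for the sake of the sketch I assume each line of the file is already a single n-gram, which may not match your actual format:

```kotlin
import java.nio.file.Files
import java.nio.file.Paths
import java.util.stream.Collectors

// Files.lines gives a lazily populated Stream, so only the resulting counts
// map has to fit in memory, not any intermediate collection.
fun countWithStream(path: String): Map<String, Long> =
    Files.lines(Paths.get(path)).use { lines ->
        lines.collect(Collectors.groupingBy({ ngram: String -> ngram }, Collectors.counting()))
    }
```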
But #2 blows up because the possible number of trigrams is 5^15, so I need some way to say “…and occasionally drop the low counts so I don’t run out of memory half-way.”
If you actually need to store huge amounts of data (so far I have assumed it was just a GC problem), smart data structures won’t help you. You will have to limit the amount of data, or eventually dump some intermediate data to the hard drive.
Something could be done by using a ByteBuffer or other buffers as intermediate storage, but I do not think it is a good solution in your case.
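If you do go the disk route, a very rough sketch could look like the following; the spill threshold, the tab-separated file layout, and all the helper names are made up for illustration:

```kotlin
import java.io.File
import java.util.PriorityQueue

const val SPILL_AT = 2_000_000  // spill once the in-memory map holds this many keys
const val TOP_K = 5000

fun countWithSpills(ngrams: Sequence<String>): Map<String, Long> {
    val spills = mutableListOf<File>()
    val counts = HashMap<String, Long>()

    fun spill() {
        val f = File.createTempFile("ngram-spill-", ".tsv")
        f.bufferedWriter().use { w ->
            // Write key-sorted so the files can later be merged with bounded memory.
            counts.entries.sortedBy { it.key }.forEach { w.write("${it.key}\t${it.value}\n") }
        }
        spills += f
        counts.clear()
    }

    for (ngram in ngrams) {
        counts.merge(ngram, 1L, Long::plus)
        if (counts.size >= SPILL_AT) spill()
    }
    if (counts.isNotEmpty()) spill()

    // K-way merge of the key-sorted spill files, keeping only the TOP_K
    // largest totals in a min-heap.
    class Head(val key: String, val count: Long, val rest: Iterator<String>)
    fun parse(line: String, rest: Iterator<String>): Head {
        val tab = line.indexOf('\t')
        return Head(line.substring(0, tab), line.substring(tab + 1).toLong(), rest)
    }

    val readers = spills.map { it.bufferedReader() }
    val heads = PriorityQueue<Head>(compareBy<Head> { it.key })
    for (r in readers) {
        val iter = r.lineSequence().iterator()
        if (iter.hasNext()) heads += parse(iter.next(), iter)
    }

    val top = PriorityQueue<Pair<String, Long>>(compareBy<Pair<String, Long>> { it.second })
    fun offer(k: String, total: Long) {
        top += k to total
        if (top.size > TOP_K) top.poll()
    }

    var key: String? = null
    var total = 0L
    while (heads.isNotEmpty()) {
        val h = heads.poll()
        if (h.key != key) {
            key?.let { offer(it, total) }
            key = h.key
            total = 0L
        }
        total += h.count
        if (h.rest.hasNext()) heads += parse(h.rest.next(), h.rest)
    }
    key?.let { offer(it, total) }
    readers.forEach { it.close() }

    return top.sortedByDescending { it.second }.toMap()
}
```

Because each spill file is key-sorted, the final merge only ever holds one line per file plus the small top-k heap in memory.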
I’ve tried to understand what the code does here. The last line is clearly wrong: toList().sortedWith(compareBy({ -it.second }, { it.first })).take(5000).toMap(). You create a huge in-memory list, then sort it, then take some entries and copy them into a map. You should use the sorted methods of Sequence instead and then convert the sequence to a map directly.
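Something along these lines, where counts stands for your accumulated map and 5000 is the cutoff from your code (note that sorting a sequence is still a stateful operation, so the entries are buffered once, but you avoid the extra intermediate lists):

```kotlin
// Sort the map entries by descending count, then by key, and keep the top 5000.
val top5000: Map<String, Int> = counts.asSequence()
    .sortedWith(compareBy({ -it.value }, { it.key }))
    .take(5000)
    .associate { it.key to it.value }
```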
Sorry to be late to the party on this question. I had a similar problem loading massive log files that I was indexing. My solution was an imperfect one: I would first put every entry into a weak map, where the keys are garbage-collected automatically. At some threshold I would copy the key and its value into a strong map, so that that entry was definitely preserved until the end of the counting. That way the weak map was periodically trashed by the garbage collector, but the common entries were captured in the strong map for the result.
For my purposes, it was OK to make mistakes. For yours, you would need to consider whether a very odd distribution of n-grams in your corpus is at all likely and would invalidate your answer.
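A rough sketch of the idea, reading “at some threshold” as a per-key count threshold; the threshold value and the class name are just illustrative:

```kotlin
import java.util.WeakHashMap

class TwoTierCounter(private val promoteAt: Int = 100) {
    // Keys here are only weakly referenced, so rare n-grams can be reclaimed
    // by the GC whenever memory gets tight.
    private val weak = WeakHashMap<String, Int>()
    // Anything promoted here is kept until the end of the run.
    private val strong = HashMap<String, Int>()

    fun add(ngram: String) {
        val kept = strong[ngram]
        if (kept != null) {
            strong[ngram] = kept + 1
            return
        }
        val n = (weak[ngram] ?: 0) + 1
        if (n >= promoteAt) {
            weak.remove(ngram)
            strong[ngram] = n
        } else {
            // The keys must be fresh String instances; interned or otherwise
            // strongly referenced keys would never be collected.
            weak[ngram] = n
        }
    }

    fun top(n: Int): Map<String, Int> =
        strong.entries
            .sortedWith(compareBy({ -it.value }, { it.key }))
            .take(n)
            .associate { it.key to it.value }
}
```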