Large counting map that needs to be trimmed to not run out of memory?

tl;dr: What is the best practice for keeping a very large Map<String,Int> of counts from running out of memory?

I’m counting English n-grams from the Google Books project (about 30 GB of CSV files), and thanks to Kotlin it is very easy to read the files, split each line into fields, count up the n-grams, and get the top 5000 by count. Yay!

However, when I try to run this on ALL the files, it runs out of memory because the counting map gets huge – too many keys. So I was wondering: is there either a better data structure for this that takes a fraction of the memory, or a better way to trim out the low counts every few hundred thousand map edits? Even with 20 GB of memory this quickly starts to get cranky, and I feel like it shouldn’t…

My initial thought was to override the map setter, to self-trim every so often, but that seemed… hacky.

val justWords = "[a-z][a-z']+ [a-z][a-z']+ [a-z][a-z']+".toRegex(RegexOption.IGNORE_CASE)
val counts = File("data").walk().filter { file ->
    file.isFile // ignore directories
}.flatMap { file ->
    println(file.toString())
    file.inputStream().bufferedReader().lineSequence()
}.map { line ->
    line.split("\t") // ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
}.filter {
    it[1].toInt() >= 1970 // publication year
}.filter {
    justWords.matches(it[0])
}.take(100_000_000).groupBy { it[0] }.mapValues { (_, line) ->
    line.sumBy { it[2].toInt() }
}.toList().sortedWith(compareBy({ -it.second }, { it.first })).take(5000).toMap()

It is hard to understand what actually happens inside those chained calls, but I think the problem is that the Sequence interface you are using is not guaranteed to be lazy: it could accumulate intermediate results somewhere and not free them. I would try doing the same thing with a Java 8 Stream, which has very similar syntax. Streams are guaranteed to be lazy and will therefore probably free memory along the way (I am not sure, though).
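Untested, but the Stream version of the counting part might look roughly like this (the path parameter and the 1970 cutoff just mirror the original pipeline; I have not measured whether it actually uses less memory):

import java.nio.file.Files
import java.nio.file.Paths
import java.util.stream.Collectors

// Rough sketch only: Files.lines() is a lazy Stream<String>, and the grouping and
// summing happen inside the terminal collect, so nothing but the resulting
// count map should be retained.
fun countWithStreams(path: String): Map<String, Int> =
    Files.lines(Paths.get(path)).use { lines ->
        lines.map { it.split("\t") }
            .filter { it[1].toInt() >= 1970 }
            .collect(Collectors.groupingBy(
                { row: List<String> -> row[0] },
                Collectors.summingInt { row: List<String> -> row[2].toInt() }
            ))
    }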

A much simpler version would be

  1. For all 3-word trigrams in all books
  2. Keep a count of the times the trigram was seen

But #2 blows up because the number of possible trigrams is huge (around 5^15), so I need some way to say “And occasionally drop the low counts so I don’t run out of memory half-way.”
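A rough, untested sketch of that “occasionally drop the low counts” idea (the class name, trim interval, and keep threshold are all arbitrary, and anything trimmed early can never come back, so the result is only approximate):

// Hypothetical helper, not from any library: counts n-grams in a plain HashMap
// and periodically throws away the rare ones so the map stays bounded-ish.
class SelfTrimmingCounter(
    private val trimEvery: Int = 500_000,
    private val minCountToKeep: Long = 3
) {
    private val counts = HashMap<String, Long>()
    private var editsSinceTrim = 0

    fun add(ngram: String, amount: Long) {
        counts.merge(ngram, amount, Long::plus)
        if (++editsSinceTrim >= trimEvery) {
            counts.values.removeIf { it < minCountToKeep }  // drop the rare entries
            editsSinceTrim = 0
        }
    }

    fun top(n: Int): List<Pair<String, Long>> =
        counts.entries
            .sortedWith(compareBy({ -it.value }, { it.key }))
            .take(n)
            .map { it.key to it.value }
}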

If you actually need to store huge amounts of data (I’ve assumed it is just a GC problem), smart data structures won’t help you. You will have to limit the amount of data or dump some intermediate data to the hard drive.
Something could be done by using a ByteBuffer or other buffers as intermediate storage, but I do not think that is a good solution in your case.
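To make the “dump intermediate data to the hard drive” idea concrete, here is one rough, untested way it could look (the class, file naming, and spill threshold are all made up). Each spill file is written sorted by key, so at the end the files can be merged with bounded memory; it is essentially a tiny external merge sort:

import java.io.File
import java.util.PriorityQueue

class SpillingCounter(private val spillDir: File, private val maxEntries: Int = 5_000_000) {
    private val counts = HashMap<String, Long>()
    private var spillIndex = 0

    fun add(ngram: String, amount: Long) {
        counts.merge(ngram, amount, Long::plus)
        if (counts.size >= maxEntries) spill()  // bound the in-memory map
    }

    private fun spill() {
        if (counts.isEmpty()) return
        File(spillDir, "spill-${spillIndex++}.tsv").bufferedWriter().use { out ->
            // sorted by key so the files can be merged sequentially later
            counts.entries.sortedBy { it.key }.forEach { out.write("${it.key}\t${it.value}\n") }
        }
        counts.clear()
    }

    // K-way merge of the sorted spill files: only one pending line per file plus
    // a bounded min-heap of the current best n entries is in memory at once.
    fun top(n: Int): List<Pair<String, Long>> {
        spill()
        val readers = spillDir.listFiles { f -> f.name.startsWith("spill-") }!!
            .map { it.bufferedReader() }
        val heads = PriorityQueue<Triple<String, Long, Int>>(compareBy { it.first })
        fun advance(i: Int) {
            readers[i].readLine()?.let { line ->
                val (key, count) = line.split("\t")
                heads.add(Triple(key, count.toLong(), i))
            }
        }
        readers.indices.forEach { advance(it) }

        val best = PriorityQueue<Pair<String, Long>>(compareBy { it.second })
        while (heads.isNotEmpty()) {
            val (key, firstCount, fileIdx) = heads.poll()
            var total = firstCount
            advance(fileIdx)
            while (heads.isNotEmpty() && heads.peek().first == key) {  // same key from other files
                val (_, count, otherIdx) = heads.poll()
                total += count
                advance(otherIdx)
            }
            best.add(key to total)
            if (best.size > n) best.poll()  // keep only the n largest totals
        }
        readers.forEach { it.close() }
        return best.sortedByDescending { it.second }
    }
}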

I’ve tried to understand what the code does here. The last line is clearly wrong:
toList().sortedWith(compareBy({ -it.second }, { it.first })).take(5000).toMap(). You create a huge in-memory list, then sort it, then take some entries and copy them into a map. You should use the sortedBy/sortedWith methods on the sequence and then convert the sequence to a map directly.
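Untested, but that suggested tail would look something like this (toy data just to keep it self-contained; note that sorting still has to see every entry, so the saving is mainly the extra intermediate list):

fun main() {
    // Toy stand-in for the ngram -> total count map built earlier in the pipeline.
    val totals = mapOf("one two three" to 7, "a b c" to 42, "x y z" to 42)

    val top = totals.asSequence()
        .sortedWith(compareBy({ -it.value }, { it.key }))
        .take(2)  // 5000 in the real pipeline
        .associate { it.key to it.value }

    println(top)  // {a b c=42, x y z=42}
}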

groupBy is not lazy. I believe you can use groupingBy for that.


Yes please. I read “How to sum multiple elements after grouping in Kotlin” on Stack Overflow but wasn’t sure how to use it here.

.take(500_000_000).map { line ->
    val (ngram, year, match_count, _, _) = line.split("\t")
    Triple(ngram, year.toInt(), match_count.toInt())
}.filter { (ngram, year, _) ->
    year >= 1970 && !ngram.contains('.') && !ngram.contains('"')
}.groupingBy { (ngram, _, _) ->
    ngram
}.aggregate { (ngram, _, ???, _) ->
    ???
}

I can’t test right now, but I think this does what you want: .groupingBy { it[0] }.eachCount()

Really close - I think that would count the number of rows with that ngram, whereas I’m looking for (pseudocode) .groupBy { it[0] }.eachCountSumOf(it[2])

.groupingBy { it[0] }
.aggregate<List<String>, String, Int> { key, accumulator, element, first ->
    // accumulator is null the first time a key is seen
    (accumulator ?: 0) + element[2].toInt()
}

It may be more convenient to use fold in this case:

fun main(args: Array<String>) {
    val elements = sequenceOf(
        Triple("AAA", 1979, 12),
        Triple("ABA", 1999, 10),
        Triple("AAA", 2000, 15)
    )

    elements
        .groupingBy { (ngram, _, _) -> ngram }
        .fold(0) { acc, (_, _, count) -> acc + count }
        .let { println(it) }  // prints {AAA=27, ABA=10}
}

The short answer: grouping needs to load all the data if you have no index and want to do it in memory.

Sorry to be late to the party on this question. I had a similar problem loading massive log files where I was indexing something. My solution was deliberately imperfect: I would first put every entry into a weak map, where the keys could be garbage-collected automatically. Once an entry’s count reached some threshold, I would copy the key and its value to a strong map so that the entry was definitely preserved until the end of the counting. That way the weak map would be periodically trashed by the garbage collector, but the common entries would be captured in the strong map for the result.

For my purposes it was OK to make mistakes. For yours, you would need to consider whether an odd distribution of n-grams in your corpus is likely enough to invalidate your answer.
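A rough, untested sketch of that idea (the class name and promotion threshold are made up; it also relies on the n-gram strings being fresh objects from the file reader, not interned or held anywhere else, or the weak map would never drop them):

import java.util.WeakHashMap

class WeakMapCounter(private val promoteAt: Int = 100) {
    private val weakCounts = WeakHashMap<String, Int>()  // GC may drop rare keys
    private val strongCounts = HashMap<String, Int>()     // promoted keys are kept

    fun add(ngram: String, amount: Int) {
        val promoted = strongCounts[ngram]
        if (promoted != null) {
            strongCounts[ngram] = promoted + amount
            return
        }
        val updated = (weakCounts[ngram] ?: 0) + amount
        if (updated >= promoteAt) {
            weakCounts.remove(ngram)
            strongCounts[ngram] = updated  // no longer at the GC's mercy
        } else {
            weakCounts[ngram] = updated
        }
    }

    // Only the promoted entries are reliable; anything still in the weak map may
    // have been partially or completely forgotten.
    fun top(n: Int): List<Pair<String, Int>> =
        strongCounts.entries
            .sortedByDescending { it.value }
            .take(n)
            .map { it.key to it.value }
}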

You might instead want something like an approximate histogram or a similar “big data” approach; see “Effective Management of High Volume Numeric Data with Histograms” (Circonus) and the “A collection of links for streaming algorithms and data structures” gist on GitHub.

I don’t know much about them but I know they exist.
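One well-known member of that family is the “space-saving” heavy-hitter summary, which fits the “top 5000” goal well: it keeps a fixed number of counters no matter how many distinct n-grams go by, at the cost of the counts being overestimates. A rough, untested sketch (my own, not from those links; the capacity is arbitrary):

class SpaceSavingCounter(private val capacity: Int) {
    private val counts = HashMap<String, Long>(capacity * 2)

    fun add(ngram: String, amount: Long) {
        val existing = counts[ngram]
        when {
            existing != null -> counts[ngram] = existing + amount
            counts.size < capacity -> counts[ngram] = amount
            else -> {
                // Evict the smallest counter and let the newcomer inherit its count;
                // that inherited count is an upper bound on the overestimation error.
                val minKey = counts.entries.minByOrNull { it.value }!!.key
                val minCount = counts.remove(minKey)!!
                counts[ngram] = minCount + amount
            }
        }
    }

    fun top(n: Int): List<Pair<String, Long>> =
        counts.entries
            .sortedByDescending { it.value }
            .take(n)
            .map { it.key to it.value }
}

A real implementation keeps the counters in a structure where the minimum is cheap to find (the linear scan here is just to keep the sketch short), and you would want the capacity to be quite a bit larger than the 5000 you finally report.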