I have written the following function to find a histogram median. The loop with an external running total looks ugly compared to other more kotlin-esque code. Is there a way to make it more ‘elegant’ without compromising performance?

I suspect that there is a function on collections that might do this, but I am not sure which. I would like to be told of it and how to use it… that is the way I learn.

/**
* Assuming that the receiver is a histogram where the value at each index is the count of
* occurrences of the index in the data set, return the median value of the dataset.
*/
public fun IntArray.findHistogramMedian(): Int {
val midCount = sum() / 2
var total = 0
forEachIndexed { index, i ->
total += i
if (total >= midCount) return index
}
return lastIndex
}

Thanks to MikibeMiki for suggesting foldIndexed. I have never used folding before, perhaps because I have not perceived a need… A lot cleaner, and no slower:

/**
* Assuming that the receiver is a histogram where the value at each index is the count of
* occurrences of the index in the data set, return the median value of the dataset.
*/
public fun IntArray.findHistogramMedian(): Int {
val midCount = sum() / 2
foldIndexed(0) {index, total, count ->
if (total >= midCount) return index
total + count
}
return lastIndex
}

fun IntArray.findHistogramMedian2(): Int {
val sums = runningReduce(Int::plus)
val midCount = sums.last() / 2
return sums.indexOfFirst { it >= midCount }
}

It is shorter and at least to me easier to read and we don’t have to jump out of the fold operation. However, it creates an additional list.

Also, I’m not sure if this is the correct way to calculate the median. Note that for e.g. [1, 1, 1] it returns 0 which is clearly incorrect. I think you need to handle two cases separately: odd and even number of items.

Thanks Broot, Yes that is certainly shorter and clearer, but I worry about the extra list. It is pretty much what I did first. This code is invoked in a median filter in an image so the code is best being as fast as we can make it.

You may also be right about the odd/even in medians. I think that the median in a set of numbers of even size can fall between two dissimilar values and may not be an integer… But I need to assign the value to an integer array, so am going to have to round somewhere somehow. Seems to work better (and faster) than a gaussian in its present form but I am always open to doing things better.

Though you do have a point with your example @broot. Thanks. Dividing 3 by 2 gives 1, which then matches the zeroth index.
Either I need to divide by 2.0 to get 1.5 … matching index 1. Or I need to use the inequality operator.

Yes that is certainly shorter and clearer, but I worry about the extra list

As in many cases, you’ ll have to tradeoff between memory and CPU : either you traverse the list twice and compute the sums twice, or you do it once and save stuff in memory to speed up afterwards. For the later, an improvement would be a dichotomy search on the sums array, since it is in ascending order by construction, which will give n+log2(n) instead of 2n. But I guess you’ll have to measure against your typical datasets, because the multiplicative constants for each term will probably make the difference.