Generating warnings for Autoboxing?

Hey,

In Java it is possible to generate warnings (even errors) for autoboxing. It seems that Kotlin does not provide that option. I now often decompile the Kotlin bytecode to Java and double-check there, but this is obviously a hassle.

I also quickly skimmed through some static code analysis tools, but couldn’t find any that provide that feature. Given that boxing is often quite subtle, and that Kotlin is also expanding on the server side (where high performance is needed in special circumstances), I am wondering how others are dealing with this?

(Yes, I benchmark all the time, and yes, autoboxing produces almost all of my garbage, since I am pooling the other intensively used objects.)

This is probably something that could be done with an IntelliJ inspection. You should create a feature request for it at https://kotl.in/issue.
I could also see this as a compiler warning, but right now there is no way to enable/disable specific warnings compiler-wide (KT-8087), and an always-on warning would be annoying in most projects. So this will have to wait until that is possible.

Did you measure the performance gain without a profiler attached?

Boxing of primitives is the most important Kotlin performance problem. There have been discussions about prohibiting autoboxing in some cases, but it is clearly not simple to implement. There are ways to work around boxing problems; I think for that we need to work on common libraries.

By the way, using GraalVM in many cases allows you to avoid the boxing overhead.

I should have been more clear in my opening post. The issue is primarily the extensive strain on garbage collection which makes the application output less reliable due to constant pauses (also with ZGC/Shenandoah). Having a tool that assists with avoiding the boxing pitfalls would be a massive help.

Thanks for the link, filed an issue for IntelliJ inspections.

100% agree. Kotlin is less explicit about boxing than Java and does more magic in the background to avoid any type errors between boxed and primitive values. But there should be an option to make this more explicit.
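To illustrate what “more explicit” means here, a minimal Java sketch of my own (not from this thread): in Java the wrapper type appears right in the source, so the compiler or IDE can flag the boxing conversion at that exact line, whereas in Kotlin the same declared type (`Int`) may or may not be boxed depending on nullability and generics.

```java
public class ExplicitBoxing {
    public static void main(String[] args) {
        int primitive = 42;
        Integer boxed = primitive; // autoboxing, but the Integer type is visible
                                   // in the source, so javac/IDEs can warn here
        System.out.println(boxed); // prints "42"

        // In Kotlin the equivalent would be written `Int` in both cases;
        // whether it boxes depends on nullability/generics, not the name.
    }
}
```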

Thank you @tscha,
your proposal is reasonable.
However, please answer my question: did you measure it?
How much improvement do you expect? 5%? 10%? 15%?

Really?
This is not my experience, I would have said KT-16222.

KT-16222 is specific to coroutines, and you usually do not use coroutines in performance-critical parts. I am talking about problems which cannot simply be avoided. In my experience, boxing is almost the sole reason for the “slowness” of Java/Kotlin generic algorithms in numeric tasks. Of course it is possible to use non-generic algorithms, but that requires almost as much effort and performance tuning as in native languages.

Hi @darksnake,
these considerations really depend on the software being measured; other programs can be heavily affected by KT-21147.

I agree. I personally work a lot with numeric software, so I have my own specific pains.

Again, in my case it’s not about the performance, it’s about the garbage. It is not really important to me how much non-boxing would improve my performance in terms of throughput, so I never measured the relative computing share. I guess it’s negligible. But the garbage that gets produced as a side product of boxing is enormous, since the GC runs every few seconds and pauses the system for like 1ms-2ms. And I can see in the profiler/sampler that pretty much all garbage comes from boxing (as I am pooling and reusing all other frequently used objects).

Of course it’s possible to find these boxing events manually, but it’s surprisingly annoying to do this in Kotlin (compared to Java) since the former is less explicit about boxing. It’s unfortunate because the JVM is a great platform for projects like mine. You can write normal, high-level code for methods that are not called frequently, and focus on speed (pooling, primitives, cache awareness) for the parts where it matters. All without switching to a low-level language while still being close in terms of performance. A good example of this is Rapidoid, a Java web framework that is on par with the fastest C++ frameworks out there.

My experience isn’t quite the same as tscha’s, but does support it.

I used to work on high-throughput systems, and some of the best performance increases came from avoiding temporary objects — not by reducing the processing in that thread, but by reducing the frequency of garbage collections and the processor time spent in them.

(For example, we had many methods that would construct a String by concatenating the results of several other methods — some of which would do the same — generating loads of little Strings. So one simple technique that paid huge dividends was to rewrite those methods to accept a StringBuilder parameter and append their results to that instead. Similarly, instead of returning a List or array, a method could append its results to a passed-in List. That saved shedloads of temporary objects — especially when the top-level caller could then reuse the builder/list — and cut the frequency of garbage collections to a small fraction.)
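The append-to-builder technique described above can be sketched like this (a minimal illustration of my own; the method names are hypothetical):

```java
public class AppendDemo {
    // Before: every call allocates at least one intermediate String.
    static String formatBefore(String name, int count) {
        return "user=" + name + " count=" + count;
    }

    // After: the caller supplies one reusable StringBuilder, and nested
    // methods append into it instead of returning fresh Strings.
    static void formatAfter(StringBuilder sb, String name, int count) {
        sb.append("user=").append(name).append(" count=").append(count);
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        formatAfter(sb, "alice", 3);
        System.out.println(sb); // prints "user=alice count=3"
        sb.setLength(0);        // reuse the same builder for the next message
        formatAfter(sb, "bob", 5);
        System.out.println(sb); // prints "user=bob count=5"
    }
}
```

The same shape works for Lists: pass a collection in and have the callee add to it, rather than returning a freshly allocated one.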

It’s true that creating lots of temporary objects is very fast on the JVM (certainly as compared to non-GC languages) — but some of the saving is then spent in garbage-collecting them. That’s a huge win for most programs, especially when there’s enough idle time to cover it; but can be a lot less so for those that don’t.

So yes, I can see that reducing and/or warning about autoboxing could be very helpful in some cases.

It wasn’t related to “performance”, but I have to assume that you expect some kind of “improvement”.

How did you measure the garbage strictly related to the autoboxing feature?

0.01% overhead does not look so bad.

Did you use JMH, async-profiler, or some other, possibly broken, profiler on a cold JVM?

@tscha, I am not against your proposal, it looks reasonable.
But it may direct the developer’s attention to an irrelevant issue. This could be your case.

I agree. 0.1% seems to be a rather regular GC price, and it cannot be significantly diminished. The solution for limiting GC time is not avoiding boxing, but using pooling, as is done in kotlinx-io.

I looked at the VisualVM sampler and saw that practically all my delta objects were boxed values (i.e. Double). Not much need to further investigate.

You are still thinking in terms of throughput, as you are just dividing the GC time by the total time. My main concern is not performance in terms of throughput, it is the act of collection itself, because it makes the system unreliable. If an operation that normally takes 4us sometimes takes 2000us, this is bad for my use case. So if I had to choose between

  • a 0.1% performance decrease with unreliable spikes and
  • a 10% constant performance decrease

I’d always take the second one. Obviously, because if performance were really that critical I wouldn’t write it on the JVM in the first place. I just cannot deal with randomly occurring 50,000% time increases.

Anyhow, I found the issue: one function deep down was using the Number interface behind the scenes, which caused the conversion. Now the GC just has to collect objects that are created once in a while, instead of millions of needless boxed objects that serve no purpose. And without GCs, and with a thread pinned to a core, jitter will hopefully be minimal (though I still have to run tests on that).
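A sketch of that pitfall (my own minimal reconstruction; the method names are hypothetical, not the actual function from my codebase): any API typed against Number forces every primitive argument to be boxed on the way in.

```java
public class NumberBoxing {
    // Every call through this signature boxes the argument first:
    // viaNumber(x) compiles to viaNumber(Double.valueOf(x)).
    static double viaNumber(Number n) { return n.doubleValue(); }

    // Primitive overload: no allocation at all.
    static double viaPrimitive(double d) { return d; }

    public static void main(String[] args) {
        double x = 2.0;
        System.out.println(viaNumber(x));    // prints "2.0", but allocated a Double
        System.out.println(viaPrimitive(x)); // prints "2.0" with no allocation
    }
}
```

Called millions of times per second, the first variant alone can account for essentially all of an application’s garbage.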

I went down exactly the same route. First I rewrote my methods to just create lists/arrays once and then pass them to other methods as a parameter (and yes, that made a noticeable impact on performance, but arrays and lists are also more expensive to create than boxed primitive objects - especially when you need to create them 5-10 times for each run). Now I am also pooling the arrays themselves, so no arrays get created at all (except at the start of the application). So I put the primitive data in pooled objects, pull an array out of my array-pool and just assign the objects to the respective indices.
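A toy version of the array pooling described above (a sketch of my own under the stated assumptions, not my production code): arrays are allocated up front and recycled, so steady-state operation allocates nothing.

```java
import java.util.ArrayDeque;

public class ArrayPool {
    private final ArrayDeque<double[]> free = new ArrayDeque<>();
    private final int arraySize;

    public ArrayPool(int arraySize, int preallocate) {
        this.arraySize = arraySize;
        // Allocate everything once, at application start.
        for (int i = 0; i < preallocate; i++) free.push(new double[arraySize]);
    }

    public double[] acquire() {
        double[] a = free.poll();
        return a != null ? a : new double[arraySize]; // fallback if pool runs dry
    }

    public void release(double[] a) {
        free.push(a); // caller must not touch 'a' after releasing it
    }

    public static void main(String[] args) {
        ArrayPool pool = new ArrayPool(4, 2);
        double[] a = pool.acquire();
        a[0] = 1.0;
        pool.release(a);
        double[] b = pool.acquire(); // the same instance, recycled
        System.out.println(a == b);  // prints "true"
    }
}
```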

Sometimes things can be tricky. For example, I had to put objects into a sorted set. I often iterated over this set. Java’s TreeSet is fast, but it creates temporary objects when iterating (!?) over the set. The fastutil package, on the other hand, has implementations that let you iterate for free. However, even fastutil creates Entry objects when adding/removing objects to the set (which I also did quite often). So I had to modify their implementation to hold an entry pool and put Entry objects back into this pool after the underlying value has been removed from the set. When adding, you just reuse an old Entry object and assign your value to it.

I disagree. There are some use cases that do not want a “regular GC price”. What they want is a GC price that is much lower than regular. Which is reasonable and absolutely possible - there are high performance JVM systems running garbage collections once a day. Kotlin just makes it needlessly hard to develop them.

These two things are orthogonal. You cannot pool boxed objects because they are immutable. You use pooling for mutable objects, which hold other pooled mutable objects or, mostly, primitive values. Boxing happens when you then call functions on these pooled objects and the functions sneakily convert your precious primitives into immutable throwaway objects.
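A small sketch of that interaction (my own illustration; the names are hypothetical): the pooled holder keeps its data in a primitive field, so reusing it allocates nothing, yet passing that field through a generic function boxes it all the same.

```java
import java.util.ArrayDeque;

public class Holder {
    public double value; // primitive field: mutating it never allocates

    static final ArrayDeque<Holder> POOL = new ArrayDeque<>();

    public static Holder acquire(double v) {
        Holder h = POOL.poll();
        if (h == null) h = new Holder(); // only allocate when the pool is empty
        h.value = v;
        return h;
    }

    public static void release(Holder h) { POOL.push(h); }

    // The trap: a generic signature silently boxes the primitive argument.
    public static <T> T passThrough(T t) { return t; }

    public static void main(String[] args) {
        Holder h = acquire(1.0);
        h.value = 2.0;                       // no allocation
        Double boxed = passThrough(h.value); // allocates a throwaway Double
        System.out.println(boxed);           // prints "2.0"
        release(h);                          // holder goes back to the pool
    }
}
```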

Just curious, do you have any formal latency/response-time requirement, like “p95/p99 less than N ms”?

Yes, measuring in percentiles makes a lot of sense in this use case. But I do not have any specific requirements.

In my case, we didn’t have any fixed targets, but we were processing real-time feeds of incoming data. So we had to be fast enough to read all the messages, do all the necessary transforming and processing and calculations, update internal data stores and external DBs, generate our own update streams to send out to other systems, &c, without dropping any messages — and still have time to respond to queries from other systems. When the incoming feeds were slow, it was easy — but at busy times when you’re reading 10,000s of messages/sec for 10 mins at a time, you really need the system to keep up.

When I started, some systems would chew through a 1.5GB heap every couple of seconds and then have to pause everything for a full GC. Even if that only takes 50msec, it’s still a significant drain — especially as it risks missing some incoming data. But the optimisations above reduced temporary objects, and hence heap and GC activity, by a couple of orders of magnitude, making it much more stable under high load.

I don’t know if you are still working on this project, but if you are, you should consider testing your application with either Shenandoah or ZGC (depending on whether you are using OpenJDK or Oracle). Their pause times are much shorter than 50ms, and, most importantly, they are independent of the heap size. Check out the slides here — average pause times of 1ms-2ms for ZGC. As far as I know, only Linux is supported at this time, but you’d have to check that yourself.

If you have some money to spend you might also look at Azul Zing.
