CMSClassUnloadingEnabled at Redfin

tl;dr: If you’re using the CMS collector in Java 6 (-XX:+UseConcMarkSweepGC) server-side, then you should really consider -XX:+CMSClassUnloadingEnabled

Today’s post starts out with a story. You might have heard a similar story before. It begins with a handful of servers in a datacenter that just… pause sometimes. There doesn’t seem to be any rhyme or reason – it just happens. And when they pause, database transactions start piling up behind locks held by the affected machine, and requests to a simple status controller can take tens of seconds to return. Then the warning emails arrive about a long garbage collection event on webserver #3, and your worst fears are confirmed…

A couple of years ago here at Redfin, we started having website “brownouts” where webapps would just stop responding for anywhere from 30 seconds to 5 minutes at a time. After some digging, we determined that garbage collection was the cause. We spent a couple of weeks running experiments and trying out different settings, and in the end we dramatically decreased the frequency of the pauses by making two changes. We tweaked the heap and permgen sizes, and we switched from the default (Parallel) collector to the Concurrent (CMS) collector, which is better suited for applications with short pause time requirements (like our website).

The CMS collector is described in detail in the Sun (now Oracle) Hotspot Guide and we highly recommend reading it, but for the purposes of this post, the biggest difference between the Parallel collector and the CMS collector is how each goes about cleaning up the old generation of the heap. In the Parallel collector, when the oldgen or permgen fills up completely, the garbage collector pauses all the running threads and then uses multiple threads to traverse the whole heap (young, old, and perm gens) and clean up. The CMS collector does as much as it can concurrently with the running application, and while it does have to pause twice during each collection, its pauses tend to be much shorter. Unfortunately, the CMS collector by default only helps with oldgen (tenured) collections. The JVM provides a couple of other flags (-XX:+UseParNewGC, for example) that allow you to use a parallel (multi-threaded) collector on the young generation while using the CMS collector on the old generation, but that still leaves the opportunity for the permgen to fill up.
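If you want to see which collector a running JVM is actually using for each generation, the standard java.lang.management API can list them. The following is a minimal sketch, not something from our codebase; the class name CollectorReport is made up for illustration. With CMS enabled, HotSpot typically reports collectors named something like "ParNew" and "ConcurrentMarkSweep", but the exact names vary by JVM version and vendor.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: lists the collectors registered in this JVM.
final class CollectorReport {

    /** Returns one human-readable line per registered garbage collector. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start (-1 if the JVM does not track them).
            lines.add(String.format("%s pools=%s collections=%d totalMillis=%d",
                    gc.getName(),
                    Arrays.toString(gc.getMemoryPoolNames()),
                    gc.getCollectionCount(),
                    gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeCollectors()) {
            System.out.println(line);
        }
    }
}
```

The pool names printed for each collector show which generations it is responsible for, which is a quick way to confirm that the flags you passed had the effect you expected.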

In our webapp, there are a couple of things that cause permgen usage to grow over time. The first is that we use Hibernate in our data access layer, and Hibernate creates proxy objects on the fly. Those proxy objects have class and method definitions that are stored in the permgen, which causes the permgen to grow slightly over time. Another thing that causes permgen usage to grow for us is that our content management system pushes JSP files to our webservers that then have to be compiled by Tomcat – the definitions for those classes are stored in permgen as well. Things like classloader leaks can also cause permgen to grow, but we don’t reload web applications with Tomcat in production, so that doesn’t really apply in our situation.
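One way to watch this kind of growth from inside the JVM is to read the memory pool that holds class metadata. This is a rough sketch under a couple of assumptions: the class name ClassMetadataUsage is invented for this example, and the pool-name matching assumes HotSpot's naming (pools containing "Perm Gen" on older JVMs, "Metaspace" on newer ones that no longer have a permgen).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Locale;

// Hypothetical helper: reports usage of the pool(s) that hold class definitions.
final class ClassMetadataUsage {

    /** Assumed HotSpot pool names; other JVMs may name these pools differently. */
    static boolean isClassMetadataPool(String poolName) {
        return poolName.contains("Perm Gen") || poolName.equals("Metaspace");
    }

    /** Formats usage; maxBytes <= 0 means the pool reports no defined maximum. */
    static String format(String name, long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return String.format(Locale.ROOT, "%s: %d KB used (no max)", name, usedBytes / 1024);
        }
        return String.format(Locale.ROOT, "%s: %d KB used of %d KB (%.1f%%)",
                name, usedBytes / 1024, maxBytes / 1024, 100.0 * usedBytes / maxBytes);
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            if (usage != null && isClassMetadataPool(pool.getName())) {
                System.out.println(format(pool.getName(), usage.getUsed(), usage.getMax()));
            }
        }
    }
}
```

Logging this periodically would show whether usage creeps upward between collections, which is the pattern described above.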

When permgen fills up it is a Very Bad Thing. By default, the CMS collector doesn’t collect from the permgen, so when the permgen fills up the JVM triggers a “Stop the World” pause. EVERYTHING is collected – the young, old, and permanent generations. These “Stop the World” pauses are the ones that were causing our five-minute brownouts in production. They sucked, but after our tuning they were much less frequent and were mostly covered up by our load balancer, so we were content to move on to other things (read: shipping features). And that worked pretty well up until a couple of months ago, when the pauses started happening more frequently again…

This time we were a lot more prepared. Armed with a post-mortem writeup of what we did the last time servers were pausing, we went about setting up some experiments on prod boxes with tweaks to heap sizes. But this time in the course of refreshing our memories about the various JVM options, we stumbled across one we hadn’t tried before: -XX:+CMSClassUnloadingEnabled. It wasn’t really obvious at first, but this would turn out to be a huge help in our quest for shorter worst-case GC pause times.

Finding documentation that mentions -XX:+CMSClassUnloadingEnabled is difficult unless you are searching for it explicitly. And even then, what you find isn’t always up-to-date or helpful. But if you wade through it all, eventually you’ll find that -XX:+CMSClassUnloadingEnabled does pretty much exactly what it says it does: it allows the CMS collector to sweep the permgen during oldgen GC and unload classes that are no longer in use, so you don’t have to wait for it to fill up and trigger a Stop the World GC.

Eureka! That’s it! (Right!?)

Wait… If it’s really a silver bullet, why isn’t it on by default? And how do you know it isn’t on by default?

Well, we knew it wasn’t on by default because the permgen was never being reclaimed except during full collections. We confirmed it by passing -XX:+PrintFlagsFinal as a JVM option as described in Inspecting Hotspot JVM Options. By passing that flag to java, you can see the options available in the JVM and what their values are when the application is run. Some additional digging also revealed this email thread, which explains that -XX:+CMSClassUnloadingEnabled can cause the remark (second) pause during CMS GC to take longer, which is why it isn’t enabled by default. So it might be a silver bullet, but not for every use case.
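The same check can also be done from inside a running process, which is handy when you can't easily restart with -XX:+PrintFlagsFinal. The sketch below uses HotSpot's com.sun.management.HotSpotDiagnosticMXBean; the class name FlagCheck is made up for illustration, and the lookup via ManagementFactory.getPlatformMXBean assumes Java 7 or later (on Java 6 the bean has to be obtained another way, for example through ManagementFactory.newPlatformMXBeanProxy). JVMs that no longer ship the CMS collector will simply report that the flag does not exist.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports the current value of a HotSpot flag.
final class FlagCheck {

    /** Returns a one-line description of the flag, or a note that it is unavailable. */
    static String describeFlag(String flagName) {
        // Java 7+ lookup; returns null on JVMs that do not expose this HotSpot-specific bean.
        HotSpotDiagnosticMXBean hotspot =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (hotspot == null) {
            return flagName + " cannot be inspected on this JVM";
        }
        try {
            VMOption option = hotspot.getVMOption(flagName);
            // Origin says where the value came from (default, command line, and so on).
            return flagName + " = " + option.getValue() + " (origin: " + option.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            // Thrown when no VM option with this name exists.
            return flagName + " is not a flag on this JVM";
        }
    }

    public static void main(String[] args) {
        String flag = args.length > 0 ? args[0] : "CMSClassUnloadingEnabled";
        System.out.println(describeFlag(flag));
    }
}
```

Run with no arguments it checks CMSClassUnloadingEnabled; pass any other flag name as the first argument to check that one instead.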

We haven’t measured the exact length of the “longer” remark pause yet, but it is definitely less than 15s in all but a very small number of collections (maybe 3 in the last week). For us, that means it’s time to get back to shipping features, until the next time it’s time to tweak GC settings.
