Musings from debugging a production memory leak…

Much has been written on how to debug production memory leaks in Java, so rather than go into the nitty-gritty of how to comb through the heap, I thought I’d just highlight a few interesting points that came up as we were working on the issue…  Note that this is based on an issue that showed up in production but could not be replicated in test, so we had to start with a broad approach and then zero in on the cause…

  • Many people use the term “memory leak” to refer to a broad range of issues that cause memory use to go up.  The strictest definition of a “memory leak” is a situation in which the program has completely lost access to a chunk of memory and therefore can never recover it.  Other situations that can cause memory usage to go up include:
    • A cache that grows without bounds (a sketch of a bounded alternative follows this list)
    • A cache that is bounded, but exhausts all memory before it actually hits that limit
    • A single thread that loads a very large object into memory (e.g. reading a huge zip file entirely into memory)
    • A series of overlapping threads, each of which consumes a chunk of memory
  • Looking at graphs of memory usage can be really eye-opening; it can eliminate some theories and help you focus on the scenarios that match the actual shape of the graph.  For example, a staircase graph may indicate that something is grabbing a lot of memory infrequently.  Other shapes to consider include a sawtooth and a gradual slope.  It’s also interesting to check whether the memory graph is steeper when web traffic is higher and less steep at off-peak times.
  • When diagnosing a memory issue, it can be useful to start up the system, let it warm up so that caches etc. are loaded, and then take two snapshots at different points in time.  Comparing these snapshots (using a tool like YourKit) can then show you how the size and number of various objects changed after the warm-up phase.  A sketch of capturing such snapshots programmatically follows this list.
  • It is possible (even common!) for a Java program to allocate native memory that is not managed by the JVM.  One example would be when the program uses a 3rd-party C++ library that allocates memory.  This can be impossible to debug with Java profilers, unless there is also a Java object hanging around that has some sort of connection to the C++ memory buffer.  For example, if you’ve set -Xmx and -XX:MaxPermSize and your RSS is twice the total you’ve allowed, you’re probably looking at an off-heap issue (a small off-heap example follows this list)…
  • Some Java objects allocate native memory that is released only when the object is cleaned up by the garbage collector.  Since garbage collection is non-deterministic, a number of these objects can pile up before being collected, exhausting the non-heap memory.  ZipFile objects often seem to be implicated in these sorts of issues, and in fact that was related to the issue we eventually tracked down (see the last sketch below).
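
To make the cache scenarios above concrete, here is a minimal sketch of the usual guard against an ever-growing cache: a LinkedHashMap capped via removeEldestEntry.  The class name and size limit are illustrative, not taken from our system.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative bounded cache: evicts the least-recently-used entry once the
// cap is reached, so it cannot grow without bounds the way a plain HashMap
// used as a cache can.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Of course, as the second bullet notes, even a bounded cache can exhaust memory if the cap (or the size of individual entries) is too large for the heap you’ve given the JVM.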
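
For the snapshot comparison we used YourKit, but as a rough sketch, here is one way to capture heap dumps programmatically via the HotSpot diagnostic MXBean; the file names and the wait time are made up for illustration, and the resulting .hprof files can be diffed in your profiler of choice.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import com.sun.management.HotSpotDiagnosticMXBean;

// Rough sketch: capture two heap dumps after warm-up so they can be compared
// for growth in object counts and sizes.
public class HeapSnapshot {
    static void dump(String outputFile) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                server, "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        hotspot.dumpHeap(outputFile, true); // live = true: only reachable objects
    }

    public static void main(String[] args) throws Exception {
        dump("after-warmup.hprof");    // once caches etc. are loaded
        Thread.sleep(60L * 60 * 1000); // let the system run for a while
        dump("one-hour-later.hprof");  // then compare the two dumps
    }
}
```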
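
As a small illustration of off-heap memory from pure Java (no C++ library required), a direct ByteBuffer’s backing storage lives outside the heap: it doesn’t count against -Xmx, but it does show up in the process RSS.  The size here is arbitrary.

```java
import java.nio.ByteBuffer;

// Illustrative only: allocate 512 MB outside the Java heap.  The buffer's
// memory is invisible to heap profilers but counts toward the process RSS,
// and it is only released when the ByteBuffer itself is garbage collected.
public class OffHeapDemo {
    public static void main(String[] args) {
        ByteBuffer offHeap = ByteBuffer.allocateDirect(512 * 1024 * 1024);
        System.out.println("Heap max (bytes):      " + Runtime.getRuntime().maxMemory());
        System.out.println("Direct buffer (bytes): " + offHeap.capacity());
    }
}
```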
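
Finally, for the ZipFile case: the native structures behind a ZipFile are released promptly when close() is called, so closing it explicitly (e.g. with try-with-resources) avoids waiting for the garbage collector.  This is a generic sketch, not the actual code from our system.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Generic sketch: try-with-resources closes the ZipFile deterministically,
// releasing its native memory right away instead of waiting for the GC to
// clean up the object.  The archive name is made up.
public class ZipScan {
    public static void listEntries(String path) throws IOException {
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                System.out.println(entries.nextElement().getName());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        listEntries("some-archive.zip");
    }
}
```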

Production memory issues can definitely be nail-biters, but fixing the bug is also exhilarating.  We’re going to go crack open a few beers now – happy bug hunting to the rest of you!
