MultiHub Forum

Full Version: Diagnosing memory leaks causing OOM after a week in a long-running Java app
I'm maintaining a legacy Java application that's been running in production for years, and we've started experiencing gradual performance degradation followed by OutOfMemoryErrors after about a week of uptime. I suspect a memory leak, but I'm struggling to pinpoint it with the standard profiling tools. I've taken heap dumps and analyzed them with Eclipse MAT, but the sheer size and complexity of the object graphs make it difficult to isolate the root cause. For developers who have hunted down subtle memory leaks in long-running JVM applications, what specific techniques or tools did you find most effective? Should I focus on monitoring specific garbage collector behavior or are there more advanced heap analysis strategies I should try?
Here's a practical triage plan you can start today. First, verify whether the OOM is heap or non-heap (metaspace or native) from the error message and GC patterns, then collect repeated heap dumps under ongoing load. Use the heap histogram and dominator tree in MAT to identify large retainers, and pair that with Java Flight Recorder for long-running traces. Common culprits are unbounded caches, static singletons, ThreadLocals, and unclosed resources. Work through one suspect at a time and re-test under load after each change.
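To make the first step concrete, here is a minimal sketch of sorting an OutOfMemoryError message into heap, metaspace, or native, plus a listing of the JVM's memory pools by type. The class name `OomTriage` and the three category labels are my own; the message fragments are the common HotSpot ones, and other JVMs or versions may word them differently.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper, not part of any library.
final class OomTriage {
    /** Maps an OutOfMemoryError message to a coarse memory area. */
    static String classify(String message) {
        if (message == null) return "unknown";
        if (message.contains("Java heap space")
                || message.contains("GC overhead limit exceeded")) {
            return "heap";
        }
        if (message.contains("Metaspace")
                || message.contains("Compressed class space")) {
            return "metaspace";
        }
        if (message.contains("Direct buffer memory")
                || message.contains("native thread")) {
            return "native";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        // List every pool the JVM exposes, tagged HEAP or NON_HEAP.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage == null) continue;
            // getMax() is -1 when the pool has no defined maximum.
            System.out.printf("%-9s %-35s used=%d KiB max=%d%n",
                    pool.getType(), pool.getName(),
                    usage.getUsed() / 1024, usage.getMax());
        }
    }
}
```

If the message lands in "metaspace" or "native", heap dumps alone probably won't show the culprit, so it's worth knowing that before spending more time in MAT.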
Be cautious about relying on a single dump; long-running leaks can be subtle. Favor time-series profiling (JFR) and compare heap growth across several dumps to confirm a real trend before chasing a root cause.
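One cheap way to compare growth between snapshots, before opening full dumps, is to diff class histograms. The sketch below assumes you saved text histograms (for example from `jmap -histo <pid>`) and that each data row looks like `<rank>: <instances> <bytes> <class name>`; `HistogramDiff` and its methods are names I made up for illustration, so adjust the pattern if your JDK prints rows differently.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: diffs two class histograms saved as text.
final class HistogramDiff {
    // Assumed row shape: "<rank>: <instances> <bytes> <class name> [(module)]"
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Parses histogram lines into class name -> total shallow bytes. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> bytesByClass = new HashMap<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (m.find()) { // header and separator lines simply do not match
                bytesByClass.merge(m.group(3), Long.parseLong(m.group(2)), Long::sum);
            }
        }
        return bytesByClass;
    }

    /** Returns classes whose byte total grew, largest growth first. */
    static List<Map.Entry<String, Long>> growth(Map<String, Long> before,
                                                Map<String, Long> after) {
        List<Map.Entry<String, Long>> deltas = new ArrayList<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long delta = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (delta > 0) {
                deltas.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), delta));
            }
        }
        deltas.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return deltas;
    }
}
```

Feed it the lines of two histogram files taken hours apart (e.g. via `Files.readAllLines`). A class that tops the growth list across three or more snapshots is a much better lead than whatever happens to be biggest in a single dump.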
Consider tools beyond MAT: VisualVM, YourKit, or JProfiler for live heap and allocation profiling. Enable GC logging as well, and use the allocation hotspot views to see which call sites create most of the objects.
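If you can't attach a profiler in production, the GC signal worth watching can also be sampled from inside the JVM with the standard `java.lang.management` beans. This is a rough sketch, not a replacement for GC logs: `PostGcHeapSampler` is a made-up name, the sample count and interval are arbitrary defaults, and how meaningful `getCollectionUsage()` is depends on the collector in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical sampler: tracks how much heap survives collections.
final class PostGcHeapSampler {
    /** Sums heap-pool usage as recorded after each pool's most recent collection. */
    static long liveHeapBytesAfterLastGc() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) continue;
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported
            if (afterGc != null) total += afterGc.getUsed();
        }
        return total;
    }

    /** Total collections reported by all collectors so far. */
    static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long c = gc.getCollectionCount(); // -1 when the collector does not report it
            if (c > 0) count += c;
        }
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        // Arbitrary defaults: 3 samples, 1 second apart. Use minutes in practice.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        long intervalMs = args.length > 1 ? Long.parseLong(args[1]) : 1000L;
        for (int i = 0; i < samples; i++) {
            System.out.printf("t=%d gcCount=%d liveAfterGc=%d KiB%n",
                    System.currentTimeMillis(), totalGcCount(),
                    liveHeapBytesAfterLastGc() / 1024);
            if (i < samples - 1) Thread.sleep(intervalMs);
        }
    }
}
```

The number to plot is the post-GC figure. If that floor keeps climbing over days while the load stays flat, you are looking at retained objects rather than just a busy allocator.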
Look for concrete patterns: caches that never shrink, large maps that only ever gain entries (often keyed by strings), or ThreadLocal values that outlive the task that set them on a pooled thread. Also check for classloader leaks after redeploys.
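For the first and third patterns, the usual fixes look roughly like this. Both classes are illustrative names of mine, not anything from your codebase: `BoundedCache` caps a map with LRU eviction using the stock `LinkedHashMap` hook, and `RequestContext` shows clearing a ThreadLocal in a `finally` block so a pooled thread does not carry the value into its next task.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical LRU cache: evicts the least recently used entry past maxEntries. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true: get() moves an entry to the "newest" end
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion.
        return size() > maxEntries;
    }
}

/** Hypothetical per-task context that never outlives the task. */
final class RequestContext {
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    static String current() {
        return CURRENT.get();
    }

    static <T> T runWith(String value, Supplier<T> task) {
        CURRENT.set(value);
        try {
            return task.get();
        } finally {
            CURRENT.remove(); // always clear, even if the task throws
        }
    }
}
```

Note that `LinkedHashMap` is not thread-safe, so wrap the cache with `Collections.synchronizedMap` or use a dedicated caching library if several threads touch it.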
Set up a staged profiling plan: reproduce load in a staging environment, collect data for 24–48 hours, then apply targeted fixes and re-test; keep a changelog and rollback plan.