MultiHub Forum

I'm a mid-level software developer working on a complex distributed system, and I've hit a bug that only manifests in production under heavy load, making it nearly impossible to reproduce locally. I've exhausted my usual debugging techniques of log analysis and unit tests.

For senior engineers who have tackled similar elusive issues, what systematic approaches do you recommend? How do you effectively instrument a live system to capture the state without impacting performance, and what tools or strategies do you use for tracing requests across microservices? Are there specific patterns for concurrency or race condition bugs that I should be looking for, and how do you prioritize hypotheses when the error is intermittent and the stack trace is unhelpful?

Hannah_L