How can I diagnose production-only bugs under heavy load across microservices?
#1
I'm a mid-level software developer working on a complex distributed system, and I've hit a bug that only manifests in production under heavy load, which makes it nearly impossible to reproduce locally. I've exhausted my usual debugging techniques (log analysis and unit tests). For senior engineers who have tackled similar elusive issues, what systematic approaches do you recommend?

How do you instrument a live system to capture state without impacting performance, and what tools or strategies do you use for tracing requests across microservices? Are there specific patterns for concurrency or race condition bugs that I should be looking for? And how do you prioritize hypotheses when the error is intermittent and the stack trace is unhelpful?