MultiHub Forum

Our logging and monitoring setup is starting to show its age. We're dealing with terabytes of logs daily across hundreds of microservices, and our current ELK stack is struggling to keep up.

I'm looking to overhaul our logging and monitoring approach. What tools and architectures are people using for observability at scale these days? I'm particularly interested in solutions that help reduce errors by catching issues before they become outages.

Are there any DevOps logging and monitoring tools that have really impressed you with their scalability and ease of use?

We moved from ELK to Loki for logs and it's been fantastic. Loki indexes only the metadata (labels) rather than the full text of every log line, and stores the lines themselves as compressed chunks, so it's much more cost-effective at scale. We're ingesting about 2TB of logs daily and the costs are maybe 20% of what they were with Elasticsearch.
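For anyone wondering why label-only indexing is cheaper, here's a toy Python sketch of the idea. It is not Loki's actual storage engine; the class name `TinyLabelIndexedStore` and the sample labels and log lines are made up for illustration. The index holds only label sets, and log lines are compressed and scanned only after the labels have narrowed the search.

```python
import zlib
from collections import defaultdict


class TinyLabelIndexedStore:
    """Toy model (hypothetical): index labels only, keep lines in compressed chunks."""

    def __init__(self):
        # label set -> list of compressed chunks; the line text is never indexed
        self._chunks = defaultdict(list)

    def push(self, labels, lines):
        """Compress a batch of lines and file it under its label set."""
        key = frozenset(labels.items())
        blob = zlib.compress("\n".join(lines).encode("utf-8"))
        self._chunks[key].append(blob)

    def query(self, selector, contains=""):
        """Pick streams by label, then decompress and grep only those chunks."""
        wanted = set(selector.items())
        results = []
        for key, blobs in self._chunks.items():
            if not wanted <= key:
                continue  # labels don't match, so this chunk is never decompressed
            for blob in blobs:
                for line in zlib.decompress(blob).decode("utf-8").split("\n"):
                    if contains in line:
                        results.append(line)
        return results


store = TinyLabelIndexedStore()
store.push({"app": "checkout", "env": "prod"}, ["GET /cart 200", "POST /pay 500 timeout"])
store.push({"app": "search", "env": "prod"}, ["GET /q 200"])
errors = store.query({"app": "checkout"}, contains="500")
```

The trade-off is that text search costs a scan over the matching chunks, which is why keeping labels few and well chosen matters with this kind of design.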

For metrics, we're using Prometheus with Thanos for long-term storage. The key was setting up proper retention policies and downsampling for older data.
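To show what downsampling older data means in practice, here's a minimal Python sketch, assuming samples arrive as `(timestamp_seconds, value)` pairs. The function name `downsample` and the sample data are hypothetical, and this is not Thanos code. As far as I know, the Thanos compactor does something similar at 5-minute and 1-hour resolutions, with retention per resolution set through its `--retention.resolution-raw`, `--retention.resolution-5m`, and `--retention.resolution-1h` flags.

```python
from collections import defaultdict


def downsample(samples, window_seconds=300):
    """Collapse (timestamp, value) samples into one aggregate per window."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window
        buckets[ts - ts % window_seconds].append(value)
    return {
        start: {"count": len(v), "sum": sum(v), "min": min(v), "max": max(v)}
        for start, v in sorted(buckets.items())
    }


# Hypothetical raw samples: three in the first 5-minute window, two in the second
raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 10.0), (540, 4.0)]
rollup = downsample(raw)
```

Keeping count, sum, min, and max per window means averages and extremes can still be queried after the raw samples have been dropped.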

We're using Datadog for our logging and monitoring. Yeah, it's expensive, but the unified platform for logs, metrics, and APM has been worth it for us. The AI-powered anomaly detection has actually caught several issues before they became outages.

The key to making it work at scale is being smart about what you log. We implemented structured logging with specific log levels and sampling for debug logs.
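Here's a rough sketch of what structured logging with debug sampling can look like using only Python's standard `logging` module. The names `JsonFormatter`, `DebugSampler`, and `build_logger`, and the `fields` key passed through `extra`, are my own assumptions for illustration, not anything Datadog requires.

```python
import io
import json
import logging
import random


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.xxx(..., extra={"fields": {...}})
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


class DebugSampler(logging.Filter):
    """Keep everything above DEBUG; keep DEBUG records with probability `rate`."""

    def __init__(self, rate, rng=None):
        super().__init__()
        self.rate = rate
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno > logging.DEBUG:
            return True
        return self.rng.random() < self.rate


def build_logger(name, stream, debug_rate):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(DebugSampler(debug_rate))
    logger.handlers = [handler]
    return logger


buffer = io.StringIO()
# debug_rate=0.0 drops all debug lines so this demo is deterministic
log = build_logger("checkout", buffer, debug_rate=0.0)
log.debug("cache miss", extra={"fields": {"key": "cart:42"}})
log.error("payment failed", extra={"fields": {"order_id": "A-1001"}})
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

In a real service the rate would be something small but nonzero (say 0.01), so a sample of debug output survives while warnings and errors are always kept.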

We built a custom observability platform using OpenTelemetry. It was a lot of work upfront but now we have complete control over our data pipeline. We can send traces to Jaeger, metrics to Prometheus, and logs to whatever storage makes sense for each use case.

Our biggest win for error reduction was implementing distributed tracing. Being able to follow a request through all our microservices makes debugging so much faster.
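The mechanism that lets a request be followed across services is context propagation: every hop forwards the same trace ID and mints a new span ID. Below is a minimal stdlib-only Python sketch using the W3C `traceparent` header layout (`version-traceid-spanid-flags`). The helper names `start_trace` and `child_headers` are hypothetical, and in practice the OpenTelemetry SDK handles this for you.

```python
import re
import secrets

# W3C traceparent: version "00", 32-hex trace id, 16-hex span id, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_trace():
    """Create a traceparent value for a brand-new, sampled trace."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"


def child_headers(incoming_headers):
    """Build headers for an outgoing call: same trace id, fresh span id."""
    match = TRACEPARENT_RE.match(incoming_headers.get("traceparent", ""))
    if match is None:
        # No valid context came in, so this service starts the trace
        return {"traceparent": start_trace()}
    trace_id, _parent_span, flags = match.groups()
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}


# Simulated hops: edge gateway -> orders service -> payments service
edge = {"traceparent": start_trace()}
to_orders = child_headers(edge)
to_payments = child_headers(to_orders)
```

Because the trace ID is identical at every hop, a backend such as Jaeger can stitch the spans from each service into a single request timeline.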

One more thing on logging and monitoring: we started using OpenCost for cloud cost monitoring, integrated with our observability stack. Being able to correlate cost spikes with specific deployments or traffic patterns has been incredibly valuable for both performance optimization and cost reduction.
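As a rough illustration of correlating cost spikes with deployments, here's a small Python sketch. It does not call the OpenCost API; the function names, the 1.5x spike threshold, the one-hour lookback window, and all of the cost and deployment data are made up for the example.

```python
from datetime import datetime, timedelta


def find_cost_spikes(hourly_costs, factor=1.5):
    """Return timestamps where cost exceeds `factor` times the previous hour's cost."""
    spikes = []
    for (_, prev), (ts, cost) in zip(hourly_costs, hourly_costs[1:]):
        if prev > 0 and cost > factor * prev:
            spikes.append(ts)
    return spikes


def deployments_before(spike, deployments, window=timedelta(hours=1)):
    """List deployments that landed within `window` before the spike."""
    return [name for name, at in deployments if spike - window <= at <= spike]


# Hypothetical data: hourly cost in dollars, plus two deployment events
t0 = datetime(2024, 1, 1, 12, 0)
costs = [(t0 + timedelta(hours=h), c) for h, c in enumerate([10.0, 10.5, 22.0, 21.0])]
deploys = [
    ("search-v2", t0 + timedelta(minutes=30)),
    ("checkout-v7", t0 + timedelta(minutes=100)),
]
suspects = {ts: deployments_before(ts, deploys) for ts in find_cost_spikes(costs)}
```

Here the jump at 14:00 lines up with the `checkout-v7` deployment at 13:40, which is the kind of lead you would then confirm against traces and traffic metrics.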

Jonathan21

EllaJL

DavidVM

Jack.L

Justin_P