MultiHub Forum

Full Version: How can I handle webhook retries if a receiver is down, without a queue?
I'm building a service that needs to notify several other internal systems when a key event happens, and I'm trying to avoid a tangled mess of direct API calls. I've been reading about webhook design patterns for this kind of event-driven architecture, but I'm worried about reliability—what happens if one of the receiving systems is down? How do you handle retries and failures without building a whole queuing system?
You're right to worry. With webhook-style delivery you don't want one-shot fire-and-forget. A simple delivery log, an idempotency key, and a sane retry policy go a long way. Treat 2xx as success and 5xx as retryable. Most 4xx responses are client errors that won't succeed on a retry, so treat them as permanent; 408 and 429 are the usual exceptions worth retrying. Honor Retry-After when the receiver sends it, otherwise use exponential backoff with jitter.
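As a rough sketch of that status-code rule in Python: the function name `classify_response` and the decision labels are made up for illustration, treating 408/429 as retryable is an assumption on my part, and only the delta-seconds form of Retry-After is parsed (the HTTP-date form is ignored here).

```python
from typing import Optional, Tuple


def classify_response(status: int, retry_after: Optional[str] = None) -> Tuple[str, Optional[float]]:
    """Return (decision, delay_hint_seconds) for one delivery attempt.

    decision is "success", "retry", or "permanent_failure" (hypothetical labels).
    """
    # Only the "Retry-After: <seconds>" form is handled in this sketch.
    hint = None
    if retry_after is not None and retry_after.strip().isdigit():
        hint = float(retry_after.strip())

    if 200 <= status < 300:
        return ("success", None)
    # Assumption: 408 (timeout) and 429 (rate limited) are worth retrying,
    # along with every 5xx.
    if status in (408, 429) or 500 <= status < 600:
        return ("retry", hint)
    # Everything else (other 4xx, unexpected 3xx) is treated as permanent.
    return ("permanent_failure", None)
```

When the hint is None, the caller would fall back to its own backoff schedule.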
I’d skip a heavy queue at first and implement a lightweight retry store in your database. Save event_id, endpoint, payload, next_retry, and status. A tiny worker process wakes up on a schedule, picks up the rows whose next_retry has passed, and retries them. If a delivery keeps failing after N tries, move it to a dead-letter store.
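Here is a minimal sketch of such a retry store using Python's built-in sqlite3. The table name, the extra `attempts` column, the `MAX_ATTEMPTS` value, and the `send` callback are all assumptions for illustration. To keep it short, "dead letter" is just a status value in the same table rather than a separate store.

```python
import sqlite3
import time

MAX_ATTEMPTS = 5  # assumed value for "N tries"


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS deliveries (
               event_id   TEXT NOT NULL,
               endpoint   TEXT NOT NULL,
               payload    TEXT NOT NULL,
               next_retry REAL NOT NULL,
               attempts   INTEGER NOT NULL DEFAULT 0,
               status     TEXT NOT NULL DEFAULT 'pending',
               PRIMARY KEY (event_id, endpoint))"""
    )
    return db


def enqueue(db, event_id, endpoint, payload, now=None):
    """Record one delivery; enqueueing the same (event_id, endpoint) twice is a no-op."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT OR IGNORE INTO deliveries (event_id, endpoint, payload, next_retry) "
        "VALUES (?, ?, ?, ?)",
        (event_id, endpoint, payload, now),
    )
    db.commit()


def run_once(db, send, now=None):
    """Attempt every due delivery once. send(endpoint, payload) returns True on success."""
    now = time.time() if now is None else now
    due = db.execute(
        "SELECT event_id, endpoint, payload, attempts FROM deliveries "
        "WHERE status = 'pending' AND next_retry <= ?",
        (now,),
    ).fetchall()
    for event_id, endpoint, payload, attempts in due:
        attempts += 1
        if send(endpoint, payload):
            status, next_retry = "delivered", now
        elif attempts >= MAX_ATTEMPTS:
            status, next_retry = "dead", now  # parked for manual inspection
        else:
            # Plain exponential delay; jitter is left out to keep the sketch short.
            status, next_retry = "pending", now + 2 ** attempts
        db.execute(
            "UPDATE deliveries SET status = ?, attempts = ?, next_retry = ? "
            "WHERE event_id = ? AND endpoint = ?",
            (status, attempts, next_retry, event_id, endpoint),
        )
    db.commit()
```

The worker is then just a loop that calls `run_once(db, send)` and sleeps for a few seconds. With more than one worker you would also need some form of row claiming, which this sketch does not attempt.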
Idempotency matters. Retries mean the same event can arrive twice, so make sure nothing gets duplicated: have receivers deduplicate on an idempotency key, or make the endpoints idempotent by design.
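On the receiving side, deduplicating on the key could look roughly like this. The `processed` table and the `handle_event` / `apply_side_effect` names are hypothetical, and the sketch assumes the side effect either writes to the same database or is safe to repeat after a failure.

```python
import sqlite3


def open_receiver_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS processed (idempotency_key TEXT PRIMARY KEY)"
    )
    return db


def handle_event(db, idempotency_key, apply_side_effect):
    """Run apply_side_effect() at most once per key. Returns True if it ran."""
    # "with db" commits on normal exit and rolls back if an exception escapes,
    # so a failed side effect also un-records the key and a later retry is
    # processed normally.
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO processed (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
        if cur.rowcount == 0:
            return False  # key already recorded: duplicate delivery, skip it
        apply_side_effect()
    return True
```

A duplicate should still be answered with a 2xx so the sender stops retrying it.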
Backoff rules matter. Use per-endpoint backoff with some jitter, cap the maximum number of retries, and set a total time limit per delivery. This prevents a flood of requests when a receiver is down or just coming back up, and avoids thrashing.
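One common way to do this is "full jitter": pick a random delay between zero and an exponentially growing, capped ceiling. A small sketch, where the base, cap, retry limit, and total age limit are placeholder values and the function names are made up:

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Full-jitter delay in seconds: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def should_give_up(attempt, first_attempt_at, now, max_attempts=8, max_age=24 * 3600):
    """Stop after too many attempts or once the delivery is older than max_age seconds."""
    return attempt >= max_attempts or (now - first_attempt_at) >= max_age
```

Keeping the attempt counter per endpoint (not globally) means one dead receiver does not slow down deliveries to the healthy ones.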
Security-wise, sign payloads with a shared secret (an HMAC, for example) so recipients can verify the payload wasn't tampered with. Consider rotating keys periodically, and keep logging delivery attempts.
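A minimal signing sketch with Python's standard hmac module, assuming HMAC-SHA256 over the raw request body and a hex signature sent in a header of your choosing. The function names are illustrative, and accepting a list of secrets is just one simple way to support rotation.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body, computed by the sender."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secrets, body: bytes, signature: str) -> bool:
    """Receiver side: accept the signature if it matches any currently valid secret.

    Passing both the old and the new secret during a rotation window lets
    senders switch keys without dropped deliveries.
    """
    return any(
        # compare_digest is a constant-time comparison
        hmac.compare_digest(sign(s, body).encode(), signature.encode())
        for s in secrets
    )
```

The receiver has to verify against the exact bytes it received, before any JSON parsing or re-serialization, or the signatures will not match.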
Observability helps a lot: track delivery success rate, latency, and retries per endpoint; set alerts for spikes in failures; and run occasional chaos tests or simulated outages to see what breaks.
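If you don't have a metrics stack yet, even an in-process counter per endpoint gets you started. This is only a toy sketch: the class name, the threshold, and the minimum sample size are assumptions, and latency tracking is left out.

```python
from collections import defaultdict


class DeliveryStats:
    """Counts attempts and failures per endpoint."""

    def __init__(self):
        self.attempts = defaultdict(int)
        self.failures = defaultdict(int)

    def record(self, endpoint, ok):
        self.attempts[endpoint] += 1
        if not ok:
            self.failures[endpoint] += 1

    def failure_rate(self, endpoint):
        n = self.attempts[endpoint]
        return self.failures[endpoint] / n if n else 0.0

    def endpoints_over(self, threshold, min_attempts=5):
        """Endpoints whose failure rate exceeds threshold, ignoring tiny samples."""
        return sorted(
            e for e, n in self.attempts.items()
            if n >= min_attempts and self.failure_rate(e) > threshold
        )
```

Something like `stats.endpoints_over(0.5)` checked once a minute is enough to drive a basic alert until proper dashboards exist.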