Replay stuck jobs
The Ops service runs background work for post-call processing, webhook delivery, and platform housekeeping. When it stalls, symptoms include "the call ended but the conversation never finished writing", missing recordings, undelivered webhooks, and stale dashboards.
Diagnose first
- Is Ops actually unhealthy, or just slow? Check container health on the Ops host and the SigNoz dashboards for the Ops service.
- Where is the backlog? Most queues live in Redis. Check Redis on the Database host for key prefixes that match the affected workflow (e.g. webhook delivery, recording post-processing).
- Is the upstream healthy? A webhook backlog is often a tenant whose endpoint started returning 5xx; you don't want to "replay" a queue that is correctly retrying.
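To size the backlog quickly, it helps to group queue keys by workflow rather than eyeballing raw key listings. The sketch below is a minimal, hypothetical example: the key naming scheme (`workflow:state:id`) and the sample keys are assumptions, not the actual Delphi schema, and in production the key list would come from a `redis-cli --scan` over the affected prefix.

```python
from collections import Counter

def summarize_backlog(keys):
    """Group queue keys by their first two segments so the backlog
    collapses into per-workflow buckets (e.g. 'webhook:pending')."""
    buckets = Counter(":".join(k.split(":")[:2]) for k in keys)
    return buckets.most_common()

# Stand-in for `redis-cli --scan` output; real key names will differ.
sample = [
    "webhook:pending:42",
    "webhook:pending:43",
    "webhook:pending:44",
    "recording:postprocess:7",
]
print(summarize_backlog(sample))
```

A summary like this makes it obvious whether the backlog is one tenant's webhooks or something broader, before you touch anything.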
Step 1 — Quiesce the producer if you can
If new jobs keep arriving while you investigate, the queue keeps growing. For tenant-driven workloads (webhooks, exports) you usually can't stop the source, so focus on the consumer. For internal workloads, you can pause the producer service temporarily.

Step 2 — Read before you replay
Look at a sample of the stuck jobs in Redis. Each Delphi background job carries a payload you can inspect without modifying the queue; print one and confirm:
- The payload looks well-formed.
- The target (URL, recording ID, user ID) still exists.
- The expected handler exists in the current platform version.
If any of those is wrong, stop and ask before replay — replaying a corrupted payload generally turns one bug into many.
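The first and third checks above can be mechanized. This is a sketch under assumptions: the payload fields (`job_type`, `target_id`, `enqueued_at`) are hypothetical, not the real Delphi schema, and the existence check for the target (URL, recording, user) is environment-specific and left out.

```python
import json

# Assumed payload schema for illustration only.
REQUIRED_FIELDS = {"job_type", "target_id", "enqueued_at"}

def payload_looks_sane(raw, known_handlers):
    """Run the mechanical parts of the Step 2 checklist on one raw
    payload. Returns (ok, reason) so a caller can stop on the first
    bad sample instead of replaying it."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "payload is not valid JSON"
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    if payload["job_type"] not in known_handlers:
        return False, f"no handler for {payload['job_type']!r}"
    return True, "ok"
```

Running this over a sample of ten jobs takes seconds and catches the "handler was renamed in the last deploy" failure mode before a replay multiplies it.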
Step 3 — Drain or re-enqueue
For webhook backlogs, prefer the Ops service's own retry policy over manual replay. If the upstream is healthy now, items already in the queue will deliver. Manual replay is for jobs the worker has given up on or jobs that were never enqueued because Ops was down at the time.
The exact mechanism depends on the workflow:
- Webhooks — use the team admin's webhook retry UI if available; otherwise, replay from Redis with a small batch first.
- Post-call processing — confirm `RECORDING_ENABLED` and the recordings bucket are healthy before retriggering.
- Subscription / billing housekeeping — run after Stripe is confirmed reachable; replays that hit a 4xx from Stripe will not heal themselves.
Never replay a queue without a small-batch dry run first. Replay 1–10 items, observe the result end to end, and only then loosen the limit.
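The small-batch rule can be enforced in the replay loop itself. A minimal sketch, using a plain Python list as a stand-in for the Redis queue and a caller-supplied `handler` function; the real replay would pop from Redis and invoke the actual worker code.

```python
def replay_batch(queue, handler, limit=10):
    """Replay at most `limit` jobs, stopping at the first failure so a
    bad payload cannot fan out across the whole queue."""
    replayed = []
    while queue and len(replayed) < limit:
        job = queue.pop(0)
        try:
            handler(job)
        except Exception as exc:
            queue.insert(0, job)  # put the failed job back for inspection
            return replayed, f"stopped on {job!r}: {exc}"
        replayed.append(job)
    return replayed, "batch complete"
```

The key design choice is stop-on-first-failure with the failed job returned to the head of the queue: you keep the evidence, and the batch limit caps the blast radius while you observe the result end to end.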
Step 4 — Verify
- The queue length trends down at a rate consistent with the worker count.
- The downstream effect actually happens (a delivered webhook, a written recording, an updated billing record).
- Error rate in SigNoz returns to baseline.
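The first check above ("trends down at a rate consistent with the worker count") is easy to get wrong by glancing at two data points. A sketch of a stricter version, under assumptions: the queue-length samples would come from your Redis or SigNoz dashboards, and the per-worker throughput figure is something you estimate for the workflow, not a platform constant.

```python
def draining_as_expected(lengths, interval_s, workers, jobs_per_worker_s):
    """Given queue-length samples taken `interval_s` apart, check that
    the queue shrinks every interval and that the average drop is at
    least half of what the worker pool should sustain (a loose bound
    that tolerates slow jobs without masking a stall)."""
    expected = workers * jobs_per_worker_s * interval_s
    drops = [a - b for a, b in zip(lengths, lengths[1:])]
    return all(d > 0 for d in drops) and sum(drops) / len(drops) >= expected / 2
```

For example, with 4 workers each handling roughly 2 jobs/s, samples of `[1000, 700, 400, 100]` taken a minute apart pass, while a flat or wobbling series fails, which is your cue to look at the workers rather than wait longer.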
Step 5 — Document the cause
Once it's fixed, write a one-paragraph note in your operations log: when it started, the symptom, the root cause, and what you did. If the root cause was inside Delphi (not the upstream), file it via Getting help so we can decide whether it needs a code fix.
See also
- Ops operations — service reference.
- Database operations — Redis access.
- SigNoz monitoring — queue and worker dashboards.