Version: 0.9.14

For platform operators

This page is a decision tree that hands off to the canonical per-service runbooks. It does not duplicate the deep-dive content under Platform → Operations — it just gets you to the right place quickly.

First two checks

1. SigNoz: open Monitoring in SigNoz and look at the service-health overview. Most symptoms are visible there before you SSH anywhere.
2. Release: run `docker compose images` on each host and confirm every service is on the expected `ECR_TAG`. Mixed versions cause inconsistent, hard-to-trace symptoms. See Version sources.

If both are green and you're still hitting an issue, descend into the section that matches the symptom.
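The release check above can be partly scripted. The following is a minimal Python sketch, not part of the platform tooling. It assumes that `docker compose images --format json` is available on the host and prints a JSON array of objects carrying `ContainerName` and `Tag` fields; verify that shape against your Compose version. The function names `find_tag_mismatches` and `load_compose_images` are hypothetical.

```python
import json
import subprocess


def find_tag_mismatches(images, expected_tag):
    """Return (container_name, tag) pairs for containers not on expected_tag.

    `images` is a list of dicts. The "ContainerName" and "Tag" keys are an
    assumption about the JSON emitted by `docker compose images`.
    """
    mismatches = []
    for image in images:
        name = image.get("ContainerName", "<unknown>")
        tag = image.get("Tag", "<none>")
        if tag != expected_tag:
            mismatches.append((name, tag))
    return mismatches


def load_compose_images():
    """Run `docker compose images` in the current directory and parse it.

    Assumes the command is run from the directory holding the compose file.
    """
    result = subprocess.run(
        ["docker", "compose", "images", "--format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

On a host, something like `find_tag_mismatches(load_compose_images(), os.environ["ECR_TAG"])` would list the containers that are off the expected release, assuming `ECR_TAG` is exported in that shell. An empty list means the host is consistent; it says nothing about other hosts, so the check still has to run on each one.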

Container / process

Symptom	Likely cause	Where to go
A container is `Restarting` / `Exit`	Crash on boot	Instance debugging → Logs section.
Healthcheck failing but container is up	Dependency unreachable	The service's own operations page (e.g. Web → Troubleshooting).
Healthy container but feature missing	Feature flag or env var	Feature flags + Environment variable index.
Container can't pull image	ECR login expired	Re-run the service `init.sh` to refresh ECR credentials.
Container OOM-killed	Sized too small for workload	Bump instance size, re-run `init.sh` to re-pull config, then restart.
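To decide which row of the table above applies, the `State` object reported by `docker inspect` is usually enough. The sketch below is illustrative only. It relies on the `Status`, `Restarting`, `OOMKilled`, and `Health.Status` fields of that object, while the function names and the returned labels are hypothetical and simply mirror the table rows.

```python
import json
import subprocess


def triage_container(state):
    """Map a `docker inspect` State object to a row of the symptom table.

    Returns one of: "oom_killed", "restarting_or_exited",
    "healthcheck_failing", "up", "unknown". The labels are illustrative.
    """
    # Check OOMKilled first: an OOM-killed container also shows as exited.
    if state.get("OOMKilled"):
        return "oom_killed"
    status = state.get("Status")
    if state.get("Restarting") or status in ("restarting", "exited"):
        return "restarting_or_exited"
    # "Health" is only present when the image or compose file defines a healthcheck.
    health = (state.get("Health") or {}).get("Status")
    if status == "running" and health == "unhealthy":
        return "healthcheck_failing"
    if status == "running":
        return "up"
    return "unknown"


def inspect_state(container_name):
    """Fetch the State object for one container via the Docker CLI."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State}}", container_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

A result of `"up"` with a missing feature corresponds to the feature-flag row; this sketch cannot detect that case, nor an image pull failure, because in that case the container is never created.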

Voice / SIP / WebRTC

Symptom	Where to go
Inbound calls aren't reaching us at all	TelPro operations — Kamailio, dispatcher, carrier trunk.
Calls ring but no AI	Voice operations → Troubleshooting — TelPhi ARI, OpenAI / provider connectivity, Squid.
WebRTC test call won't connect	Voice operations + check `WEBRTC_ENABLED`, Janus/TURN.
Asterisk won't start (90 s timeout)	Voice operations → Troubleshooting.
Recordings missing	Voice operations — `RECORDING_ENABLED`, `RECORDING_BUCKET`, IAM.

Database / Redis

Symptom	Where to go
API service can't connect to Postgres	Database operations + Managed database secrets.
Migration failed during `update.sh`	Init and update, Ops migrations, then Recovery → Redeploy a service.
Redis-bound features misbehaving (locks, queues, dispatchers)	Database operations — Redis section.
Sudden read errors after a `vars.yaml` change	Environment variable resolution — check what actually got resolved.

Configuration / secrets

Symptom	Where to go
Service ignores my `vars.yaml` change	You probably didn't re-run `update.sh`. See Init and update.
Service crashed after a secret rotation	Recovery → Rotate secrets.
You don't know where a var resolves from	Environment variable resolution + Environment variable index.
You don't know what value is actually in the container	Instance debugging → Environment section.
AWS access denied for SSM / SM	AWS security setup — IAM policies for hosts.
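When the question is what value actually reached the container, comparing the expected variables with the container's environment narrows things down quickly. The sketch below is a hedged illustration, not a replacement for Instance debugging → Environment section. It assumes the input is the list of `KEY=VALUE` strings that `docker inspect --format '{{json .Config.Env}}'` returns; the helper names and the example bucket value are hypothetical. It reports variable names only, so secret values are not printed.

```python
def parse_env(entries):
    """Turn a list of "KEY=VALUE" strings into a dict.

    Splits on the first "=" only, so values containing "=" stay intact.
    """
    env = {}
    for entry in entries:
        key, _, value = entry.partition("=")
        env[key] = value
    return env


def diff_env(expected, actual):
    """Compare expected variables against the container environment.

    Returns variable names only (never values) to avoid leaking secrets
    into terminals or tickets.
    """
    missing = sorted(k for k in expected if k not in actual)
    different = sorted(
        k for k in expected if k in actual and actual[k] != expected[k]
    )
    return {"missing": missing, "different": different}


# Example using variable names mentioned on this page; the values are made up.
actual = parse_env(["WEBRTC_ENABLED=false", "RECORDING_ENABLED=true"])
expected = {
    "WEBRTC_ENABLED": "true",
    "RECORDING_ENABLED": "true",
    "RECORDING_BUCKET": "example-bucket",
}
report = diff_env(expected, actual)
```

A variable listed under `different` after a `vars.yaml` change often means `update.sh` was not re-run; a variable under `missing` points at resolution, so Environment variable resolution is the next stop.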

Observability

Symptom	Where to go
Telemetry stopped flowing	SigNoz monitoring — collector, network, ingest.
Traces missing for a specific service	SigNoz monitoring — confirm the OTel collector container on that host is healthy.
Logs in SigNoz but no spans	`log-to-span` container on Voice converts TelSys structured logs to spans — see Voice operations.

Networking / proxy

Symptom	Where to go
Outbound to providers (OpenAI etc.) failing	Squid operations — `HTTP_PROXY` / `HTTPS_PROXY`.
TLS errors from a container to internal media	Check `TTS_MEDIA_CACHE_CA_BUNDLE` per Voice operations. For a platform-wide private CA, see Internal encryption.
Cross-host private IP unreachable	Provider-specific; verify security group / firewall rules from your IaC.

"I don't know which service"

If you can't even narrow it to a service:

1. Reproduce while watching SigNoz traces. The first span that errors usually points at the service.
2. If traces never reach SigNoz, drop to Instance debugging on the host that owns the inbound entry point (Web for UI symptoms, TelPro for SIP, API for SDK/REST).
3. Still stuck? File a ticket via Getting help. Include the trace ID (or note that no trace was produced) and a 5-minute time window around the failure.
