
For platform operators

platform v0.9.11, verified 2026-05-14

This page is a decision tree that hands off to the canonical per-service runbooks. It does not duplicate the deep-dive content under Platform → Operations — it just gets you to the right place quickly.

First two checks

  1. SigNoz — open Monitoring in SigNoz and look at the service-health overview. Most symptoms are visible there before you SSH anywhere.
  2. Release — confirm every host is on the expected ECR_TAG by running docker compose images on each one. Mixed versions across hosts produce symptoms that are hard to attribute to a single service. See Version sources.

If both are green and you're still hitting an issue, descend.
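
The release check above boils down to comparing the image tags seen on each host against the expected ECR_TAG. Below is a minimal Python sketch of that comparison. It assumes you have already collected the tags per host (for example by reading the docker compose images output by hand); the function name, host names, and tag values are hypothetical and not part of the platform tooling.

```python
def find_version_drift(tags_by_host, expected_tag):
    """Return {host: [unexpected tags]} for hosts running anything other than expected_tag."""
    drift = {}
    for host, tags in tags_by_host.items():
        # Anything that is not the expected tag counts as drift on that host.
        unexpected = sorted(set(tags) - {expected_tag})
        if unexpected:
            drift[host] = unexpected
    return drift


if __name__ == "__main__":
    # Hypothetical observations: one entry per image tag seen on each host.
    observed = {
        "web-1": ["1.2.3", "1.2.3"],
        "voice-1": ["1.2.3", "1.2.2"],
    }
    print(find_version_drift(observed, "1.2.3"))
```

An empty result means every host reports only the expected tag; any host listed in the result is worth redeploying before you debug further.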

Container / process

| Symptom | Likely cause | Where to go |
| --- | --- | --- |
| A container is Restarting / Exited | Crash on boot | Instance debugging → Logs section. |
| Healthcheck failing but container is up | Dependency unreachable | The service's own operations page (e.g. Web → Troubleshooting). |
| Healthy container but feature missing | Feature flag or env var | Feature flags + Environment variable index. |
| Container can't pull image | ECR login expired | Re-run the service init.sh to refresh ECR credentials. |
| Container OOM-killed | Sized too small for workload | Bump instance size, re-run init.sh to re-pull config, then restart. |
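
The first rows of this table can be read straight off the State object that docker inspect prints for a container. The sketch below maps that object to a table row. The field names (Status, Restarting, OOMKilled, Health.Status) are the ones docker inspect reports; the function name, the returned strings, and the sample JSON are illustrative assumptions.

```python
import json


def triage_container(state):
    """Map a docker inspect "State" object to a row of the Container / process table."""
    status = state.get("Status")
    # Health is absent when the image defines no healthcheck.
    health = (state.get("Health") or {}).get("Status")

    if state.get("OOMKilled"):
        return "oom-killed: bump instance size, re-run init.sh, then restart"
    if state.get("Restarting") or status in ("restarting", "exited"):
        return "crash on boot: Instance debugging, Logs section"
    if status == "running" and health == "unhealthy":
        return "dependency unreachable: the service's own operations page"
    if status == "running" and health in (None, "healthy"):
        return "container up: check feature flags and environment variables"
    return "unclassified: inspect manually"


if __name__ == "__main__":
    # Hypothetical, trimmed sample of what `docker inspect <container>` prints.
    sample = '[{"State": {"Status": "exited", "OOMKilled": true, "ExitCode": 137}}]'
    print(triage_container(json.loads(sample)[0]["State"]))
```

OOMKilled is checked first because an OOM-killed container also reports an exited status, and the remedy differs from a plain crash on boot.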

Voice / SIP / WebRTC

| Symptom | Where to go |
| --- | --- |
| Inbound calls aren't reaching us at all | TelPro operations: Kamailio, dispatcher, carrier trunk. |
| Calls ring but no AI | Voice operations → Troubleshooting: TelPhi ARI, OpenAI / provider connectivity, Squid. |
| WebRTC test call won't connect | Voice operations + check WEBRTC_ENABLED, Janus/TURN. |
| Asterisk won't start (90 s timeout) | Voice operations → Troubleshooting. |
| Recordings missing | Voice operations: RECORDING_ENABLED, RECORDING_BUCKET, IAM. |

Database / Redis

| Symptom | Where to go |
| --- | --- |
| API service can't connect to Postgres | Database operations + Managed database secrets. |
| Migration failed during update.sh | Web operations → migration section, then Recovery → Redeploy a service. |
| Redis-bound features misbehaving (locks, queues, dispatchers) | Database operations: Redis section. |
| Sudden read errors after a vars.yaml change | Environment variable resolution: check what actually got resolved. |

Configuration / secrets

| Symptom | Where to go |
| --- | --- |
| Service ignores my vars.yaml change | You probably didn't re-run update.sh. See Init and update. |
| Service crashed after a secret rotation | Recovery → Rotate secrets. |
| You don't know where a var resolves from | Environment variable resolution + Environment variable index. |
| You don't know what value is actually in the container | Instance debugging → Environment section. |
| AWS access denied for SSM / SM | AWS security setup: IAM policies for hosts. |
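
When a service appears to ignore a vars.yaml change, it helps to compare the values you expect with the environment the container was actually created with. docker inspect exposes that environment as a list of KEY=VALUE strings under Config.Env. The sketch below diffs the two. It assumes you supply the expected values as a flat dictionary, since how vars.yaml resolves is covered elsewhere; the function names and sample values are hypothetical.

```python
def parse_env(env_list):
    """Turn ["KEY=VALUE", ...] (the shape of Config.Env) into a dict."""
    env = {}
    for entry in env_list:
        # Split on the first "=" only; values may themselves contain "=".
        key, _, value = entry.partition("=")
        env[key] = value
    return env


def diff_env(expected, actual):
    """Return {key: (expected_value, actual_value_or_None)} for every mismatch."""
    return {
        key: (value, actual.get(key))
        for key, value in expected.items()
        if actual.get(key) != value
    }


if __name__ == "__main__":
    # Hypothetical values; in practice the list would come from docker inspect.
    container_env = parse_env(["WEBRTC_ENABLED=false", "RECORDING_ENABLED=true"])
    expected = {"WEBRTC_ENABLED": "true", "RECORDING_ENABLED": "true"}
    for key, (want, got) in diff_env(expected, container_env).items():
        print(f"{key}: expected {want!r}, container has {got!r}")
```

A mismatch here usually means the container predates the change and update.sh still needs to be re-run. Avoid running this over secret values in a shared terminal, since the diff prints them.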

Observability

| Symptom | Where to go |
| --- | --- |
| Telemetry stopped flowing | SigNoz monitoring: collector, network, ingest. |
| Traces missing for a specific service | SigNoz monitoring: confirm the OTel collector container on that host is healthy. |
| Logs in SigNoz but no spans | The log-to-span container on Voice converts TelSys structured logs to spans; see Voice operations. |

Networking / proxy

| Symptom | Where to go |
| --- | --- |
| Outbound to providers (OpenAI etc.) failing | Squid operations: HTTP_PROXY / HTTPS_PROXY. |
| TLS errors from a container to internal media | Check TTS_MEDIA_CACHE_CA_BUNDLE per Voice operations. For platform-wide private CA, see Internal encryption. |
| Cross-host private IP unreachable | Provider-specific; verify security group / firewall rules from your IaC. |

"I don't know which service"

If you can't even narrow it to a service:

  1. Reproduce while watching SigNoz traces — the first span that errors usually points at the service.
  2. If traces never reach SigNoz, drop to Instance debugging on the host that owns the inbound entry point (Web for UI symptoms, TelPro for SIP, API for SDK/REST).
  3. Still stuck? File a ticket via Getting help with the trace ID (or its absence) and a 5-minute window.
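
Step 1 above relies on finding the earliest span that errored in a trace. If you export a trace as a list of spans, that lookup can be sketched as follows. The span fields used here (service, start_ns, has_error) are hypothetical placeholders and do not reflect SigNoz's actual export schema; adapt them to whatever your export contains.

```python
def first_error_span(spans):
    """Return the errored span with the earliest start time, or None if none errored."""
    errored = [span for span in spans if span.get("has_error")]
    return min(errored, key=lambda span: span["start_ns"], default=None)


if __name__ == "__main__":
    # Hypothetical trace: the API span fails first, the Web span fails as a consequence.
    trace = [
        {"service": "web", "start_ns": 100, "has_error": True},
        {"service": "api", "start_ns": 40, "has_error": True},
        {"service": "voice", "start_ns": 10, "has_error": False},
    ]
    culprit = first_error_span(trace)
    print(culprit["service"] if culprit else "no errored spans in this trace")
```

A None result for a reproduced failure is itself useful: it suggests the failing request never produced spans, which is the case step 2 covers.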