platform v0.9.11 · verified 2026-05-14
This page is a decision tree that hands off to the canonical per-service runbooks. It does not duplicate the deep-dive content under Platform → Operations — it just gets you to the right place quickly.
First two checks
- SigNoz — open Monitoring in SigNoz and look at the service-health overview. Most symptoms are visible there before you SSH anywhere.
- Release — confirm everything is on the expected ECR_TAG with docker compose images on each host. Mixed versions cause weird symptoms. See Version sources.
If both are green and you're still hitting an issue, descend.
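The release check can be scripted rather than eyeballed. The sketch below is a minimal illustration, not part of the platform tooling: it assumes you have captured the output of `docker compose images --format json` on a host, and that each record carries `Tag` and `ContainerName` keys. The function name `find_mixed_tags` is hypothetical, and the exact JSON shape may differ between Compose versions, so verify it against your own output first.

```python
import json
from collections import defaultdict


def find_mixed_tags(images_json: str, expected_tag: str) -> dict:
    """Return {tag: [container names]} for every tag that differs from expected_tag."""
    text = images_json.strip()
    if not text:
        return {}
    # Assumption: the output is either one JSON array or one JSON object per line.
    if text.startswith("["):
        records = json.loads(text)
    else:
        records = [json.loads(line) for line in text.splitlines() if line.strip()]

    mismatched = defaultdict(list)
    for rec in records:
        # "Tag" and "ContainerName" are assumed key names; adjust if yours differ.
        tag = rec.get("Tag", "")
        if tag != expected_tag:
            mismatched[tag].append(rec.get("ContainerName", "<unknown>"))
    return dict(mismatched)
```

An empty result means every image on that host matches the expected tag. Third-party images that are not versioned by ECR_TAG would also show up as mismatches, so filter those out before treating the result as a failure.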
Container / process
| Symptom | Likely cause | Where to go |
|---|---|---|
| A container is Restarting / Exit | Crash on boot | Instance debugging → Logs section. |
| Healthcheck failing but container is up | Dependency unreachable | The service's own operations page (e.g. Web → Troubleshooting). |
| Healthy container but feature missing | Feature flag or env var | Feature flags + Environment variable index. |
| Container can't pull image | ECR login expired | Re-run the service init.sh to refresh ECR credentials. |
| Container OOM-killed | Sized too small for workload | Bump instance size, re-run init.sh to re-pull config, then restart. |
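To decide which row of the table applies, it can help to read the container state directly instead of inferring it from symptoms. The following is a rough sketch that classifies the JSON printed by `docker inspect <container>`. The `State` fields it reads (`OOMKilled`, `Restarting`, `Status`, `Running`, `Health.Status`) are standard `docker inspect` output; the function name and the returned labels are hypothetical and exist only for this illustration.

```python
import json


def classify_container(inspect_json: str) -> str:
    """Map one container's `docker inspect` output to a symptom row in the table."""
    data = json.loads(inspect_json)
    # `docker inspect` prints a JSON array, one element per inspected object.
    state = data[0]["State"]

    # Check OOM first: an OOM-killed container also shows up as exited.
    if state.get("OOMKilled"):
        return "oom-killed"
    if state.get("Restarting") or state.get("Status") == "exited":
        return "restarting-or-exited"
    # "Health" is only present when the image or compose file defines a healthcheck.
    health = (state.get("Health") or {}).get("Status")
    if state.get("Running") and health == "unhealthy":
        return "healthcheck-failing"
    return "no-match"
```

A `no-match` result covers the remaining rows (missing feature, image pull failure), which are not visible in container state and need the linked pages instead.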
Voice / SIP / WebRTC
Database / Redis
Configuration / secrets
Observability
| Symptom | Where to go |
|---|---|
| Telemetry stopped flowing | SigNoz monitoring — collector, network, ingest. |
| Traces missing for a specific service | SigNoz monitoring — confirm the OTel collector container on that host is healthy. |
| Logs in SigNoz but no spans | log-to-span container on Voice converts TelSys structured logs to spans — see Voice operations. |
Networking / proxy
| Symptom | Where to go |
|---|---|
| Outbound to providers (OpenAI etc.) failing | Squid operations — HTTP_PROXY / HTTPS_PROXY. |
| TLS errors from a container to internal media | Check TTS_MEDIA_CACHE_CA_BUNDLE per Voice operations. For platform-wide private CA, see Internal encryption. |
| Cross-host private IP unreachable | Provider-specific; verify security group / firewall rules from your IaC. |
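For the outbound-provider row, a quick first check is whether the proxy variables are set at all in the failing process's environment. This is a minimal sketch under that assumption; the helper name `missing_proxy_vars` is hypothetical, and it only checks for presence, not whether the proxy is reachable or correctly configured (see Squid operations for that).

```python
import os


def missing_proxy_vars(env=None):
    """Return the proxy variables that are unset or empty in `env`."""
    env = os.environ if env is None else env
    missing = []
    for name in ("HTTP_PROXY", "HTTPS_PROXY"):
        # Many clients also honour the lowercase spelling, so accept either.
        if not (env.get(name) or env.get(name.lower())):
            missing.append(name)
    return missing
```

Run inside the affected container, an empty list means both variables are present; any name returned is a candidate cause for provider calls bypassing or failing to reach the proxy.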
"I don't know which service"
If you can't even narrow it to a service:
- Reproduce while watching SigNoz traces — the first span that errors usually points at the service.
- If traces never reach SigNoz, drop to Instance debugging on the host that owns the inbound entry point (Web for UI symptoms, TelPro for SIP, API for SDK/REST).
- Still stuck? File a ticket via Getting help with the trace ID (or its absence) and a 5-minute window.