Bootstrap and init
Once the instance is provisioned with the right bootstrap .env and a way to fetch the config bundle from S3, every service uses the same init.sh flow. The same flow runs on first boot and on every subsequent restart — it is idempotent.
The whole point of this flow is that secrets never land on disk. fetch-env.sh prints export VAR=value lines to stdout; init.sh evaluates the output into the shell, then docker compose up inherits the environment. There is no .env file in /opt/services/<role>/ — init.sh actively shreds any leftover one as a safety net.
What init.sh does, step by step
Every service ships its own init.sh (each one specialized for that role's needs), but they all follow the same outline. The lines below are taken from the API service's script; every other service is structurally similar.
1. Load the bootstrap environment
DEPLOYMENT_ENV="/opt/deployment/.env"
source "$DEPLOYMENT_ENV"
This brings ENVIRONMENT, HOSTNAME, NAMESPACE, ECR_REGISTRY, ECR_TAG, CONFIG_BUCKET, CONFIG_REF, AWS credentials, and any proxy variables into the shell. None of these come from AWS — they were placed by cloud-init.
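Because every later step depends on these values, a script could fail fast when one of them is missing. The following is a minimal sketch of that idea, not part of the shipped init.sh: `require_vars` is a hypothetical helper, and the variable names in the usage comment are copied from the list above.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise. Arguments must be plain variable names.
require_vars() {
  local missing=0 name value
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (portable indirect expansion).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing bootstrap variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Possible use right after `source "$DEPLOYMENT_ENV"`:
# require_vars ENVIRONMENT NAMESPACE ECR_REGISTRY ECR_TAG CONFIG_BUCKET CONFIG_REF || exit 1
```

Reporting all missing names in one pass, rather than stopping at the first, saves a round trip when several bootstrap values were left out.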
2. Pull the latest config bundle from S3 (unless --skip-fetch)
if [ "$SKIP_FETCH" = false ] && command -v fetch-config &>/dev/null; then
  fetch-config
fi
fetch-config is a one-liner placed by cloud-init that runs aws s3 sync s3://${CONFIG_BUCKET}/deployments/<slug>/${CONFIG_REF}/<role>/ /opt/services/<role>/ plus the same for common/. After this step the host has the latest docker-compose.yaml, init.sh, vars.yaml, and any service-specific files (Caddyfile, otel-collector-config.yaml, …) for the configured CONFIG_REF.
If you are running update.sh --config-ref <new-ref> to roll a config change, this is the step that pulls the new bundle.
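To make the bundle layout concrete, the sketch below prints (without running) the two sync commands described above for one role. It is illustrative only: `print_fetch_config` is a hypothetical name, the slug and role are passed as arguments in place of the `<slug>` and `<role>` placeholders, and the `/opt/services/common/` destination is an assumption based on "the same for common/".

```shell
# Hypothetical dry-run helper: show the sync commands for a given slug and role.
# Reads CONFIG_BUCKET and CONFIG_REF from the environment, as the bootstrap .env provides them.
print_fetch_config() {
  local slug="$1" role="$2"
  local base="s3://${CONFIG_BUCKET}/deployments/${slug}/${CONFIG_REF}"
  printf 'aws s3 sync %s/%s/ /opt/services/%s/\n' "$base" "$role" "$role"
  # Assumed destination for the shared files.
  printf 'aws s3 sync %s/common/ /opt/services/common/\n' "$base"
}
```

Printing the commands first is a cheap way to confirm which CONFIG_REF a host would pull before anything on disk changes.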
3. Configure Docker log rotation (first run only)
if [ ! -f /etc/docker/daemon.json ] || ! grep -q max-size ...; then
  cat > /etc/docker/daemon.json <<DAEMONJSON
{ "log-driver": "json-file", "log-opts": { "max-size": "20m", "max-file": "3" } }
DAEMONJSON
  systemctl restart docker
fi
Docker logs ship to SigNoz over OTel; the local rotation just prevents disk fill if the OTel collector is briefly down.
4. Auto-detect PRIVATE_IP if not supplied
Some scripts (API, Voice) auto-detect PRIVATE_IP from the first 10.x interface and persist it to /opt/deployment/.env if it wasn't already set. This keeps the bootstrap .env minimal in cloud-init and lets the host figure out its own private IP at runtime.
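A minimal sketch of how such detection could work is shown below. Both function names are hypothetical, and the parsing assumes the one-line-per-address output of `ip -4 -o addr show`, where the fourth field is `address/prefix`; the real scripts may detect and persist the address differently.

```shell
# Read `ip -4 -o addr show` style lines on stdin and print the first address in 10.0.0.0/8.
first_10x_ip() {
  awk '{ split($4, a, "/"); if (a[1] ~ /^10\./) { print a[1]; exit } }'
}

# Append PRIVATE_IP=<ip> to the env file only when no PRIVATE_IP line exists yet,
# so repeated runs leave the file unchanged.
persist_private_ip() {
  local env_file="$1" ip="$2"
  grep -q '^PRIVATE_IP=' "$env_file" 2>/dev/null || printf 'PRIVATE_IP=%s\n' "$ip" >> "$env_file"
}

# Possible use on a host (not run here):
# persist_private_ip /opt/deployment/.env "$(ip -4 -o addr show | first_10x_ip)"
```

Guarding the append with `grep -q` is what keeps the step idempotent across restarts.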
5. Resolve env from SSM / Secrets Manager into memory
The single most important step:
COMMON_DIR="$(cd "$SERVICE_DIR/../common" && pwd)"
ENV_EXPORTS="$("$COMMON_DIR/fetch-env.sh" \
  --manifest "$SERVICE_DIR/vars.yaml" \
  --bootstrap "$DEPLOYMENT_ENV" \
  --format export)"
eval "$ENV_EXPORTS"
fetch-env.sh reads vars.yaml, fetches every declared variable from the right source (bootstrap, SSM, SM, external secret ARN, default, or compose.template), and prints export VAR=value lines to stdout. The eval lifts them into the shell that's about to run docker compose up.
The full resolution order, namespace conventions, and gotchas are in Environment resolution.
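The print-then-eval pattern only works if values survive shell parsing intact. The sketch below shows one way a producer could quote values so that spaces and quotes round-trip through `eval`. It illustrates the pattern and does not reproduce fetch-env.sh: `emit_export` and `DB_PASSWORD` are hypothetical names, and the real script's quoting may differ.

```shell
# Hypothetical producer: print one `export NAME='value'` line, safe to eval.
# Embedded single quotes are rewritten as '\'' so the value cannot break out of its quoting.
emit_export() {
  local name="$1" value="$2" escaped
  escaped="$(printf '%s' "$value" | sed "s/'/'\\\\''/g")"
  printf "export %s='%s'\n" "$name" "$escaped"
}

# Consumer side, mirroring the eval step above: nothing is written to disk.
ENV_EXPORTS="$(emit_export DB_PASSWORD "p@ss word's")"
eval "$ENV_EXPORTS"

# Exported variables are inherited by child processes, which is how a later
# `docker compose up` started from the same shell would see them.
sh -c 'printf "%s\n" "$DB_PASSWORD"'
```

Because the values live only in the shell's environment, they disappear when init.sh exits; the running containers keep the copies Compose handed them at start.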
6. Service-specific "magic"
This is the per-service variation. Each init.sh sandwiches one or more of these between step 5 and docker compose up:
| Service | Magic |
|---|---|
| API | prepare-tls.sh decodes INTERNAL_CA_CRT_B64 from SM into ./tls/ca.crt (mounted into the container), falling back to an empty placeholder file when the variable is not provided. |
| Web | Caddy auto-issues a Let's Encrypt certificate for DOMAIN_TELWEB. The TelWeb entrypoint runs prisma migrate deploy, then the baseline prisma db seed; operators run docker compose exec -it voiceai-telweb delphi-setup after the container is healthy (see First use). |
| Voice | Copies log-to-span binary into the right path; prepare-tls.sh for the optional Postgres/Redis CA bundle. |
| TelPro | Maps LOG_LEVEL → numeric RTP_LOG_LEVEL / DEBUG_LEVEL; resolves TLS via Certbot, SM-decoded PEM, manual files, or self-signed (in priority order). |
| Database | prepare-tls-server.sh decodes INTERNAL_TLS_CERT_B64 / INTERNAL_TLS_KEY_B64 / INTERNAL_CA_CRT_B64 from SM, sets ownership for postgres + pgbouncer. |
| Media | TLS resolution in three branches (MEDIA_TLS_*_B64 env-PEM → existing files → self-signed); writes tls/ca-for-clients.pem for downstream consumers to trust. |
| SigNoz | Clones the upstream SigNoz repo; converts named volumes to bind mounts on /mnt/signoz-data; configures redsocks-based transparent proxy via Squid. |
| Ops | Picks SMTP vs SES email transport from EMAIL_TRANSPORT; static AWS keys must be unset when AWS_USE_INSTANCE_PROFILE=true. |
| Squid | None — the container is configured entirely via squid.conf. |
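As an illustration of the API row, the sketch below decodes a base64-encoded PEM into a file and writes an empty placeholder when no value is supplied. It is an assumption-laden sketch rather than the contents of prepare-tls.sh: `write_ca_cert` is a hypothetical name, and `base64 -d` assumes GNU coreutils (or a compatible implementation).

```shell
# Hypothetical helper: decode a base64 PEM into $2, or create an empty placeholder.
write_ca_cert() {
  local b64="$1" out="$2"
  mkdir -p "$(dirname "$out")"
  if [ -n "$b64" ]; then
    printf '%s' "$b64" | base64 -d > "$out"
  else
    # Empty file so a bind mount of this path still has something to map.
    : > "$out"
  fi
}

# Possible use (not run here):
# write_ca_cert "${INTERNAL_CA_CRT_B64:-}" ./tls/ca.crt
```

Writing the placeholder matters because a bind mount whose source path is missing would otherwise be created as a directory by Docker.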
7. Log into the registry and pull images (unless --skip-images)
aws ecr get-login-password --region "${AWS_REGION:-eu-central-1}" | \
  docker login --username AWS --password-stdin "${ECR_REGISTRY}"
docker compose pull
8. Safety: shred any leftover .env
if [ -f "$SERVICE_DIR/.env" ]; then
  shred -u "$SERVICE_DIR/.env" 2>/dev/null || rm -f "$SERVICE_DIR/.env"
fi
This catches the case where an operator accidentally left a .env from manual debugging — the live secrets the containers are about to use are in the parent shell's environment, not on disk.
9. Start Compose
docker compose up -d --force-recreate --remove-orphans
The current shell (with all the eval'd exports) is what docker compose inherits. Compose then injects each variable into the right container according to the environment: and ${VAR} substitutions in docker-compose.yaml.
10. Healthcheck loop and image pruning
Each service waits for its primary container to report ready (/health/live, pg_isready, redis-cli ping, etc.), then prunes dangling and unused Docker images so disk usage stays bounded.
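A readiness wait of this kind can be sketched as a bounded retry loop. The helper below is hypothetical (`wait_until_ready` is not a name from the scripts), and the attempt count and delay in the usage comment are arbitrary example values.

```shell
# Hypothetical helper: run a readiness command until it succeeds
# or the attempt budget is exhausted.
# Usage: wait_until_ready <attempts> <delay-seconds> <command> [args...]
wait_until_ready() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Possible use (not run here; container name as used in the smoke-test table):
# wait_until_ready 30 2 docker exec voiceai-postgres pg_isready -U voiceai
```

Bounding the loop means a container that never becomes healthy surfaces as a non-zero exit instead of a hung script.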
Same flow, every restart
The same init.sh runs on every restart. Use the standard flags to skip steps that don't apply:
- --restart-only: skip fetch-config and image pulls; just re-resolve env and docker compose up. Use this after a value changes in SSM / Secrets Manager.
- --skip-images: fetch-config runs but no docker pull; useful when iterating on a docker-compose.yaml change in the bundle without rolling images.
- --skip-fetch: don't sync from S3; useful if you're testing local edits to /opt/services/<role>/ (be careful: the next update.sh will overwrite them).
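The relationship between the three flags can be sketched as a small parser, shown below under stated assumptions: `SKIP_FETCH` appears in the step 2 snippet, while `SKIP_IMAGES` and `parse_flags` are hypothetical names, and the real scripts may parse their arguments differently.

```shell
# Defaults: run every step.
SKIP_FETCH=false
SKIP_IMAGES=false

# Hypothetical parser: --restart-only is treated as shorthand for both skips.
parse_flags() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --skip-fetch)   SKIP_FETCH=true ;;
      --skip-images)  SKIP_IMAGES=true ;;
      --restart-only) SKIP_FETCH=true; SKIP_IMAGES=true ;;
      *) echo "unknown flag: $arg" >&2; return 2 ;;
    esac
  done
}

# Possible use at the top of a script:
# parse_flags "$@" || exit 2
```

Rejecting unknown flags avoids a mistyped option silently triggering a full fetch and pull.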
The deployment-manager script update.sh (shared across all services) is the recommended entry point in steady state — it updates /opt/deployment/.env (notably ECR_TAG / CONFIG_REF), prompts for confirmation, takes a backup, and then calls init.sh. Full rundown in init.sh and update.sh.
Smoke test the deployment
Once every service is up:
| Check | How |
|---|---|
| docker compose ps clean on every host | SSH via Bastion to each host; expect every container Up and any healthchecks healthy. |
| Postgres reachable | From the Database host: docker exec -it voiceai-postgres pg_isready -U voiceai. |
| Redis reachable | From the Database host: docker exec -it voiceai-redis redis-cli -a "$REDIS_PASSWORD" PING. |
| API healthy through the LB | curl https://${DOMAIN_API}/health should return 200. |
| Dashboard loads | Browse to https://${DOMAIN_TELWEB}; first load may take 30–40s while migrations + baseline seed run. |
| SigNoz UI loads | Browse to https://${DOMAIN_SIGNOZ}; you should see traces from every service. |
| OTel pipeline | On any service host: curl http://10.0.1.10:4318/v1/traces (4xx is fine — it proves reachability). |
| First call (with a SIP carrier) | Place a call to a DID routed to TelPro. Check voiceai-telphi logs for an OpenAI realtime session opening. |
| First call (WebRTC) | Open the WebRTC phone in TelWeb's call dialer; the connection should succeed in one click. |
If any step fails, the per-service operations pages have a Troubleshooting tab keyed on the symptom you are seeing.
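The two HTTP checks in the table have different pass criteria: the API check expects exactly 200, while the OTel check accepts any HTTP response. The sketch below encodes that distinction for a status code captured with curl's `-w '%{http_code}'` option, which reports 000 when no HTTP response was received. `check_status` and its mode names are hypothetical.

```shell
# Hypothetical helper: decide whether an HTTP status code passes a check.
# Modes: "healthy" requires 200; "reachable" accepts any real HTTP status.
check_status() {
  local mode="$1" code="$2"
  case "$mode:$code" in
    healthy:200) return 0 ;;
    reachable:000) return 1 ;;                  # no HTTP response at all
    reachable:[1-5][0-9][0-9]) return 0 ;;      # 4xx included: the endpoint answered
    *) return 1 ;;
  esac
}

# Possible use (not run here):
# check_status healthy "$(curl -s -o /dev/null -w '%{http_code}' "https://${DOMAIN_API}/health")"
# check_status reachable "$(curl -s -o /dev/null -w '%{http_code}' http://10.0.1.10:4318/v1/traces)"
```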
When TelWeb is Up (healthy), continue to First use on TelWeb (delphi-setup). After that, day-2 ops live in the per-service operations pages and the configuration section.