Ops service operations
The Ops service runs infrastructure management and background processing.
- Scaler: autoscales API and Voice instances based on utilization metrics. Provider-agnostic via per-provider shell scripts (scaleUp.sh, scaleDown.sh, scalingStatus.sh).
- Tasker: runs scheduled and queued background tasks (DB backups, maintenance, recording processing, email notifications) on a Redis-backed job queue.
Both components use Redis-based leader election with separate keys, so each component has exactly one active leader processing at a time.
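A minimal sketch of how Redis-based leader election of this kind is commonly built, assuming a SET NX PX lock that the current leader renews on every heartbeat. The in-memory FakeRedis class, the instance IDs, and the try_lead helper are illustrative assumptions rather than the service's actual code; the key name and timings are the Scaler defaults listed under Configuration.

```python
class FakeRedis:
    """In-memory stand-in for the few Redis operations the sketch needs."""

    def __init__(self):
        self.now_ms = 0      # simulated clock, advanced by the caller
        self._data = {}      # key -> (value, expires_at_ms)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is None or entry[1] <= self.now_ms:
            self._data.pop(key, None)   # drop expired entries lazily
            return None
        return entry[0]

    def get(self, key):
        return self._live(key)

    def set_nx_px(self, key, value, ttl_ms):
        """Like SET key value NX PX ttl_ms: succeeds only if the key is absent."""
        if self._live(key) is not None:
            return False
        self._data[key] = (value, self.now_ms + ttl_ms)
        return True

    def renew_if_owner(self, key, value, ttl_ms):
        """Extend the TTL only if we still hold the lock (a Lua script in real Redis)."""
        if self._live(key) != value:
            return False
        self._data[key] = (value, self.now_ms + ttl_ms)
        return True


LEADER_KEY = "voiceai:scaler:leader"   # LEADER_ELECTION_KEY default
LOCK_TTL_MS = 10_000                   # LEADER_LOCK_TTL_MS default
HEARTBEAT_MS = 5_000                   # LEADER_HEARTBEAT_INTERVAL_MS default


def try_lead(redis, instance_id):
    """Run once per heartbeat: renew if already leader, otherwise try to acquire."""
    if redis.renew_if_owner(LEADER_KEY, instance_id, LOCK_TTL_MS):
        return True
    return redis.set_nx_px(LEADER_KEY, instance_id, LOCK_TTL_MS)


redis = FakeRedis()
a_leads = try_lead(redis, "scaler-a")      # first caller acquires the lock
b_leads = try_lead(redis, "scaler-b")      # second caller stays a follower
redis.now_ms += HEARTBEAT_MS
a_renewed = try_lead(redis, "scaler-a")    # heartbeat renews before the TTL runs out
redis.now_ms += LOCK_TTL_MS                # scaler-a stops heartbeating, lock expires
b_takes_over = try_lead(redis, "scaler-b")
```

Because the Scaler and the Tasker use different keys, the same pattern gives each component its own leader independently of the other.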
- Overview
- Runbook
- Configuration
- Troubleshooting
Containers
| Container | Base | Purpose |
|---|---|---|
| voiceai-scaler | Node 24-alpine | Distributed scaling orchestrator |
| voiceai-tasker | Node 24-alpine | Background job + cron runner |
| voiceai-otel-collector | otel/opentelemetry-collector-contrib:0.150.1 | Telemetry collector |
Neither the Scaler nor the Tasker exposes an HTTP health endpoint. Their health is determined by container status and log output.
Scaler decision loop
- Every SCALING_EVALUATION_INTERVAL_MS (default 30s), the leader compares utilization metrics from Redis against scaleUpThreshold / scaleDownThreshold and the min/max limits from ServerGroup.scalingConfig in Postgres.
- On scale up, fetch optional bootstrap secrets from Secrets Manager (secretsName), generate a cloud-init via generate-cloud-init.sh, run scaleUp.sh.
- Poll scalingStatus.sh until ready, then wait for a service heartbeat in Redis (~5 min for cloud-init + container startup).
- Voice: append the instance to the Kamailio dispatcher set in Redis. API: the managed LB picks it up via labels.
- Cooldown: gates the next decision.
Scale-down reverses the flow: remove from routing, drain active calls (Voice waits up to 60 min), delete the server, clean Redis + Postgres state.
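The evaluation step of the loop can be sketched as a pure function. This is an illustration under assumptions, not the Scaler's implementation: only scaleUpThreshold and scaleDownThreshold are named in the text, so the remaining field names, the inclusive comparisons, and every numeric value below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScalingConfig:
    # Thresholds mirror scaleUpThreshold / scaleDownThreshold; the other
    # field names and all values used below are illustrative assumptions.
    scale_up_threshold: float
    scale_down_threshold: float
    min_instances: int
    max_instances: int
    cooldown_ms: int


def decide(utilization, instances, cfg, now_ms, last_action_ms):
    """Return 'up', 'down', or 'hold' for one evaluation tick."""
    # Cooldown gates the next decision after any scaling action.
    if last_action_ms is not None and now_ms - last_action_ms < cfg.cooldown_ms:
        return "hold"
    if utilization >= cfg.scale_up_threshold and instances < cfg.max_instances:
        return "up"
    if utilization <= cfg.scale_down_threshold and instances > cfg.min_instances:
        return "down"
    return "hold"


cfg = ScalingConfig(0.75, 0.25, min_instances=2, max_instances=6, cooldown_ms=300_000)

print(decide(0.90, 3, cfg, now_ms=1_000_000, last_action_ms=None))     # up
print(decide(0.90, 6, cfg, now_ms=1_000_000, last_action_ms=None))     # hold (already at max)
print(decide(0.10, 3, cfg, now_ms=1_000_000, last_action_ms=None))     # down
print(decide(0.10, 3, cfg, now_ms=1_000_000, last_action_ms=900_000))  # hold (in cooldown)
```

Keeping the decision free of side effects makes it easy to replay recorded metrics against a proposed threshold change before touching ServerGroup.scalingConfig.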
Email (SMTP or SES)
EMAIL_TRANSPORT selects between classic SMTP (default) and AWS SES on EC2.
SMTP: needs SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS. Outbound traffic goes through Squid.
SES on EC2:
EMAIL_TRANSPORT=ses
AWS_USE_INSTANCE_PROFILE=true
AWS_REGION=eu-central-1
EMAIL_SENDER_ADDRESS=noreply@yourdomain.tld
EMAIL_SENDER_NAME="Delphi"
SES IAM policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ses:SendEmail", "ses:SendRawEmail"],
      "Resource": "arn:aws:ses:<region>:<account-id>:identity/<verified-domain>"
    }
  ]
}
EC2 / IMDS prerequisites:
- IMDS hop limit = 2: the Docker bridge adds a hop, so the default of 1 stops the container from reaching 169.254.169.254:

      aws ec2 modify-instance-metadata-options \
        --instance-id i-xxxxxxxx \
        --http-put-response-hop-limit 2 \
        --http-tokens required

- Attach the instance role with the SES policy above.
- If running with HTTPS_PROXY / HTTP_PROXY, add 169.254.169.254 to NO_PROXY.
- SES account: verify sender domain (preferred) or address, enable DKIM, request production access if the account is sandbox-only.
Static AWS keys (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) must be unset on instance-profile deployments: the AWS SDK default credential chain checks environment variables before the instance profile, so any static keys present would be used instead of the IMDS role credentials.
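The email settings and prerequisites above lend themselves to a preflight check. The sketch below is a hypothetical helper, not something the Tasker is known to ship: check_email_env, its messages, the example key value, and the proxy URL are all assumptions, while the variable names and the rules it encodes come from this section.

```python
IMDS_ADDRESS = "169.254.169.254"


def check_email_env(env):
    """Return a list of problems; an empty list means the env looks consistent."""
    problems = []
    transport = env.get("EMAIL_TRANSPORT", "smtp")  # smtp is the documented default

    if transport == "smtp":
        for name in ("SMTP_HOST", "SMTP_PORT", "SMTP_USER", "SMTP_PASS"):
            if not env.get(name):
                problems.append(f"{name} is required for SMTP")
    elif transport == "ses":
        if not env.get("EMAIL_SENDER_ADDRESS"):
            problems.append("EMAIL_SENDER_ADDRESS is required for SES")
        if env.get("AWS_USE_INSTANCE_PROFILE") == "true":
            # Static keys would take precedence over the instance role.
            for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
                if env.get(name):
                    problems.append(f"{name} must be unset with an instance profile")
            # A proxy must not intercept requests to the metadata service.
            uses_proxy = env.get("HTTPS_PROXY") or env.get("HTTP_PROXY")
            no_proxy = [h.strip() for h in env.get("NO_PROXY", "").split(",")]
            if uses_proxy and IMDS_ADDRESS not in no_proxy:
                problems.append(f"NO_PROXY must include {IMDS_ADDRESS}")
    else:
        problems.append(f"unknown EMAIL_TRANSPORT: {transport}")
    return problems


ses_env = {
    "EMAIL_TRANSPORT": "ses",
    "AWS_USE_INSTANCE_PROFILE": "true",
    "AWS_REGION": "eu-central-1",
    "EMAIL_SENDER_ADDRESS": "noreply@yourdomain.tld",
    "AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",          # leftover static key (made-up value)
    "HTTPS_PROXY": "http://squid:3128",          # hypothetical proxy URL
    "NO_PROXY": "localhost,127.0.0.1,10.0.0.0/8",
}
for problem in check_email_env(ses_env):
    print(problem)
```

In a real container the function could be fed dict(os.environ) at startup. It does not cover the IMDS hop limit or the SES identity status, which can only be verified against AWS itself.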
Verification:
docker compose logs tasker | grep 'SES: Initialized'
docker compose logs tasker | grep NotificationService
Bootstrap variables for scaled instances
When the Scaler creates a new Voice / API instance it passes its own env to generate-cloud-init.sh. The values come from a mix of sources:
| Variable | Source |
|---|---|
| ENVIRONMENT | bootstrap (source: local) |
| ECR_REGISTRY, ECR_TAG | bootstrap |
| NAMESPACE, CONFIG_BUCKET, CONFIG_REF | SSM |
| BASTION_PUBLIC_KEY | SSM |
| AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION | bootstrap |
| HTTP_PROXY / HTTPS_PROXY | SSM |
Important: changing NAMESPACE / CONFIG_BUCKET / BASTION_PUBLIC_KEY in /opt/deployment/.env but not in SSM is a footgun, because the SSM value takes precedence once fetch-env.sh has run. Always update SSM.
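The precedence rule can be illustrated with a small drift check. This sketch assumes that the effective environment is a plain merge in which SSM values overwrite local ones; the helper names and all example values are hypothetical, and the list of SSM-managed variables is taken from the table above.

```python
SSM_MANAGED = (
    "NAMESPACE", "CONFIG_BUCKET", "CONFIG_REF",
    "BASTION_PUBLIC_KEY", "HTTP_PROXY", "HTTPS_PROXY",
)


def effective_env(local_env, ssm_values):
    """Merge the two sources; on conflict the SSM value wins."""
    return {**local_env, **ssm_values}


def find_drift(local_env, ssm_values):
    """List SSM-managed variables whose local value differs from the SSM one."""
    return sorted(
        name for name in SSM_MANAGED
        if name in local_env and name in ssm_values
        and local_env[name] != ssm_values[name]
    )


# Someone edited NAMESPACE locally but did not update SSM (made-up values).
local_env = {"ENVIRONMENT": "prod", "NAMESPACE": "voiceai-new", "CONFIG_BUCKET": "cfg"}
ssm_values = {"NAMESPACE": "voiceai", "CONFIG_BUCKET": "cfg"}

merged = effective_env(local_env, ssm_values)
print(merged["NAMESPACE"])                 # voiceai
print(find_drift(local_env, ssm_values))   # ['NAMESPACE']
```

A non-empty drift list is a sign that a local edit will not take effect as intended and that the SSM parameter should be updated instead.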
Scaler configuration
| Name | Source | Scope | Default | Description |
|---|---|---|---|---|
| SCALING_EVALUATION_INTERVAL_MS | SSM | all | 30000 | How often the leader evaluates scaling rules. |
| LEADER_ELECTION_KEY | SSM | all | voiceai:scaler:leader | Redis key for leader election. |
| LEADER_HEARTBEAT_INTERVAL_MS | SSM | all | 5000 | Leader heartbeat interval. |
| LEADER_LOCK_TTL_MS | SSM | all | 10000 | Leader lock TTL. |
| HETZNER_API_TOKEN | Secrets Manager | all | — | Provider API token (when scaling on Hetzner). |
| HTTP_PROXY | SSM | all | — | Squid proxy for outbound API calls. |
| NO_PROXY | SSM | all | localhost,127.0.0.1,10.0.0.0/8 | Proxy bypass list; add 169.254.169.254 for IMDS access. |
Tasker configuration
| Name | Source | Scope | Default | Description |
|---|---|---|---|---|
| WORKER_CONCURRENCY | SSM | all | 5 | Max concurrent job workers. |
| WORKER_POLL_INTERVAL_MS | SSM | all | 5000 | Worker poll interval. |
| SCHEDULER_POLL_INTERVAL_MS | SSM | all | 10000 | Scheduler poll interval. |
| TASKER_LEADER_ELECTION_KEY | SSM | all | voiceai:tasker:leader | Redis key for leader election. |
| S3_BUCKET | SSM | all | — | S3 bucket for database backups. |
| S3_PREFIX | SSM | all | database-dumps | Key prefix for backups. |
| EMAIL_TRANSPORT | SSM | all | smtp | Email transport: smtp or ses. |
| SMTP_HOST | SSM | all | — | SMTP host (when EMAIL_TRANSPORT=smtp). |
| SMTP_PORT | SSM | all | — | SMTP port. |
| SMTP_USER | Secrets Manager | all | — | SMTP username. |
| SMTP_PASS | Secrets Manager | all | — | SMTP password. |
| AWS_USE_INSTANCE_PROFILE | SSM | all | — | true on EC2 SES deployments. |
| AWS_REGION | SSM | all | eu-central-1 | AWS region for SES and other AWS clients. |
| EMAIL_SENDER_ADDRESS | SSM | all | — | From address (must be a verified SES identity). |
Scaler troubleshooting
| Symptom | Likely cause | Check |
|---|---|---|
| No scaling happening | Not running or not leader | GET voiceai:scaler:leader in Redis. |
| Scale-up fails | Provider API credentials missing or rate-limited | Scaler logs for API errors; verify HETZNER_API_TOKEN / AWS credentials. |
| New instance not healthy | Cloud-init failed | SSH via bastion; cat /var/log/cloud-init-output.log. |
| Scale-down too aggressive | Evaluation interval too short | Bump SCALING_EVALUATION_INTERVAL_MS. |
| SMTP / SES emails missing | Tasker side | Scaler enqueues only; check Tasker logs. |
Tasker troubleshooting
| Symptom | Likely cause | Check |
|---|---|---|
| Jobs not running | Not leader (for scheduled jobs) | GET voiceai:tasker:leader. |
| Jobs stuck | Worker concurrency or Redis | WORKER_CONCURRENCY; verify Redis connectivity. |
| SES 403 / MessageRejected | Sender unverified or IAM missing | Verify identity in SES console; check IAM policy. |
| SES CredentialsProviderError | IMDS hop limit / role | See SES section in Runbook. |
| SMTP failures | Squid blocking or auth wrong | Check Squid ACLs; rotate SMTP creds. |
See also
- Voice operations — scale target.
- API operations — scale target.
- Database operations — Tasker runs DB backups here.
- Squid operations — required for SMTP and provider API calls.