Ops service operations

platform v0.9.11 · verified 2026-05-14

The Ops service runs infrastructure management and background processing.

Scaler — autoscales API and Voice instances based on utilization metrics. Provider-agnostic via per-provider shell scripts (scaleUp.sh, scaleDown.sh, scalingStatus.sh).
Tasker — runs scheduled and queued background tasks (DB backups, maintenance, recording processing, email notifications) on a Redis-backed job queue.

Both components use Redis-based leader election with separate keys so only one active leader processes at a time.


Containers

Container	Base	Purpose
`voiceai-scaler`	Node 24-alpine	Distributed scaling orchestrator
`voiceai-tasker`	Node 24-alpine	Background job + cron runner
`voiceai-otel-collector`	`otel/opentelemetry-collector-contrib:0.150.1`	Telemetry collector

Neither the Scaler nor the Tasker exposes an HTTP health endpoint. Their health is determined by container status and log output.

Scaler decision loop

Every SCALING_EVALUATION_INTERVAL_MS (default 30 s), the leader compares utilization metrics from Redis against scaleUpThreshold / scaleDownThreshold and the min/max limits from ServerGroup.scalingConfig in Postgres.
On scale-up, it fetches optional bootstrap secrets from Secrets Manager (secretsName), generates a cloud-init configuration via generate-cloud-init.sh, and runs scaleUp.sh.
It polls scalingStatus.sh until the instance is ready, then waits for a service heartbeat in Redis (~5 min for cloud-init + container startup).
Voice instances are appended to the Kamailio dispatcher set in Redis; API instances are picked up by the managed LB via labels.
A cooldown then gates the next decision.
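The evaluation step above can be sketched as a small pure function. This is a minimal illustration, not the Scaler's actual code: the function name, the single utilization value, the inclusive comparisons, and the reading of min/max as instance counts are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    # Mirrors ServerGroup.scalingConfig; min/max are ASSUMED to be instance counts.
    scale_up_threshold: float
    scale_down_threshold: float
    min_instances: int
    max_instances: int


def decide(utilization: float, instances: int, cfg: ScalingConfig, in_cooldown: bool) -> str:
    """Return "scale_up", "scale_down", or "hold" for one evaluation tick."""
    if in_cooldown:
        return "hold"  # the cooldown gates every decision
    if utilization >= cfg.scale_up_threshold and instances < cfg.max_instances:
        return "scale_up"
    if utilization <= cfg.scale_down_threshold and instances > cfg.min_instances:
        return "scale_down"
    return "hold"
```

Keeping the decision free of I/O makes it easy to unit-test threshold changes before writing them to Postgres.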

Scale-down reverses the flow: remove from routing, drain active calls (Voice waits up to 60 min), delete the server, clean Redis + Postgres state.
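The drain step is essentially a bounded wait. The sketch below shows one way to express "wait for active calls to reach zero, but no longer than 60 minutes". The function name, the poll interval, and the choice to return False on timeout are assumptions; the source does not say what the Scaler does when the limit is reached.

```python
import time
from typing import Callable


def drain(
    active_calls: Callable[[], int],
    max_wait_s: float = 60 * 60,  # Voice drain limit from the text above
    poll_s: float = 5.0,          # ASSUMED poll interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once no calls remain, False if the deadline passes first."""
    deadline = clock() + max_wait_s
    while active_calls() > 0:
        if clock() >= deadline:
            return False  # caller decides whether to delete the server anyway
        sleep(poll_s)
    return True
```

The injectable clock and sleep exist only so the wait can be tested without real delays.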

Email (SMTP or SES)

EMAIL_TRANSPORT selects between classic SMTP (default) and AWS SES on EC2.

SMTP: needs SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS. Outbound traffic goes through Squid.

SES on EC2:

EMAIL_TRANSPORT=ses
AWS_USE_INSTANCE_PROFILE=true
AWS_REGION=eu-central-1
EMAIL_SENDER_ADDRESS=noreply@yourdomain.tld
EMAIL_SENDER_NAME="Delphi"

SES IAM policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ses:SendEmail", "ses:SendRawEmail"],
      "Resource": "arn:aws:ses:<region>:<account-id>:identity/<verified-domain>"
    }
  ]
}

EC2 / IMDS prerequisites:

Set the IMDS hop limit to 2. The Docker bridge adds a network hop, so the default limit of 1 stops the container from reaching 169.254.169.254:

aws ec2 modify-instance-metadata-options \
  --instance-id i-xxxxxxxx \
  --http-put-response-hop-limit 2 \
  --http-tokens required

Attach the instance role with the SES policy above.
If running with HTTPS_PROXY / HTTP_PROXY, add 169.254.169.254 to NO_PROXY.
SES account: verify sender domain (preferred) or address, enable DKIM, request production access if the account is sandbox-only.

Static AWS keys (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) must be unset on instance-profile deployments. The AWS SDK default credential chain checks environment variables before the instance profile, so static keys would silently override the instance role.

Verification:

docker compose logs tasker | grep 'SES: Initialized'
docker compose logs tasker | grep NotificationService
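The transport requirements in this section can also be checked before deployment. The following is a hedged sketch of such a pre-flight check, not part of the Tasker: the function name and messages are hypothetical, and treating EMAIL_SENDER_ADDRESS as mandatory for SES is an assumption based on the example above.

```python
from typing import List, Mapping

SMTP_REQUIRED = ("SMTP_HOST", "SMTP_PORT", "SMTP_USER", "SMTP_PASS")
STATIC_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
IMDS_ADDRESS = "169.254.169.254"


def check_email_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in the email-related variables."""
    problems: List[str] = []
    transport = env.get("EMAIL_TRANSPORT", "smtp")  # smtp is the documented default
    if transport == "smtp":
        problems += [f"{name} is not set" for name in SMTP_REQUIRED if not env.get(name)]
    elif transport == "ses":
        if not env.get("EMAIL_SENDER_ADDRESS"):
            problems.append("EMAIL_SENDER_ADDRESS is not set")
        if env.get("AWS_USE_INSTANCE_PROFILE") == "true":
            # Static keys would take precedence over the instance role.
            problems += [f"{name} must be unset" for name in STATIC_KEYS if env.get(name)]
            uses_proxy = bool(env.get("HTTP_PROXY") or env.get("HTTPS_PROXY"))
            bypass = [host.strip() for host in env.get("NO_PROXY", "").split(",")]
            if uses_proxy and IMDS_ADDRESS not in bypass:
                problems.append(f"NO_PROXY must include {IMDS_ADDRESS}")
    else:
        problems.append(f"unknown EMAIL_TRANSPORT: {transport}")
    return problems
```

Calling `check_email_env(dict(os.environ))` inside the container would report the same misconfigurations that otherwise surface later as SES credential errors.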

Bootstrap variables for scaled instances

When the Scaler creates a new Voice / API instance it passes its own env to generate-cloud-init.sh. The values come from a mix of sources:

Variable	Source
`ENVIRONMENT`	bootstrap (`source: local`)
`ECR_REGISTRY`, `ECR_TAG`	bootstrap
`NAMESPACE`, `CONFIG_BUCKET`, `CONFIG_REF`	SSM
`BASTION_PUBLIC_KEY`	SSM
`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`	bootstrap
`HTTP_PROXY` / `HTTPS_PROXY`	SSM

Important: changing NAMESPACE / CONFIG_BUCKET / BASTION_PUBLIC_KEY in /opt/deployment/.env but not in SSM does not stick, because the SSM value takes precedence once fetch-env.sh has run. Always update SSM.
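The precedence rule can be pictured as a dictionary merge in which SSM is applied last. This is only an illustration of the stated behaviour; how fetch-env.sh actually merges values is not described here, and both function names are hypothetical.

```python
from typing import Dict, List, Mapping


def effective_env(local: Mapping[str, str], ssm: Mapping[str, str]) -> Dict[str, str]:
    """Merge the local .env with SSM values; SSM wins on conflicts (assumed model)."""
    merged = dict(local)
    merged.update(ssm)
    return merged


def shadowed_edits(local: Mapping[str, str], ssm: Mapping[str, str]) -> List[str]:
    """Names whose local value differs from SSM and would therefore be overridden."""
    return sorted(name for name in local if name in ssm and local[name] != ssm[name])
```

A check like `shadowed_edits` highlights local edits that would be lost on the next fetch.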

Scaler configuration

Name	Source	Scope	Default	Description
`SCALING_EVALUATION_INTERVAL_MS`	SSM	all	`30000`	How often the leader evaluates scaling rules.
`LEADER_ELECTION_KEY`	SSM	all	`voiceai:scaler:leader`	Redis key for leader election.
`LEADER_HEARTBEAT_INTERVAL_MS`	SSM	all	`5000`	Leader heartbeat interval.
`LEADER_LOCK_TTL_MS`	SSM	all	`10000`	Leader lock TTL.
`HETZNER_API_TOKEN`	Secrets Manager	all	—	Provider API token (when scaling on Hetzner).
`HTTP_PROXY`	SSM	all	—	Squid proxy for outbound API calls.
`NO_PROXY`	SSM	all	`localhost,127.0.0.1,10.0.0.0/8`	Proxy bypass list; add 169.254.169.254 for IMDS access.
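The leader-election settings in the table imply a lock that is renewed every 5 s and expires after 10 s if the leader stops renewing it. The sketch below illustrates that pattern against an in-memory stand-in for Redis so that it runs without a server. The class, the method names, and the acquire-or-renew logic are assumptions, not the Scaler's implementation; in real Redis the two operations would correspond to `SET key value NX PX ttl` and an atomic compare-then-`PEXPIRE` script.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryStore:
    """Hypothetical stand-in for Redis with per-key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or self._clock() >= entry[1]:
            self._data.pop(key, None)  # expired keys vanish, as with a Redis TTL
            return None
        return entry[0]

    def set_if_absent(self, key: str, value: str, ttl_ms: int) -> bool:
        """Behaves like SET key value NX PX ttl_ms."""
        if self.get(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl_ms / 1000)
        return True

    def renew_if_owner(self, key: str, value: str, ttl_ms: int) -> bool:
        """Behaves like an atomic compare-then-PEXPIRE."""
        if self.get(key) != value:
            return False
        self._data[key] = (value, self._clock() + ttl_ms / 1000)
        return True


def heartbeat(
    store: InMemoryStore,
    instance_id: str,
    key: str = "voiceai:scaler:leader",  # LEADER_ELECTION_KEY default
    ttl_ms: int = 10_000,                # LEADER_LOCK_TTL_MS default
) -> bool:
    """Run once per LEADER_HEARTBEAT_INTERVAL_MS; True means this instance leads."""
    return store.renew_if_owner(key, instance_id, ttl_ms) or store.set_if_absent(key, instance_id, ttl_ms)
```

With a TTL of twice the heartbeat interval, a leader can miss one renewal without losing the lock, and a crashed leader is replaced within roughly 10 s.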

Tasker configuration

Name	Source	Scope	Default	Description
`WORKER_CONCURRENCY`	SSM	all	`5`	Max concurrent job workers.
`WORKER_POLL_INTERVAL_MS`	SSM	all	`5000`	Worker poll interval.
`SCHEDULER_POLL_INTERVAL_MS`	SSM	all	`10000`	Scheduler poll interval.
`TASKER_LEADER_ELECTION_KEY`	SSM	all	`voiceai:tasker:leader`	Redis key for leader election.
`S3_BUCKET`	SSM	all	—	S3 bucket for database backups.
`S3_PREFIX`	SSM	all	`database-dumps`	Key prefix for backups.
`EMAIL_TRANSPORT`	SSM	all	`smtp`	`smtp` or `ses`.
`SMTP_HOST`	SSM	all	—	SMTP host (when EMAIL_TRANSPORT=smtp).
`SMTP_PORT`	SSM	all	—	SMTP port.
`SMTP_USER`	Secrets Manager	all	—	SMTP username.
`SMTP_PASS`	Secrets Manager	all	—	SMTP password.
`AWS_USE_INSTANCE_PROFILE`	SSM	all	—	true on EC2 SES deployments.
`AWS_REGION`	SSM	all	`eu-central-1`	AWS region for SES and other AWS clients.
`EMAIL_SENDER_ADDRESS`	SSM	all	—	From address (must be a verified SES identity).
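As a quick reference, the worker and backup defaults from the table can be expressed as a typed loader. This is a sketch for illustration only: the class and function names are hypothetical, and the real Tasker (a Node service) may parse its environment differently.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TaskerConfig:
    worker_concurrency: int
    worker_poll_interval_ms: int
    scheduler_poll_interval_ms: int
    leader_election_key: str
    s3_bucket: Optional[str]  # no default in the table
    s3_prefix: str


def load_tasker_config(env: Mapping[str, str]) -> TaskerConfig:
    """Apply the documented defaults to any variable that is not set."""
    return TaskerConfig(
        worker_concurrency=int(env.get("WORKER_CONCURRENCY", "5")),
        worker_poll_interval_ms=int(env.get("WORKER_POLL_INTERVAL_MS", "5000")),
        scheduler_poll_interval_ms=int(env.get("SCHEDULER_POLL_INTERVAL_MS", "10000")),
        leader_election_key=env.get("TASKER_LEADER_ELECTION_KEY", "voiceai:tasker:leader"),
        s3_bucket=env.get("S3_BUCKET"),
        s3_prefix=env.get("S3_PREFIX", "database-dumps"),
    )
```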

Scaler troubleshooting

Symptom	Likely cause	Check
No scaling happening	Scaler not running or not the leader	`GET voiceai:scaler:leader` in Redis.
Scale-up fails	Provider API credentials missing or rate-limited	Scaler logs for API errors; verify `HETZNER_API_TOKEN` / AWS credentials.
New instance not healthy	Cloud-init failed	SSH via bastion; `cat /var/log/cloud-init-output.log`.
Scale-down too aggressive	Evaluation interval too short	Increase `SCALING_EVALUATION_INTERVAL_MS`.
SMTP / SES emails missing	Email is sent by the Tasker, not the Scaler	The Scaler only enqueues notifications; check Tasker logs.

Tasker troubleshooting

Symptom	Likely cause	Check
Jobs not running	Tasker is not the leader (scheduled jobs only)	`GET voiceai:tasker:leader` in Redis.
Jobs stuck	Worker concurrency exhausted or Redis unreachable	Check `WORKER_CONCURRENCY`; verify Redis connectivity.
SES `403 / MessageRejected`	Sender unverified or IAM permission missing	Verify identity in SES console; check IAM policy.
SES `CredentialsProviderError`	IMDS hop limit too low or instance role missing	See EC2 / IMDS prerequisites under Email (SMTP or SES).
SMTP failures	Squid blocking or wrong credentials	Check Squid ACLs; verify or rotate SMTP credentials.
