Compute & Orchestration¶
The compute layer is split into two planes: a data plane (ECS on EC2) that runs long-lived API servers and per-user worker processes, and a control plane (Lambda + SQS) that orchestrates worker lifecycle, background tasks, and maintenance. This separation means the orchestrator has no state to lose and the workers have no management overhead.
ECS Cluster¶
The cluster uses two capacity providers backed by separate Auto Scaling Groups, each optimized for its workload profile:
flowchart LR
subgraph cluster["ECS Cluster"]
subgraph apiCP["API Capacity Provider"]
API["m6i.large<br/>2 API tasks"]
end
subgraph workerCP["Worker Capacity Provider"]
W["r6i.large<br/>~30 workers/instance"]
end
end
subgraph control["Serverless Control Plane"]
SQS["SQS FIFO x2"] --> L["Lambda x5"]
EB["EventBridge"] --> L
end
L -->|"RunTask / Pool Claim"| W
Why Two Capacity Providers
API tasks need CPU headroom for request processing (compute-bound). Worker tasks need memory for broker SDK sessions but barely touch the CPU during idle periods (memory-bound). Mixing them on the same instance type wastes resources in both directions.
API Capacity Provider¶
| Parameter | Value |
|---|---|
| Instance Type | m6i.large (2 vCPU, 8 GB) |
| Tasks per Instance | 2 |
| CPU per Task | 1024 units (1 vCPU) |
| Memory per Task | 1536 MB soft / 3072 MB hard |
| Web Server | Gunicorn with 8 Uvicorn workers |
| Network Mode | bridge (dynamic host ports) |
| Deployment | Rolling update, circuit breaker enabled |
| Min Healthy % | 100% (zero-downtime deploys) |
| Max % | 200% (double capacity during deploy) |
Auto-Scaling Policies¶
The API service uses target-tracking scaling on two dimensions:
| Policy | Target | Scale-Out Cooldown | Scale-In Cooldown |
|---|---|---|---|
| CPU Utilization | 70% average | 60s | 300s |
| Request Count | 1000 req/min per target | 60s | 300s |
Scale-in is deliberately slow (300s cooldown) to avoid thrashing during intermittent traffic spikes around market open/close.
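The two policies can be sketched as Application Auto Scaling target-tracking configurations. This is an illustrative reconstruction from the table above — the policy names, cluster/service names, and ALB resource label format are assumptions, not taken from the real stack:

```python
def api_scaling_policies(cluster: str, service: str, alb_resource_label: str) -> list[dict]:
    """Build the two target-tracking policies described in the table above."""
    base = {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
    }
    cpu = dict(base, PolicyName="api-cpu-70", TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
        },
        "TargetValue": 70.0,       # 70% average CPU across tasks
        "ScaleOutCooldown": 60,    # react quickly to load
        "ScaleInCooldown": 300,    # shrink slowly to avoid thrash
    })
    req = dict(base, PolicyName="api-req-1000", TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": alb_resource_label,  # app/<alb>/<id>/targetgroup/<tg>/<id>
        },
        "TargetValue": 1000.0,     # requests per target per minute
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    })
    return [cpu, req]

# Applying them requires AWS credentials, e.g.:
# import boto3
# aas = boto3.client("application-autoscaling")
# for policy in api_scaling_policies("prod", "api", ALB_LABEL):
#     aas.put_scaling_policy(**policy)
```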
Rolling Deployment¶
Deployments use ECS rolling update with circuit breaker:
1. New task definition registered
2. ECS launches new tasks (up to 200% capacity)
3. ALB health checks confirm the new tasks are healthy
4. Old tasks drain connections (300s deregistration delay)
5. Old tasks stopped
If the new tasks fail health checks, the circuit breaker automatically rolls back to the previous task definition — no manual intervention required.
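The deployment parameters from the table map onto a single ECS `deploymentConfiguration` block. A minimal sketch (cluster and service names in the usage comment are placeholders):

```python
def api_deployment_configuration() -> dict:
    """ECS deploymentConfiguration for zero-downtime rolling updates."""
    return {
        "deploymentCircuitBreaker": {
            "enable": True,    # halt a deploy that cannot reach steady state
            "rollback": True,  # auto-roll back to the last working task definition
        },
        "minimumHealthyPercent": 100,  # never dip below current capacity
        "maximumPercent": 200,         # allow double capacity mid-deploy
    }

# import boto3
# ecs = boto3.client("ecs")
# ecs.update_service(cluster="prod", service="api",
#                    taskDefinition=new_task_def_arn,
#                    deploymentConfiguration=api_deployment_configuration())
```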
Worker Capacity Provider¶
| Parameter | Value |
|---|---|
| Instance Type | r6i.large (2 vCPU, 16 GB) |
| Tasks per Instance | 30 (conservative) |
| CPU per Task | 64 units |
| Memory Soft Limit | 384 MB |
| Memory Hard Limit | 1024 MB |
| Network Mode | bridge (shared host ENI) |
| Desired Count | Managed by Lambda orchestrator |
Capacity Math¶
Each r6i.large provides 2048 CPU units and 16,384 MB RAM:
| Resource | Available | Per Task | Max Tasks | Limiting? |
|---|---|---|---|---|
| CPU | 2048 units | 64 units | 2048 / 64 = 32 | No |
| Memory | 16,384 MB | 384 MB soft | 16384 / 384 = 42 | Yes (soft) |
| Memory (hard) | 16,384 MB | 1024 MB hard | 16384 / 1024 = 16 | Worst-case |
The target of 30 tasks per instance is conservative — it leaves headroom for:
- OS and ECS agent overhead (~512 MB)
- Temporary memory spikes during broker API calls
- Container runtime overhead
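The capacity math above, reproduced as arithmetic for one r6i.large (the 512 MB overhead figure comes from the bullet list):

```python
# Instance resources and per-task limits for an r6i.large.
CPU_UNITS, MEM_MB = 2048, 16_384
TASK_CPU, MEM_SOFT, MEM_HARD = 64, 384, 1024
OVERHEAD_MB = 512  # OS + ECS agent reservation

max_by_cpu = CPU_UNITS // TASK_CPU    # 32 -- not the limiting factor
max_by_soft = MEM_MB // MEM_SOFT      # 42 -- soft-limit ceiling from the table
worst_case = MEM_MB // MEM_HARD       # 16 -- if every task hit its hard limit
usable_soft = (MEM_MB - OVERHEAD_MB) // MEM_SOFT  # 41 after host overhead

TARGET = 30  # chosen density: safely under the CPU and soft-memory ceilings
assert TARGET < min(max_by_cpu, usable_soft)
```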
OOM Protection
If a worker exceeds its 1024 MB hard limit, the Linux OOM killer terminates only that container. The EC2 instance and all other workers continue unaffected. The maintenance Lambda detects the missing worker within 60 seconds and the orchestrator restarts it automatically.
Bridge Networking¶
Workers use bridge mode instead of awsvpc:
| Feature | awsvpc | bridge |
|---|---|---|
| ENI per task | Yes (1 each) | No (shared) |
| Max tasks (m/r large) | ~2 (3 ENIs, one reserved for the host) | 30+ |
| Per-task security group | Yes | No (host SG) |
| Port mapping | Static | Dynamic |
| Cost impact | ENI limits require more instances | High density, fewer instances |
The trade-off is acceptable because workers only need outbound internet to reach broker APIs. They don't receive inbound connections — all communication flows through Redis.
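An illustrative worker task definition fragment showing how this lands in ECS terms — bridge mode, a tight CPU share, the soft/hard memory pair, and no port mappings because workers accept no inbound connections. Names and structure here are a sketch, not the real task definition:

```python
worker_task_definition = {
    "family": "worker",
    "networkMode": "bridge",  # shared host ENI -> 30+ tasks per instance
    "containerDefinitions": [{
        "name": "worker",
        "cpu": 64,                 # CPU units (1/32 of the instance)
        "memoryReservation": 384,  # soft limit (MB): scheduler packing target
        "memory": 1024,            # hard limit (MB): OOM-kill threshold
        "portMappings": [],        # no inbound traffic; all I/O via Redis / broker APIs
    }],
}
```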
Lambda Orchestrator¶
Five Lambda functions form the serverless control plane:
| Function | Trigger | Timeout | Memory | Concurrency | Purpose |
|---|---|---|---|---|---|
| worker_control | SQS FIFO (worker-control) | 60s | 256 MB | 50–500 | Start, stop, claim workers. Pool assignment and RunTask fallback. |
| order_tasks | SQS FIFO (order-tasks) | 120s | 256 MB | 50–500 | Background fill verification. Query broker for order status after execution. |
| maintenance | EventBridge (every 60s) | 300s | 256 MB | 1 | Fan-out coordinator. Scans Redis for all worker marks, partitions work, invokes maintenance_worker in parallel. |
| maintenance_worker | Lambda invoke (from maintenance) | 30s | 256 MB | 100 | Process individual orphan detection batch. Check ECS task status, clean up stale marks, stop orphan tasks. |
| pool_manager | EventBridge (every 5 min) | 60s | 256 MB | — | Count pool workers, compare to target, launch or terminate to match desired pool size. |
Orchestrator Flow¶
sequenceDiagram
participant API as FastAPI
participant SQS as worker-control.fifo
participant LC as λ worker_control
participant Redis as Valkey
participant Pool as Pool Worker
participant Claim as pool-claim Queue
participant ECS as ECS RunTask
API->>SQS: Send start_worker message
SQS->>LC: Trigger Lambda
LC->>Redis: GET worker:active:{user_id}
alt Worker already active
LC-->>SQS: Delete message (no-op)
else No active worker
LC->>Redis: Check pool workers
alt Pool has available worker
LC->>Claim: Send claim message (user_id, credentials)
Claim->>Pool: Pool worker receives claim
Pool->>Redis: SET worker:active:{user_id} (TTL 30s)
Pool->>Pool: Load credentials, connect to broker
Note over Pool: ~332ms to ready
else Pool empty
LC->>ECS: RunTask (worker task definition)
ECS->>ECS: Schedule on capacity provider
Note over ECS: ~3103ms to ready
end
end
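The decision flow in the diagram can be sketched as a single handler function. The Redis key matches the diagram; the pool set name, queue URL, cluster, capacity provider, and task definition names are assumptions for illustration:

```python
import json

CLUSTER = "trading"  # placeholder cluster name
POOL_CLAIM_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/pool-claim"

def handle_start(redis, sqs, ecs, user_id: str, credentials: dict) -> str:
    """Start a worker for user_id: no-op, warm pool claim, or cold RunTask."""
    if redis.get(f"worker:active:{user_id}"):
        return "noop"  # already running; the message is simply consumed

    # Prefer a warm pool worker (~332 ms to ready) over a cold RunTask (~3103 ms).
    if redis.spop("worker:pool:available"):  # assumed pool membership set
        sqs.send_message(
            QueueUrl=POOL_CLAIM_QUEUE_URL,
            MessageBody=json.dumps({"user_id": user_id, "credentials": credentials}),
        )
        return "claimed"

    # Pool empty: launch a dedicated task on the worker capacity provider.
    ecs.run_task(
        cluster=CLUSTER,
        taskDefinition="worker",
        capacityProviderStrategy=[{"capacityProvider": "worker-cp", "weight": 1}],
        overrides={"containerOverrides": [{
            "name": "worker",
            "environment": [{"name": "USER_ID", "value": user_id}],
        }]},
    )
    return "runtask"
```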
FIFO Guarantees
The worker-control.fifo queue uses user_id as the message group ID. This ensures that multiple start/stop commands for the same user are processed in order, preventing race conditions where a stop arrives before the start has completed.
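On the sending side, the per-user ordering guarantee comes entirely from the `MessageGroupId`. A hedged sketch of how the API might build the `send_message` call (queue URL and message shape are illustrative):

```python
import json
import uuid

def build_control_message(queue_url: str, user_id: str, command: str) -> dict:
    """kwargs for sqs.send_message(**...) against worker-control.fifo."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"command": command, "user_id": user_id}),
        # Same group id => strict ordering of this user's start/stop commands;
        # different users' messages are still processed in parallel.
        "MessageGroupId": user_id,
        # Explicit dedup id: the same logical command may legitimately repeat
        # within SQS's 5-minute deduplication window.
        "MessageDeduplicationId": str(uuid.uuid4()),
    }
```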
SQS Queues¶
| Queue | Type | Visibility Timeout | Retention | DLQ | DLQ Max Receives | Purpose |
|---|---|---|---|---|---|---|
| worker-control.fifo | FIFO | 90s | 1 day | worker-control-dlq.fifo | 3 | Worker lifecycle commands (start, stop, claim). Message group: user_id. |
| order-tasks.fifo | FIFO | 180s | 1 day | order-tasks-dlq.fifo | 3 | Fill verification, delayed order checks. Message group: order_id. |
| pool-claim | Standard | 10s | 5 minutes | — | — | One-shot claim messages for pool workers. Short retention because unclaimed messages are stale. |
Dead Letter Queues¶
Both FIFO queues have DLQs that catch messages failing after 3 processing attempts. Both Lambda handlers use ReportBatchItemFailures so that only the specific failing record is retried — successfully processed records in the same batch are not re-delivered and do not have their receive count inflated.
| DLQ | Retention | CloudWatch Alarm | Dashboard |
|---|---|---|---|
| worker-control-dlq.fifo | 14 days | {env}-orchestrator-dlq-has-messages (> 0) | Yes |
| order-tasks-dlq.fifo | 14 days | {env}-order-tasks-dlq-has-messages (> 0) | Yes |
Both alarms send to the orchestrator-alerts SNS topic (email notification). The CloudWatch dashboard shows both DLQ message counts side by side.
DLQ Messages Are Genuine Failures
With ReportBatchItemFailures, only messages that truly failed 3 consecutive times reach the DLQ — no false positives from batch contamination. Common causes: Redis connectivity loss, ECS capacity exhausted, broker API persistently timing out. Action: check the corresponding Lambda error in CloudWatch Logs, fix the root cause, then redrive messages from the DLQ back to the main queue.
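Redriving can also be done programmatically via SQS's DLQ redrive API, assuming the region and queue type support it (the ARN in the usage comment is a placeholder). Omitting `DestinationArn` returns each message to its original source queue:

```python
def redrive_request(dlq_arn: str, rate_per_second: int = 10) -> dict:
    """kwargs for sqs.start_message_move_task(**...)."""
    return {
        "SourceArn": dlq_arn,
        "MaxNumberOfMessagesPerSecond": rate_per_second,  # throttle the redrive
    }

# import boto3
# sqs = boto3.client("sqs")
# task = sqs.start_message_move_task(**redrive_request(
#     "arn:aws:sqs:us-east-1:123456789012:worker-control-dlq.fifo"))
```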