
Compute & Orchestration

The compute layer is split into two planes: a data plane (ECS on EC2) that runs long-lived API servers and per-user worker processes, and a control plane (Lambda + SQS) that orchestrates worker lifecycle, background tasks, and maintenance. This separation means the orchestrator has no state to lose and the workers have no management overhead.


ECS Cluster

The cluster uses two capacity providers backed by separate Auto Scaling Groups, each optimized for its workload profile:

```mermaid
flowchart LR
    subgraph cluster["ECS Cluster"]
        subgraph apiCP["API Capacity Provider"]
            API["m6i.large<br/>2 API tasks"]
        end
        subgraph workerCP["Worker Capacity Provider"]
            W["r6i.large<br/>~30 workers/instance"]
        end
    end

    subgraph control["Serverless Control Plane"]
        SQS["SQS FIFO x2"] --> L["Lambda x5"]
        EB["EventBridge"] --> L
    end

    L -->|"RunTask / Pool Claim"| W
```

Why Two Capacity Providers

API tasks need CPU headroom for request processing (compute-bound). Worker tasks need memory for broker SDK sessions but barely touch the CPU during idle periods (memory-bound). Mixing them on the same instance type wastes resources in both directions.


API Capacity Provider

| Parameter | Value |
|---|---|
| Instance Type | m6i.large (2 vCPU, 8 GB) |
| Tasks per Instance | 2 |
| CPU per Task | 1024 units (1 vCPU) |
| Memory per Task | 1536 MB soft / 3072 MB hard |
| Web Server | Gunicorn with 8 Uvicorn workers |
| Network Mode | bridge (dynamic host ports) |
| Deployment | Rolling update, circuit breaker enabled |
| Min Healthy % | 100% (zero-downtime deploys) |
| Max % | 200% (double capacity during deploy) |

Auto-Scaling Policies

The API service uses target-tracking scaling on two dimensions:

| Policy | Target | Scale-Out Cooldown | Scale-In Cooldown |
|---|---|---|---|
| CPU Utilization | 70% average | 60s | 300s |
| Request Count | 1000 req/min per target | 60s | 300s |

Scale-in is deliberately slow (300s cooldown) to avoid thrashing during intermittent traffic spikes around market open/close.
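These policies translate directly into Application Auto Scaling parameters. The sketch below builds the CPU policy as a `boto3` `put_scaling_policy` request; the cluster and service names are illustrative, not taken from this deployment.

```python
def cpu_tracking_policy(cluster: str, service: str) -> dict:
    """Build PutScalingPolicy parameters for 70% average CPU on an ECS service."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
            "ScaleOutCooldown": 60,   # react quickly to load
            "ScaleInCooldown": 300,   # shrink slowly to avoid thrashing
        },
    }

# Applying it (requires IAM permissions on application-autoscaling):
# boto3.client("application-autoscaling").put_scaling_policy(
#     **cpu_tracking_policy("prod-cluster", "api"))
```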

Rolling Deployment

Deployments use ECS rolling update with circuit breaker:

  1. New task definition registered
  2. ECS launches new tasks (up to 200% capacity)
  3. ALB health checks confirm new tasks are healthy
  4. Old tasks drain connections (300s deregistration delay)
  5. Old tasks stopped

If the new tasks fail health checks, the circuit breaker automatically rolls back to the previous task definition — no manual intervention required.
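The behavior above maps onto the ECS service `deploymentConfiguration`. A minimal sketch of those settings, matching the table's 100%/200% bounds:

```python
def deployment_config() -> dict:
    """ECS service deploymentConfiguration for zero-downtime rolling updates."""
    return {
        "minimumHealthyPercent": 100,  # never drop below full capacity
        "maximumPercent": 200,         # allow doubling while new tasks warm up
        "deploymentCircuitBreaker": {
            "enable": True,
            "rollback": True,          # auto-rollback on failed health checks
        },
    }

# Passed to update_service / create_service, e.g.:
# ecs.update_service(cluster="prod-cluster", service="api",
#                    deploymentConfiguration=deployment_config())
```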


Worker Capacity Provider

| Parameter | Value |
|---|---|
| Instance Type | r6i.large (2 vCPU, 16 GB) |
| Tasks per Instance | 30 (conservative) |
| CPU per Task | 64 units |
| Memory Soft Limit | 384 MB |
| Memory Hard Limit | 1024 MB |
| Network Mode | bridge (shared host ENI) |
| Desired Count | Managed by Lambda orchestrator |

Capacity Math

Each r6i.large provides 2048 CPU units and 16,384 MB RAM:

| Resource | Available | Per Task | Max Tasks | Limiting? |
|---|---|---|---|---|
| CPU | 2048 units | 64 units | 2048 / 64 = 32 | Yes |
| Memory (soft) | 16,384 MB | 384 MB | 16,384 / 384 = 42 | No |
| Memory (hard) | 16,384 MB | 1024 MB | 16,384 / 1024 = 16 | Worst-case only |

CPU is the binding constraint at 32 tasks; the hard memory limit would only bind if every worker spiked to its ceiling simultaneously.

The target of 30 tasks per instance is conservative — it leaves headroom for:

  • OS and ECS agent overhead (~512 MB)
  • Temporary memory spikes during broker API calls
  • Container runtime overhead
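The arithmetic behind the 30-task target can be checked directly. This sketch subtracts the assumed ~512 MB host overhead before dividing; the numbers come from the tables above.

```python
def max_tasks_per_instance(
    cpu_units: int = 2048,     # r6i.large: 2 vCPU = 2048 units
    mem_mb: int = 16_384,      # r6i.large: 16 GB
    task_cpu: int = 64,        # per-task CPU reservation
    task_mem_soft: int = 384,  # per-task soft memory limit
    reserved_mb: int = 512,    # OS + ECS agent headroom
) -> int:
    """Nominal per-instance task ceiling: the tighter of CPU and soft memory."""
    by_cpu = cpu_units // task_cpu                    # 32
    by_mem = (mem_mb - reserved_mb) // task_mem_soft  # 41
    return min(by_cpu, by_mem)
```

The result is 32, so the configured target of 30 sits just under the CPU-bound ceiling.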

OOM Protection

If a worker exceeds its 1024 MB hard limit, the Linux OOM killer terminates only that container. The EC2 instance and all other workers continue unaffected. The maintenance Lambda detects the missing worker within 60 seconds and the orchestrator restarts it automatically.
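The detection step reduces to a set difference between worker marks in Redis and the tasks ECS actually reports as running. A minimal sketch, with the `user_id -> task_id` mapping shape assumed for illustration rather than taken from the real Redis schema:

```python
def find_orphaned_marks(active_marks: dict, running_task_ids: set) -> list:
    """Return user IDs whose worker mark points at a task ECS no longer runs.

    active_marks: user_id -> ECS task ID, as scanned from Redis (shape assumed).
    running_task_ids: task IDs reported by ecs.list_tasks / describe_tasks.
    """
    return [
        user_id
        for user_id, task_id in active_marks.items()
        if task_id not in running_task_ids
    ]
```

Each orphaned user ID would then be handed back to the orchestrator to clean up the stale mark and restart the worker.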

Bridge Networking

Workers use bridge mode instead of awsvpc:

| Feature | awsvpc | bridge |
|---|---|---|
| ENI per task | Yes (1 each) | No (shared) |
| Max tasks (m/r large) | ~3 | 30+ |
| Per-task security group | Yes | No (host SG) |
| Port mapping | Static | Dynamic |
| Cost impact | ENI limits require more instances | High density, fewer instances |

The trade-off is acceptable because workers only need outbound internet to reach broker APIs. They don't receive inbound connections — all communication flows through Redis.
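In task-definition terms, this shows up as a bridge-mode container with no port mappings at all. A sketch of the worker container fragment, with the image name and resource figures taken from the tables above (the container name is illustrative):

```python
def worker_container_definition(image: str) -> dict:
    """Container definition fragment for an outbound-only bridge-mode worker."""
    return {
        "name": "worker",
        "image": image,
        "cpu": 64,                 # CPU units (1024 units = 1 vCPU)
        "memoryReservation": 384,  # soft limit: what the scheduler reserves
        "memory": 1024,            # hard limit: the OOM-kill ceiling
        "essential": True,
        # No portMappings: workers make outbound calls to broker APIs and
        # Redis only; nothing connects in.
    }
```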


Lambda Orchestrator

Five Lambda functions form the serverless control plane:

| Function | Trigger | Timeout | Memory | Concurrency | Purpose |
|---|---|---|---|---|---|
| worker_control | SQS FIFO (worker-control) | 60s | 256 MB | 50–500 | Start, stop, claim workers. Pool assignment and RunTask fallback. |
| order_tasks | SQS FIFO (order-tasks) | 120s | 256 MB | 50–500 | Background fill verification. Query broker for order status after execution. |
| maintenance | EventBridge (every 60s) | 300s | 256 MB | 1 | Fan-out coordinator. Scans Redis for all worker marks, partitions work, invokes maintenance_worker in parallel. |
| maintenance_worker | Lambda invoke (from maintenance) | 30s | 256 MB | 100 | Process individual orphan detection batch. Check ECS task status, clean up stale marks, stop orphan tasks. |
| pool_manager | EventBridge (every 5 min) | 60s | 256 MB | | Count pool workers, compare to target, launch or terminate to match desired pool size. |

Orchestrator Flow

```mermaid
sequenceDiagram
    participant API as FastAPI
    participant SQS as worker-control.fifo
    participant LC as λ worker_control
    participant Redis as Valkey
    participant Pool as Pool Worker
    participant Claim as pool-claim Queue
    participant ECS as ECS RunTask

    API->>SQS: Send start_worker message
    SQS->>LC: Trigger Lambda
    LC->>Redis: GET worker:active:{user_id}
    alt Worker already active
        LC-->>SQS: Delete message (no-op)
    else No active worker
        LC->>Redis: Check pool workers
        alt Pool has available worker
            LC->>Claim: Send claim message (user_id, credentials)
            Claim->>Pool: Pool worker receives claim
            Pool->>Redis: SET worker:active:{user_id} (TTL 30s)
            Pool->>Pool: Load credentials, connect to broker
            Note over Pool: ~332ms to ready
        else Pool empty
            LC->>ECS: RunTask (worker task definition)
            ECS->>ECS: Schedule on capacity provider
            Note over ECS: ~3103ms to ready
        end
    end
```

FIFO Guarantees

The worker-control.fifo queue uses user_id as the message group ID. This ensures that multiple start/stop commands for the same user are processed in order, preventing race conditions where a stop arrives before the start has completed.
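On the sending side, the ordering guarantee comes from setting the group ID on every message. A sketch of the `SendMessage` parameters the API would build (the message body shape and dedup scheme are illustrative):

```python
import json


def start_worker_message(user_id: str, queue_url: str) -> dict:
    """Build SendMessage parameters for a start_worker command on a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"action": "start_worker", "user_id": user_id}),
        # Same group ID => all commands for this user are delivered in order.
        "MessageGroupId": user_id,
        # Dedup ID collapses duplicate sends within SQS's 5-minute window;
        # a real scheme would include a request ID or timestamp.
        "MessageDeduplicationId": f"start:{user_id}",
    }

# boto3.client("sqs").send_message(**start_worker_message(uid, queue_url))
```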


SQS Queues

| Queue | Type | Visibility Timeout | Retention | DLQ | DLQ Max Receives | Purpose |
|---|---|---|---|---|---|---|
| worker-control.fifo | FIFO | 90s | 1 day | worker-control-dlq.fifo | 3 | Worker lifecycle commands (start, stop, claim). Message group: user_id. |
| order-tasks.fifo | FIFO | 180s | 1 day | order-tasks-dlq.fifo | 3 | Fill verification, delayed order checks. Message group: order_id. |
| pool-claim | Standard | 10s | 5 minutes | none | n/a | One-shot claim messages for pool workers. Short retention because unclaimed messages are stale. |

Dead Letter Queues

Both FIFO queues have DLQs that catch messages failing after 3 processing attempts. Both Lambda handlers use ReportBatchItemFailures so that only the specific failing record is retried — successfully processed records in the same batch are not re-delivered and do not have their receive count inflated.
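The partial-batch behavior comes from the shape of the handler's return value. A minimal sketch of an SQS batch handler using this pattern (the per-record logic is a placeholder):

```python
import json


def process_record(record: dict) -> None:
    """Placeholder for real per-message work (e.g. fill verification)."""
    json.loads(record["body"])  # raises on malformed JSON -> record is retried


def handler(event: dict, context=None) -> dict:
    """SQS batch handler returning a partial batch response.

    Only the records listed in batchItemFailures are returned to the queue;
    everything else in the batch is deleted, so successful records never
    have their receive count inflated.
    """
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Note this only takes effect if the event source mapping has `ReportBatchItemFailures` enabled in its `FunctionResponseTypes`; otherwise the return value is ignored and any failure retries the whole batch.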

| DLQ | Retention | CloudWatch Alarm | Dashboard |
|---|---|---|---|
| worker-control-dlq.fifo | 14 days | {env}-orchestrator-dlq-has-messages (> 0) | Yes |
| order-tasks-dlq.fifo | 14 days | {env}-order-tasks-dlq-has-messages (> 0) | Yes |

Both alarms send to the orchestrator-alerts SNS topic (email notification). The CloudWatch dashboard shows both DLQ message counts side by side.

DLQ Messages Are Genuine Failures

With ReportBatchItemFailures, only messages that truly failed 3 consecutive times reach the DLQ — no false positives from batch contamination. Common causes: Redis connectivity loss, ECS capacity exhausted, broker API persistently timing out. Action: check the corresponding Lambda error in CloudWatch Logs, fix the root cause, then redrive messages from the DLQ back to the main queue.
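Redriving can be done programmatically via SQS's `StartMessageMoveTask` API. A sketch of the request parameters, assuming that API; the ARN and rate cap are illustrative:

```python
def redrive_task_params(dlq_arn: str, rate_per_second: int = 10) -> dict:
    """Parameters for sqs.start_message_move_task.

    Omitting DestinationArn sends each message back to the queue it
    originally came from, which is what a post-fix redrive wants.
    """
    return {
        "SourceArn": dlq_arn,
        "MaxNumberOfMessagesPerSecond": rate_per_second,  # throttle the replay
    }

# boto3.client("sqs").start_message_move_task(
#     **redrive_task_params("arn:aws:sqs:us-east-1:123456789012:worker-control-dlq.fifo"))
```

The rate cap matters here: replaying a burst of stale `start_worker` commands at full speed could momentarily stampede the worker pool.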