
Compute & Orchestration

The compute layer is split into two planes: a data plane (ECS on EC2) that runs long-lived API servers and per-user worker processes, and a control plane (Lambda + SQS) that orchestrates worker lifecycle, background tasks, and maintenance. This separation means the orchestrator has no state to lose and the workers have no management overhead.


ECS Cluster

The cluster uses two capacity providers backed by separate Auto Scaling Groups, each optimized for its workload profile:

```mermaid
flowchart LR
    subgraph cluster["ECS Cluster"]
        subgraph apiCP["API Capacity Provider"]
            API["m6i.large<br/>2 API tasks"]
        end
        subgraph workerCP["Worker Capacity Provider"]
            W["r6i.large<br/>~30 workers/instance"]
        end
    end

    subgraph control["Serverless Control Plane"]
        SQS["SQS FIFO x2"] --> L["Lambda x5"]
        EB["EventBridge"] --> L
    end

    L -->|"RunTask / Pool Claim"| W
```

Why Two Capacity Providers

API tasks need CPU headroom for request processing (compute-bound). Worker tasks need memory for broker SDK sessions but barely touch the CPU during idle periods (memory-bound). Mixing them on the same instance type wastes resources in both directions.


API Capacity Provider

| Parameter | Value |
|---|---|
| Instance Type | m6i.large (2 vCPU, 8 GB) |
| Tasks per Instance | 2 |
| CPU per Task | 1024 units (1 vCPU) |
| Memory per Task | 1536 MB soft / 3072 MB hard |
| Web Server | Gunicorn with 8 Uvicorn workers |
| Network Mode | bridge (dynamic host ports) |
| Deployment | Rolling update, circuit breaker enabled |
| Min Healthy % | 100% (zero-downtime deploys) |
| Max % | 200% (double capacity during deploy) |

Auto-Scaling Policies

The API service uses target-tracking scaling on two dimensions:

| Policy | Target | Scale-Out Cooldown | Scale-In Cooldown |
|---|---|---|---|
| CPU Utilization | 70% average | 60s | 300s |
| Request Count | 1000 req/min per target | 60s | 300s |

Scale-in is deliberately slow (300s cooldown) to avoid thrashing during intermittent traffic spikes around market open/close.
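These policies translate directly into Application Auto Scaling parameters. The sketch below builds the CPU policy as a `boto3` `put_scaling_policy` request; the cluster and service names are illustrative, not taken from this deployment.

```python
def cpu_tracking_policy(cluster: str, service: str) -> dict:
    """Build PutScalingPolicy parameters for 70% average CPU on an ECS service."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization",
            },
            "ScaleOutCooldown": 60,   # react quickly to load
            "ScaleInCooldown": 300,   # shrink slowly to avoid thrashing
        },
    }

# Applying it (requires IAM permissions on application-autoscaling):
# boto3.client("application-autoscaling").put_scaling_policy(
#     **cpu_tracking_policy("prod-cluster", "api"))
```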

Rolling Deployment

Deployments use ECS rolling update with circuit breaker:

  1. New task definition registered
  2. ECS launches new tasks (up to 200% capacity)
  3. ALB health checks confirm new tasks are healthy
  4. Old tasks drain connections (300s deregistration delay)
  5. Old tasks stopped

If the new tasks fail health checks, the circuit breaker automatically rolls back to the previous task definition — no manual intervention required.
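The behavior above maps onto the ECS service `deploymentConfiguration`. A minimal sketch of those settings, matching the table's 100%/200% bounds:

```python
def deployment_config() -> dict:
    """ECS service deploymentConfiguration for zero-downtime rolling updates."""
    return {
        "minimumHealthyPercent": 100,  # never drop below full capacity
        "maximumPercent": 200,         # allow doubling while new tasks warm up
        "deploymentCircuitBreaker": {
            "enable": True,
            "rollback": True,          # auto-rollback on failed health checks
        },
    }

# Passed to update_service / create_service, e.g.:
# ecs.update_service(cluster="prod-cluster", service="api",
#                    deploymentConfiguration=deployment_config())
```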


Worker Capacity Provider

| Parameter | Value |
|---|---|
| Instance Type | r6i.large (2 vCPU, 16 GB) |
| Tasks per Instance | 30 (conservative) |
| CPU per Task | 64 units |
| Memory Soft Limit | 384 MB |
| Memory Hard Limit | 1024 MB |
| Network Mode | bridge (shared host ENI) |
| Desired Count | Managed by Lambda orchestrator |

Capacity Math

Each r6i.large provides 2048 CPU units and 16,384 MB RAM:

| Resource | Available | Per Task | Max Tasks | Limiting? |
|---|---|---|---|---|
| CPU | 2048 units | 64 units | 2048 / 64 = 32 | Yes |
| Memory (soft) | 16,384 MB | 384 MB | 16,384 / 384 = 42 | No |
| Memory (hard) | 16,384 MB | 1024 MB | 16,384 / 1024 = 16 | Worst-case only |

CPU is the binding constraint at 32 tasks; the hard memory limit would only bind if every worker spiked to its ceiling simultaneously.

The target of 30 tasks per instance is conservative — it leaves headroom for:

  • OS and ECS agent overhead (~512 MB)
  • Temporary memory spikes during broker API calls
  • Container runtime overhead
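The arithmetic behind the 30-task target can be checked directly. This sketch subtracts the assumed ~512 MB host overhead before dividing; the numbers come from the tables above.

```python
def max_tasks_per_instance(
    cpu_units: int = 2048,     # r6i.large: 2 vCPU = 2048 units
    mem_mb: int = 16_384,      # r6i.large: 16 GB
    task_cpu: int = 64,        # per-task CPU reservation
    task_mem_soft: int = 384,  # per-task soft memory limit
    reserved_mb: int = 512,    # OS + ECS agent headroom
) -> int:
    """Nominal per-instance task ceiling: the tighter of CPU and soft memory."""
    by_cpu = cpu_units // task_cpu                    # 32
    by_mem = (mem_mb - reserved_mb) // task_mem_soft  # 41
    return min(by_cpu, by_mem)
```

The result is 32, so the configured target of 30 sits just under the CPU-bound ceiling.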

OOM Protection

If a worker exceeds its 1024 MB hard limit, the Linux OOM killer terminates only that container. The EC2 instance and all other workers continue unaffected. The maintenance Lambda detects the missing worker within 60 seconds and the orchestrator restarts it automatically.
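The detection step reduces to a set difference between worker marks in Redis and the tasks ECS actually reports as running. A minimal sketch, with the `user_id -> task_id` mapping shape assumed for illustration rather than taken from the real Redis schema:

```python
def find_orphaned_marks(active_marks: dict, running_task_ids: set) -> list:
    """Return user IDs whose worker mark points at a task ECS no longer runs.

    active_marks: user_id -> ECS task ID, as scanned from Redis (shape assumed).
    running_task_ids: task IDs reported by ecs.list_tasks / describe_tasks.
    """
    return [
        user_id
        for user_id, task_id in active_marks.items()
        if task_id not in running_task_ids
    ]
```

Each orphaned user ID would then be handed back to the orchestrator to clean up the stale mark and restart the worker.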

Bridge Networking

Workers use bridge mode instead of awsvpc:

| Feature | awsvpc | bridge |
|---|---|---|
| ENI per task | Yes (1 each) | No (shared) |
| Max tasks (m/r large) | ~3 | 30+ |
| Per-task security group | Yes | No (host SG) |
| Port mapping | Static | Dynamic |
| Cost impact | ENI limits require more instances | High density, fewer instances |

The trade-off is acceptable because workers only need outbound internet to reach broker APIs. They don't receive inbound connections — all communication flows through Redis.
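In task-definition terms, this shows up as a bridge-mode container with no port mappings at all. A sketch of the worker container fragment, with the image name and resource figures taken from the tables above (the container name is illustrative):

```python
def worker_container_definition(image: str) -> dict:
    """Container definition fragment for an outbound-only bridge-mode worker."""
    return {
        "name": "worker",
        "image": image,
        "cpu": 64,                 # CPU units (1024 units = 1 vCPU)
        "memoryReservation": 384,  # soft limit: what the scheduler reserves
        "memory": 1024,            # hard limit: the OOM-kill ceiling
        "essential": True,
        # No portMappings: workers make outbound calls to broker APIs and
        # Redis only; nothing connects in.
    }
```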


Lambda Orchestrator

Five Lambda functions form the serverless control plane:

| Function | Trigger | Timeout | Memory | Concurrency | Purpose |
|---|---|---|---|---|---|
| worker_control | SQS FIFO (worker-control) | 60s | 256 MB | 50–500 | Start, stop, claim workers. Pool assignment and RunTask fallback. |
| order_tasks | SQS FIFO (order-tasks) | 120s | 256 MB | 50–500 | Background fill verification. Query broker for order status after execution. |
| maintenance | EventBridge (every 60s) | 300s | 256 MB | 1 | Fan-out coordinator. Scans Redis for all worker marks, partitions work, invokes maintenance_worker in parallel. |
| maintenance_worker | Lambda invoke (from maintenance) | 30s | 256 MB | 100 | Process individual orphan detection batch. Check ECS task status, clean up stale marks, stop orphan tasks. |
| pool_manager | EventBridge (every 5 min) | 60s | 256 MB | | Count pool workers, compare to target, launch or terminate to match desired pool size. |

Orchestrator Flow

```mermaid
sequenceDiagram
    participant API as FastAPI
    participant SQS as worker-control.fifo
    participant LC as λ worker_control
    participant Redis as Valkey
    participant Pool as Pool Worker
    participant Claim as pool-claim Queue
    participant ECS as ECS RunTask

    API->>SQS: Send start_worker message
    SQS->>LC: Trigger Lambda
    LC->>Redis: GET worker:active:{user_id}
    alt Worker already active
        LC-->>SQS: Delete message (no-op)
    else No active worker
        LC->>Redis: Check pool workers
        alt Pool has available worker
            LC->>Claim: Send claim message (user_id, credentials)
            Claim->>Pool: Pool worker receives claim
            Pool->>Redis: SET worker:active:{user_id} (TTL 30s)
            Pool->>Pool: Load credentials, connect to broker
            Note over Pool: ~332ms to ready
        else Pool empty
            LC->>ECS: RunTask (worker task definition)
            ECS->>ECS: Schedule on capacity provider
            Note over ECS: ~3103ms to ready
        end
    end
```

FIFO Guarantees

The worker-control.fifo queue uses user_id as the message group ID. This ensures that multiple start/stop commands for the same user are processed in order, preventing race conditions where a stop arrives before the start has completed.
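On the sending side, the ordering guarantee comes from setting the group ID on every message. A sketch of the `SendMessage` parameters the API would build (the message body shape and dedup scheme are illustrative):

```python
import json


def start_worker_message(user_id: str, queue_url: str) -> dict:
    """Build SendMessage parameters for a start_worker command on a FIFO queue."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"action": "start_worker", "user_id": user_id}),
        # Same group ID => all commands for this user are delivered in order.
        "MessageGroupId": user_id,
        # Dedup ID collapses duplicate sends within SQS's 5-minute window;
        # a real scheme would include a request ID or timestamp.
        "MessageDeduplicationId": f"start:{user_id}",
    }

# boto3.client("sqs").send_message(**start_worker_message(uid, queue_url))
```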


SQS Queues

| Queue | Type | Visibility Timeout | Retention | DLQ | DLQ Max Receives | Purpose |
|---|---|---|---|---|---|---|
| worker-control.fifo | FIFO | 90s | 1 day | worker-control-dlq.fifo | 3 | Worker lifecycle commands (start, stop, claim). Message group: user_id. |
| order-tasks.fifo | FIFO | 180s | 1 day | order-tasks-dlq.fifo | 3 | Fill verification, delayed order checks. Message group: order_id. |
| pool-claim | Standard | 10s | 5 minutes | none | n/a | One-shot claim messages for pool workers. Short retention because unclaimed messages are stale. |

Dead Letter Queues

Both FIFO queues have DLQs that catch messages failing after 3 processing attempts. Both Lambda handlers use ReportBatchItemFailures so that only the specific failing record is retried — successfully processed records in the same batch are not re-delivered and do not have their receive count inflated.
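The partial-batch behavior comes from the shape of the handler's return value. A minimal sketch of an SQS batch handler using this pattern (the per-record logic is a placeholder):

```python
import json


def process_record(record: dict) -> None:
    """Placeholder for real per-message work (e.g. fill verification)."""
    json.loads(record["body"])  # raises on malformed JSON -> record is retried


def handler(event: dict, context=None) -> dict:
    """SQS batch handler returning a partial batch response.

    Only the records listed in batchItemFailures are returned to the queue;
    everything else in the batch is deleted, so successful records never
    have their receive count inflated.
    """
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Note this only takes effect if the event source mapping has `ReportBatchItemFailures` enabled in its `FunctionResponseTypes`; otherwise the return value is ignored and any failure retries the whole batch.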

| DLQ | Retention | CloudWatch Alarm | Dashboard |
|---|---|---|---|
| worker-control-dlq.fifo | 14 days | {env}-orchestrator-dlq-has-messages (> 0) | Yes |
| order-tasks-dlq.fifo | 14 days | {env}-order-tasks-dlq-has-messages (> 0) | Yes |

Both alarms send to the orchestrator-alerts SNS topic (email notification). The CloudWatch dashboard shows both DLQ message counts side by side.

DLQ Messages Are Genuine Failures

With ReportBatchItemFailures, only messages that truly failed 3 consecutive times reach the DLQ — no false positives from batch contamination. Common causes: Redis connectivity loss, ECS capacity exhausted, broker API persistently timing out. Action: check the corresponding Lambda error in CloudWatch Logs, fix the root cause, then redrive messages from the DLQ back to the main queue.
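Redriving can be done programmatically via SQS's `StartMessageMoveTask` API. A sketch of the request parameters, assuming that API; the ARN and rate cap are illustrative:

```python
def redrive_task_params(dlq_arn: str, rate_per_second: int = 10) -> dict:
    """Parameters for sqs.start_message_move_task.

    Omitting DestinationArn sends each message back to the queue it
    originally came from, which is what a post-fix redrive wants.
    """
    return {
        "SourceArn": dlq_arn,
        "MaxNumberOfMessagesPerSecond": rate_per_second,  # throttle the replay
    }

# boto3.client("sqs").start_message_move_task(
#     **redrive_task_params("arn:aws:sqs:us-east-1:123456789012:worker-control-dlq.fifo"))
```

The rate cap matters here: replaying a burst of stale `start_worker` commands at full speed could momentarily stampede the worker pool.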