# Worker Isolation System
The defining architectural decision of 4pass is per-user worker isolation: every active user gets a dedicated ECS container running their broker session. This is not a convenience — it's a requirement driven by four hard constraints.
## Why Per-User Isolation
| Constraint | Explanation |
|---|---|
| Broker API Binding | Some broker SDKs (notably Shioaji) bind sessions to the originating IP or process. Sharing a process across users would break session affinity. |
| Credential Security | Each worker loads exactly one user's decrypted broker credentials. No worker ever has access to another user's secrets, even in the event of a memory dump or core dump. |
| Fault Isolation | A broker SDK crash, OOM, or hung connection kills only that user's worker. All other users are unaffected. The orchestrator restarts the failed worker within 60 seconds. |
| Resource Isolation | Memory-hungry broker sessions are capped at 1024 MB hard limit. A runaway process triggers an OOM kill on the container only — the EC2 host and its other 29 workers continue uninterrupted. |
**Non-Negotiable**

Shared-worker architectures fail here. A single Shioaji session consuming 800 MB would starve other users. A broker API timeout in one session would block the event loop for everyone. Per-user isolation is the only viable model.
## Worker Lifecycle

```mermaid
stateDiagram-v2
    [*] --> PoolIdle: Lambda launches task\n(USER_ID = -1)
    PoolIdle --> Claimed: Receives pool-claim\nmessage (~332ms)
    Claimed --> Connecting: Load credentials\nInitialize broker SDK
    Connecting --> Active: Broker session\nestablished
    Connecting --> Failed: Connection error
    Failed --> Connecting: Auto-retry\n(3 attempts)
    Failed --> Terminated: Max retries exceeded
    Active --> Processing: Order received\nfrom Redis queue
    Processing --> Active: Order complete\nResponse written
    Active --> HealthCheck: 60s health check
    HealthCheck --> Active: Healthy
    HealthCheck --> Reconnecting: Connection lost
    Reconnecting --> Active: Reconnected
    Reconnecting --> Terminated: Reconnect failed
    Active --> IdleTimeout: No activity\nfor 30 minutes
    IdleTimeout --> Terminated: Graceful shutdown\nClose broker session
    Terminated --> [*]: ECS task stopped\nRedis mark expires

    state Active {
        [*] --> Heartbeat
        Heartbeat --> Heartbeat: Refresh Redis mark\nevery 5s (TTL 30s)
    }
```
### Lifecycle Timings
| State Transition | Duration | Mechanism |
|---|---|---|
| Pool idle → Claimed | ~332ms | SQS pool-claim message received |
| Claimed → Active | 1–5s | Credential decryption + broker handshake |
| Active → Processing | <10ms | Redis BLPOP on request queue |
| Processing → Active | 50–500ms | Broker API call + response write |
| Active → Idle timeout | 30 min | No orders or control messages |
| Health check interval | 60s | Broker session ping |
| Heartbeat interval | 5s | Redis SET with 30s TTL |
## Pool Pre-Warming

The pool is the key to sub-second worker readiness. Pre-warmed workers sit in the ECS cluster with `USER_ID=-1`, running a stripped-down event loop that does nothing but wait for a claim message on the SQS pool-claim queue.
### How It Works

1. The `pool_manager` Lambda runs every 5 minutes via EventBridge
2. It counts current pool workers in Redis (key pattern: `worker:active:-1:*` or metadata scan)
3. It compares the count to the target pool size (a Terraform variable)
4. It launches or terminates workers to match the target
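The reconcile step reduces to a simple delta computation. Here is a minimal sketch of that logic; the function name and return shape are illustrative, not taken from the codebase:

```python
def reconcile_pool(current_idle: int, target: int) -> dict:
    """Decide how many pool workers to launch or terminate.

    Mirrors the pool_manager comparison: idle (USER_ID=-1) worker
    count in Redis vs. the Terraform-configured target pool size.
    """
    delta = target - current_idle
    return {
        "launch": max(delta, 0),       # under target: start more tasks
        "terminate": max(-delta, 0),   # over target: stop surplus tasks
    }

# Example: 3 idle workers against a target of 5 -> launch 2 more
print(reconcile_pool(3, 5))  # {'launch': 2, 'terminate': 0}
```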
### Claim Flow

1. The API determines a user needs a worker → sends to `worker-control.fifo`
2. The `worker_control` Lambda checks Redis for pool availability
3. The Lambda sends a claim message to the `pool-claim` queue: `{ user_id, encrypted_credentials }`
4. The pool worker's `BLPOP` picks up the message
5. The worker transitions from `USER_ID=-1` to `USER_ID={claimed_user}`
6. The worker decrypts credentials and establishes the broker connection
7. The worker sets `worker:active:{user_id}` in Redis with a 30s TTL
8. The worker begins its heartbeat loop (5s interval)
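The worker-side identity switch can be sketched as follows. This is a hedged illustration: the field names match the claim message shown above, but `handle_claim` and the stubbed `decrypt` helper are hypothetical names standing in for the real KMS/crypto path:

```python
import json

def decrypt(blob: str) -> str:
    # Placeholder for the real credential decryption (e.g. via KMS);
    # the actual mechanism is outside this sketch.
    return f"decrypted({blob})"

def handle_claim(raw_message: str) -> dict:
    """Parse a pool-claim payload and produce the worker's new identity.

    The worker was idle with USER_ID=-1; after this it owns one
    user's session and starts heartbeating on its Redis mark.
    """
    msg = json.loads(raw_message)
    user_id = msg["user_id"]
    return {
        "user_id": user_id,                        # was -1 while pooled
        "redis_mark": f"worker:active:{user_id}",  # set with EX 30
        "credentials": decrypt(msg["encrypted_credentials"]),
    }

claim = handle_claim('{"user_id": 42, "encrypted_credentials": "abc"}')
print(claim["redis_mark"])  # worker:active:42
```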
### Benchmark Comparison

| Path | Step | Latency |
|---|---|---|
| **Pool Claim** | SQS → Lambda invocation | 565ms |
| | Lambda → pool claim → worker ready | 332ms |
| | **Total** | **897ms** |
| **RunTask (cold)** | SQS → Lambda invocation | 565ms |
| | Lambda → ECS RunTask → task running | 3,103ms |
| | **Total** | **3,659ms** |
| **Cold EC2 Start** | No instance available → ASG launch | 45,000–60,000ms |
| | + ECS task scheduling | +3,000ms |
| | **Total** | **48,000–63,000ms** |
**4× Faster with Pool**

Pool pre-warming reduces user-perceived latency from 3.6 seconds to under 1 second. For a trading platform where seconds matter, this is the difference between executing at the intended price and missing the move.
## Redis Queue Communication

All communication between the API and workers flows through Redis lists using a request/response pattern. There is no direct network connection between the API process and any worker.

```mermaid
flowchart TB
    A["FastAPI"] -->|LPUSH| RQ["Request Queue"]
    RQ -->|BLPOP| W["Worker"]
    W -->|SET| RS["Response Key"]
    RS -->|GET| A2["FastAPI polls response"]
    W -->|"SET every 5s"| HB["Heartbeat mark"]
```
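The full request/response hop can be sketched end to end. This uses a small in-memory stand-in for Redis so the example is self-contained; the class and function names are illustrative, while the key formats match the patterns documented in this section:

```python
import uuid
from collections import defaultdict, deque

class FakeRedis:
    """In-memory stand-in for the Redis primitives used here."""
    def __init__(self):
        self.lists = defaultdict(deque)
        self.strings = {}
    def lpush(self, key, value):
        self.lists[key].appendleft(value)       # push to list head
    def blpop(self, key, timeout=30):
        # Real BLPOP blocks up to `timeout` seconds on the list head;
        # this stub returns immediately (None when empty).
        lst = self.lists[key]
        return lst.popleft() if lst else None
    def set(self, key, value, ex=None):
        self.strings[key] = value               # TTL ignored in stub
    def get(self, key):
        return self.strings.get(key)

def api_send_order(r, user_id, order):
    """API side: enqueue a request, return the id used to poll."""
    request_id = str(uuid.uuid4())
    r.lpush(f"trading:user:{user_id}:requests", (request_id, order))
    return request_id

def worker_process_one(r, user_id):
    """Worker side: pop one request, 'execute' it, write the response."""
    item = r.blpop(f"trading:user:{user_id}:requests")
    if item is None:
        return
    request_id, order = item
    result = {"status": "filled", "order": order}  # broker call stubbed
    r.set(f"trading:response:{request_id}", result, ex=60)

r = FakeRedis()
rid = api_send_order(r, 42, {"symbol": "2330", "qty": 1})
worker_process_one(r, 42)
print(r.get(f"trading:response:{rid}"))
```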
### Key Patterns

| Key | Type | TTL | Operations |
|---|---|---|---|
| `trading:user:{user_id}:requests` | List | None | API: `LPUSH` / Worker: `BLPOP` (blocking, 30s timeout) |
| `trading:response:{request_id}` | String | 60s | Worker: `SET` / API: `GET` with polling |
| `worker:active:{user_id}` | String | 30s | Worker: `SET` every 5s / API & Lambda: `GET` |
| `worker:metadata:{user_id}` | Hash | 30s | Worker: `HSET` (task_arn, instance_id, launched_at) |
| `worker:control:{user_id}:messages` | List | None | API: `LPUSH` / Worker: `LPOP` (non-blocking check) |
### Heartbeat Mechanism

The heartbeat is the foundation of the worker liveness protocol:

- The worker calls `SET worker:active:{user_id} {timestamp} EX 30` every 5 seconds
- If the worker dies, the key expires within 30 seconds (6 missed heartbeats)
- The API checks the key before routing orders — if absent, it triggers a worker start
- The maintenance Lambda scans all `worker:active:*` keys to detect orphans
The 30s TTL with 5s refresh gives a 25-second buffer — enough to survive brief Redis connectivity blips without falsely declaring a worker dead.
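The timing arithmetic above can be made explicit. A minimal sketch with illustrative function names, using the documented 5s refresh and 30s TTL:

```python
HEARTBEAT_INTERVAL = 5   # seconds between SET ... EX 30 refreshes
MARK_TTL = 30            # Redis key TTL

def missed_heartbeats_before_expiry() -> int:
    # After the last successful refresh the key survives MARK_TTL
    # seconds, during which this many refreshes would normally occur.
    return MARK_TTL // HEARTBEAT_INTERVAL

def is_alive(last_refresh: float, now: float) -> bool:
    """Liveness as seen by the API or maintenance Lambda: the mark
    exists iff fewer than MARK_TTL seconds passed since the last
    successful SET."""
    return (now - last_refresh) < MARK_TTL

print(missed_heartbeats_before_expiry())  # 6
```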
## Connection Management

Each worker maintains broker connections lazily — connections are created on first use and cached for the session lifetime.

### Connection Key

Connections are keyed on a 4-tuple: `(broker_name, broker_type, simulation, account_id)`. A user with both a Shioaji simulation account and a Shioaji live account gets two separate broker sessions.
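The lazy, 4-tuple-keyed cache can be sketched as below. `ConnectionCache` and `factory` are illustrative names; `factory` stands in for the real broker SDK login:

```python
class ConnectionCache:
    """Lazy per-account broker connections, keyed on the 4-tuple
    (broker_name, broker_type, simulation, account_id)."""
    def __init__(self, factory):
        self._factory = factory
        self._conns = {}

    def get(self, broker_name, broker_type, simulation, account_id):
        key = (broker_name, broker_type, simulation, account_id)
        if key not in self._conns:           # first order: connect
            self._conns[key] = self._factory(*key)
        return self._conns[key]              # subsequent orders: reuse

    def invalidate(self, *key):
        # Connection-error path: drop it so the next order reconnects.
        self._conns.pop(key, None)

# A simulation account and a live account yield two separate sessions:
cache = ConnectionCache(lambda *k: f"session{k}")
sim = cache.get("shioaji", "stock", True, "A123")
live = cache.get("shioaji", "stock", False, "A123")
print(sim == live)  # False
```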
### Connection Lifecycle
| Event | Action |
|---|---|
| First order for account | Create connection, authenticate with broker |
| Subsequent orders | Reuse cached connection |
| Connection error | Mark invalid, auto-reconnect on next order |
| Health check (60s) | Ping broker API, verify session alive |
| Idle timeout (30m) | Close all connections, shut down worker |
| Credential reload | Close affected connection, reconnect with new credentials |
### Health Check
Every 60 seconds during idle periods, the worker pings each active broker connection:
- If healthy: no action
- If unhealthy: close connection, mark for reconnect on next order
- If broker API unreachable: log warning, retry on next check
This catches stale sessions before they cause order failures.
## Control Messages

In addition to the order queue, each worker monitors a control message queue. Currently supported control messages:
| Message | Action |
|---|---|
| `reload_credentials` | Close broker connections, re-fetch and decrypt credentials from the database, reconnect. No worker restart needed. |
| `shutdown` | Graceful shutdown — close all connections, stop heartbeat, exit. |

This enables credential rotation without service interruption. When a user updates their broker password in the dashboard, the API pushes a `reload_credentials` message, and the worker picks it up within seconds.
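The worker-side dispatch loop can be sketched as follows. The `deque` here stands in for the Redis control list, and the handler names on `DemoWorker` are illustrative:

```python
from collections import deque

def check_control_queue(queue: deque, worker) -> None:
    """Non-blocking LPOP-style drain of the control queue, run
    between orders. `worker` exposes the two handlers below."""
    while queue:
        message = queue.popleft()           # LPOP: head of the list
        if message == "reload_credentials":
            worker.reload_credentials()     # reconnect, no restart
        elif message == "shutdown":
            worker.shutdown()               # graceful exit
            break

class DemoWorker:
    def __init__(self):
        self.events = []
    def reload_credentials(self):
        self.events.append("reloaded")
    def shutdown(self):
        self.events.append("stopped")

w = DemoWorker()
check_control_queue(deque(["reload_credentials", "shutdown"]), w)
print(w.events)  # ['reloaded', 'stopped']
```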
## Orphan Detection & Recovery

The maintenance system runs continuously to detect and clean up inconsistencies between Redis state and ECS reality.

```mermaid
flowchart TB
    EB["EventBridge<br/>Every 60 seconds"] --> MT["λ maintenance<br/>(coordinator)"]
    MT -->|Scan| Redis["Valkey<br/>All worker:active:* keys"]
    MT -->|List| ECS["ECS<br/>All running tasks"]
    MT -->|Compare| Decision{Detect anomalies}
    Decision -->|"Orphan marks<br/>Redis key, no ECS task"| MW1["λ maintenance_worker<br/>Delete stale Redis keys"]
    Decision -->|"Orphan tasks<br/>ECS task, no Redis key"| MW2["λ maintenance_worker<br/>Stop ECS tasks"]
    Decision -->|"Stale marks<br/>Heartbeat expired"| MW3["λ maintenance_worker<br/>Clean up resources"]
    MW1 & MW2 & MW3 -->|Results| CW["CloudWatch Metrics<br/>Shioaji/Maintenance namespace"]
```
### Anomaly Types
| Anomaly | Detection | Resolution | Cause |
|---|---|---|---|
| Orphan Mark | Redis key exists, no matching ECS task | Delete Redis key | Task crashed without cleanup |
| Orphan Task | ECS task running, no Redis key | Stop ECS task | Redis key expired (network partition) |
| Stale Mark | Redis key TTL expired but not deleted | Clean up associated keys | Worker froze (deadlock, OOM) |
| Duplicate Workers | Two tasks for same user_id | Stop older task | Race condition in claim flow |
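The first two anomaly classes reduce to a set comparison between user ids seen in Redis marks and user ids owned by running ECS tasks. A minimal sketch (function name illustrative; stale-mark and duplicate detection need TTL and task metadata, so they are omitted):

```python
def detect_anomalies(redis_users: set, ecs_users: set) -> dict:
    """Classify mismatches between Redis marks and ECS tasks."""
    return {
        "orphan_marks": redis_users - ecs_users,  # key exists, no task
        "orphan_tasks": ecs_users - redis_users,  # task runs, no key
    }

# Redis says users {1, 2, 3} have workers; ECS is running {2, 3, 4}:
print(detect_anomalies({1, 2, 3}, {2, 3, 4}))
# {'orphan_marks': {1}, 'orphan_tasks': {4}}
```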
### Fan-Out Architecture

The maintenance function uses a fan-out pattern for parallel processing:

1. The coordinator (`maintenance`) scans all Redis marks and ECS tasks in one pass
2. It partitions anomalies into batches of 100
3. It invokes the `maintenance_worker` Lambda for each batch (async `InvokeAsync`)
4. Each worker processes its batch independently
5. Results are published to CloudWatch custom metrics
This scales linearly: at 10,000 active workers, the coordinator invokes ~100 maintenance_worker Lambdas that process in parallel, completing the full scan in under 5 seconds.
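The coordinator-side batching step is a straightforward partition; a minimal sketch (function name illustrative):

```python
def partition(anomalies: list, batch_size: int = 100) -> list:
    """Split anomalies into batches; each batch becomes one async
    maintenance_worker invocation."""
    return [anomalies[i:i + batch_size]
            for i in range(0, len(anomalies), batch_size)]

# 10,000 anomalies -> 100 parallel maintenance_worker invocations
batches = partition(list(range(10_000)))
print(len(batches))  # 100
```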
## Resource Allocation
| Resource | Value | Notes |
|---|---|---|
| CPU | 64 units | 3.125% of one vCPU. Sufficient — workers are I/O-bound (waiting on broker API responses). |
| Memory (soft limit) | 384 MB | ECS scheduling target. Normal working set is 150–250 MB. |
| Memory (hard limit) | 1024 MB | OOM kill threshold. Broker SDK memory leaks are caught here. |
| ECS task count per instance | 30 | Conservative cap (could theoretically fit 42 by soft limit). |
**OOM Kill Behavior**
When a container exceeds 1024 MB, the Linux kernel's OOM killer terminates the container process. The EC2 instance is unaffected — all other containers continue running. The ECS agent detects the stopped task, the maintenance Lambda detects the missing heartbeat within 60 seconds, and the orchestrator launches a replacement. The user experiences a brief interruption (< 2 minutes) but no data loss — all order state is in Redis and PostgreSQL.