
Worker Isolation System

The defining architectural decision of 4pass is per-user worker isolation: every active user gets a dedicated ECS container running their broker session. This is not a convenience — it's a requirement driven by four hard constraints.


Why Per-User Isolation

| Constraint | Explanation |
|---|---|
| Broker API Binding | Some broker SDKs (notably Shioaji) bind sessions to the originating IP or process. Sharing a process across users would break session affinity. |
| Credential Security | Each worker loads exactly one user's decrypted broker credentials. No worker ever has access to another user's secrets, even in the event of a memory dump or core dump. |
| Fault Isolation | A broker SDK crash, OOM, or hung connection kills only that user's worker. All other users are unaffected. The orchestrator restarts the failed worker within 60 seconds. |
| Resource Isolation | Memory-hungry broker sessions are capped at a 1024 MB hard limit. A runaway process triggers an OOM kill on the container only — the EC2 host and its other 29 workers continue uninterrupted. |

Non-Negotiable

Shared-worker architectures fail here. A single Shioaji session consuming 800 MB would starve other users. A broker API timeout in one session would block the event loop for everyone. Per-user isolation is the only viable model.


Worker Lifecycle

```mermaid
stateDiagram-v2
    [*] --> PoolIdle: Lambda launches task\n(USER_ID = -1)

    PoolIdle --> Claimed: Receives pool-claim\nmessage (~332ms)

    Claimed --> Connecting: Load credentials\nInitialize broker SDK

    Connecting --> Active: Broker session\nestablished
    Connecting --> Failed: Connection error

    Failed --> Connecting: Auto-retry\n(3 attempts)
    Failed --> Terminated: Max retries exceeded

    Active --> Processing: Order received\nfrom Redis queue

    Processing --> Active: Order complete\nResponse written

    Active --> HealthCheck: 60s health check
    HealthCheck --> Active: Healthy
    HealthCheck --> Reconnecting: Connection lost
    Reconnecting --> Active: Reconnected
    Reconnecting --> Terminated: Reconnect failed

    Active --> IdleTimeout: No activity\nfor 30 minutes

    IdleTimeout --> Terminated: Graceful shutdown\nClose broker session

    Terminated --> [*]: ECS task stopped\nRedis mark expires

    state Active {
        [*] --> Heartbeat
        Heartbeat --> Heartbeat: Refresh Redis mark\nevery 5s (TTL 30s)
    }
```

Lifecycle Timings

| State Transition | Duration | Mechanism |
|---|---|---|
| Pool idle → Claimed | ~332ms | SQS pool-claim message received |
| Claimed → Active | 1–5s | Credential decryption + broker handshake |
| Active → Processing | <10ms | Redis BLPOP on request queue |
| Processing → Active | 50–500ms | Broker API call + response write |
| Active → Idle timeout | 30 min | No orders or control messages |
| Health check interval | 60s | Broker session ping |
| Heartbeat interval | 5s | Redis SET with 30s TTL |

Pool Pre-Warming

The pool is the key to sub-second worker readiness. Pre-warmed workers sit in the ECS cluster with USER_ID=-1, running a stripped-down event loop that does nothing but wait for a claim message on the SQS pool-claim queue.

How It Works

  1. pool_manager Lambda runs every 5 minutes via EventBridge
  2. Counts current pool workers in Redis (key pattern: worker:active:-1:* or metadata scan)
  3. Compares to target pool size (Terraform variable)
  4. Launches or terminates workers to match target
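The reconciliation logic above reduces to a simple count-and-diff. A minimal sketch follows, with illustrative function names (the real Lambda would follow the decision with ECS RunTask / StopTask calls, omitted here):

```python
# Sketch of the pool_manager reconciliation cycle. Pure logic only;
# names are illustrative assumptions, not taken from the codebase.
from typing import Iterable, Tuple

POOL_KEY_PREFIX = "worker:active:-1"  # pre-warmed workers carry USER_ID=-1

def count_pool_workers(keys: Iterable[str]) -> int:
    """Count Redis keys that belong to pool (unclaimed) workers."""
    return sum(1 for k in keys if k.startswith(POOL_KEY_PREFIX))

def reconcile_pool(current: int, target: int) -> Tuple[str, int]:
    """Decide this cycle's action: launch the shortfall or terminate the surplus."""
    if current < target:
        return ("launch", target - current)
    if current > target:
        return ("terminate", current - target)
    return ("noop", 0)
```

Because the Lambda recomputes the delta from scratch every 5 minutes, a missed cycle self-corrects on the next run.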

Claim Flow

  1. API determines user needs a worker → sends to worker-control.fifo
  2. worker_control Lambda checks Redis for pool availability
  3. Lambda sends claim message to pool-claim queue: { user_id, encrypted_credentials }
  4. Pool worker's BLPOP picks up the message
  5. Worker transitions from USER_ID=-1 to USER_ID={claimed_user}
  6. Worker decrypts credentials, establishes broker connection
  7. Worker sets worker:active:{user_id} in Redis with 30s TTL
  8. Worker begins heartbeat loop (5s interval)
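Steps 3–5 of the claim flow can be sketched as a single message handler. The message fields match the `{ user_id, encrypted_credentials }` shape in step 3; everything else here is an illustrative assumption:

```python
# Sketch of the worker-side claim handler: parse the claim message and
# derive the worker's new Redis identity. Credential decryption and the
# broker handshake (steps 6-7) are out of scope here.
import json

def handle_claim(raw_message: str) -> dict:
    msg = json.loads(raw_message)
    user_id = msg["user_id"]
    return {
        "user_id": user_id,                        # worker leaves USER_ID=-1
        "active_key": f"worker:active:{user_id}",  # heartbeat mark, 30s TTL
        "encrypted_credentials": msg["encrypted_credentials"],
    }
```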

Benchmark Comparison

| Path | Step | Latency |
|---|---|---|
| Pool Claim | SQS → Lambda invocation | 565ms |
| | Lambda → pool claim → worker ready | 332ms |
| | **Total** | **897ms** |
| RunTask (cold) | SQS → Lambda invocation | 565ms |
| | Lambda → ECS RunTask → task running | 3,103ms |
| | **Total** | **3,659ms** |
| Cold EC2 Start | No instance available → ASG launch | 45,000–60,000ms |
| | + ECS task scheduling | +3,000ms |
| | **Total** | **48,000–63,000ms** |

4× Faster with Pool

Pool pre-warming reduces user-perceived latency from 3.6 seconds to under 1 second. For a trading platform where seconds matter, this is the difference between executing at the intended price and missing the move.


Redis Queue Communication

All communication between the API and workers flows through Redis lists using a request/response pattern. There is no direct network connection between the API process and any worker.

```mermaid
flowchart TB
    A["FastAPI"] -->|LPUSH| RQ["Request Queue"]
    RQ -->|BLPOP| W["Worker"]
    W -->|SET| RS["Response Key"]
    RS -->|GET| A2["FastAPI polls response"]
    W -->|"SET every 5s"| HB["Heartbeat mark"]
```

Key Patterns

| Key | Type | TTL | Operations |
|---|---|---|---|
| trading:user:{user_id}:requests | List | None | API: LPUSH / Worker: BLPOP (blocking, 30s timeout) |
| trading:response:{request_id} | String | 60s | Worker: SET / API: GET with polling |
| worker:active:{user_id} | String | 30s | Worker: SET every 5s / API & Lambda: GET |
| worker:metadata:{user_id} | Hash | 30s | Worker: HSET (task_arn, instance_id, launched_at) |
| worker:control:{user_id}:messages | List | None | API: LPUSH / Worker: LPOP (non-blocking check) |
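The full request/response round trip over these keys can be sketched end to end. This runs against an in-memory stand-in for Redis so it needs no server; the `FakeRedis` class and the uuid4 request id are illustrative assumptions, while the key names follow the table above:

```python
# Sketch of the Redis request/response pattern: API LPUSHes an order,
# the worker BLPOPs it, executes it, and SETs the response key (60s TTL);
# the API then GETs the response. FakeRedis ignores TTLs.
import json
import uuid
from collections import defaultdict, deque

class FakeRedis:
    """Just enough of the Redis API for this sketch."""
    def __init__(self):
        self.lists = defaultdict(deque)
        self.strings = {}
    def lpush(self, key, value):
        self.lists[key].appendleft(value)
    def blpop(self, key, timeout=30):
        lst = self.lists[key]
        return (key, lst.popleft()) if lst else None
    def set(self, key, value, ex=None):
        self.strings[key] = value
    def get(self, key):
        return self.strings.get(key)

def submit_order(r, user_id, order):
    """API side: LPUSH the order onto the user's request queue."""
    request_id = str(uuid.uuid4())
    payload = {"request_id": request_id, **order}
    r.lpush(f"trading:user:{user_id}:requests", json.dumps(payload))
    return request_id

def worker_step(r, user_id, execute):
    """Worker side: BLPOP one request, run it, SET the response key."""
    popped = r.blpop(f"trading:user:{user_id}:requests", timeout=30)
    if popped is None:
        return False
    request = json.loads(popped[1])
    result = execute(request)
    r.set(f"trading:response:{request['request_id']}", json.dumps(result), ex=60)
    return True

def poll_response(r, request_id):
    """API side: GET the response key; None until the worker writes it."""
    raw = r.get(f"trading:response:{request_id}")
    return json.loads(raw) if raw else None
```

Because the API only ever polls a per-request response key, it never needs to know which worker (or host) handled the order.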

Heartbeat Mechanism

The heartbeat is the foundation of the worker liveness protocol:

  1. Worker calls SET worker:active:{user_id} {timestamp} EX 30 every 5 seconds
  2. If the worker dies, the key expires in 30 seconds (6 missed heartbeats)
  3. API checks the key before routing orders — if absent, triggers worker start
  4. Maintenance Lambda scans all worker:active:* keys to detect orphans

The 30s TTL with 5s refresh gives a 25-second buffer — enough to survive brief Redis connectivity blips without falsely declaring a worker dead.
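The protocol reduces to one repeated command. This sketch builds the SET arguments as plain data so the timing math above is explicit; the real worker would hand them to a Redis client every interval (function names are illustrative):

```python
# The heartbeat as data: SET worker:active:{user_id} {timestamp} EX 30,
# refreshed every 5 seconds.
HEARTBEAT_INTERVAL = 5   # seconds between refreshes
HEARTBEAT_TTL = 30       # key lifetime in seconds

def heartbeat_command(user_id, timestamp):
    """Arguments for: SET worker:active:{user_id} {timestamp} EX 30."""
    return (f"worker:active:{user_id}", str(timestamp), HEARTBEAT_TTL)

def missed_beats_before_expiry():
    """How many consecutive refreshes can fail before the key expires."""
    return HEARTBEAT_TTL // HEARTBEAT_INTERVAL

def failure_buffer_seconds():
    """Slack between the last successful refresh and key expiry."""
    return HEARTBEAT_TTL - HEARTBEAT_INTERVAL
```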


Connection Management

Each worker maintains broker connections lazily — connections are created on first use and cached for the session lifetime.

Connection Key

Connections are keyed on a 4-tuple: (broker_name, broker_type, simulation, account_id). A user with both a Shioaji simulation account and a Shioaji live account gets two separate broker sessions.
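A minimal sketch of that lazy, 4-tuple-keyed cache, assuming a factory callable standing in for the actual broker SDK login (class and method names are illustrative):

```python
# Lazy connection cache keyed on (broker_name, broker_type, simulation,
# account_id). The factory is called once per distinct key.
class ConnectionCache:
    def __init__(self, factory):
        self._factory = factory
        self._conns = {}

    def get(self, broker_name, broker_type, simulation, account_id):
        """Return the cached session for this account, creating it on first use."""
        key = (broker_name, broker_type, simulation, account_id)
        if key not in self._conns:
            self._conns[key] = self._factory(key)
        return self._conns[key]

    def invalidate(self, broker_name, broker_type, simulation, account_id):
        """Drop a broken session so the next order triggers a reconnect."""
        self._conns.pop((broker_name, broker_type, simulation, account_id), None)
```

Note how a simulation and a live account for the same user hash to different keys, which is exactly the two-session behavior described above.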

Connection Lifecycle

| Event | Action |
|---|---|
| First order for account | Create connection, authenticate with broker |
| Subsequent orders | Reuse cached connection |
| Connection error | Mark invalid, auto-reconnect on next order |
| Health check (60s) | Ping broker API, verify session alive |
| Idle timeout (30m) | Close all connections, shut down worker |
| Credential reload | Close affected connection, reconnect with new credentials |

Health Check

Every 60 seconds during idle periods, the worker pings each active broker connection:

  • If healthy: no action
  • If unhealthy: close connection, mark for reconnect on next order
  • If broker API unreachable: log warning, retry on next check

This catches stale sessions before they cause order failures.
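The three outcomes above can be written as an explicit decision function. This is a sketch over boolean flags; the real worker derives them from a live broker session:

```python
# Health-check outcome dispatch, mirroring the three bullets above.
def health_check_action(reachable: bool, ping_ok: bool) -> str:
    if not reachable:
        return "log_warning"          # broker API down: retry on next check
    if ping_ok:
        return "none"                 # healthy: no action
    return "close_and_reconnect"      # stale session: reconnect on next order
```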


Control Messages

Workers monitor a control message queue in addition to the order queue:

worker:control:{user_id}:messages

Currently supported control messages:

| Message | Action |
|---|---|
| reload_credentials | Close broker connections, re-fetch and decrypt credentials from database, reconnect. No worker restart needed. |
| shutdown | Graceful shutdown — close all connections, stop heartbeat, exit. |

This enables credential rotation without service interruption. When a user updates their broker password in the dashboard, the API pushes a reload_credentials message, and the worker picks it up within seconds.
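A sketch of the worker-side dispatch for these messages. The message names come from the table above; the worker interface (`reload_credentials` / `shutdown` methods) is an assumption for illustration:

```python
# Control-message dispatch. Unknown messages are ignored rather than
# treated as errors, so new message types can be rolled out safely.
def handle_control(message: str, worker) -> str:
    if message == "reload_credentials":
        worker.reload_credentials()   # close, re-fetch, decrypt, reconnect
        return "reloaded"
    if message == "shutdown":
        worker.shutdown()             # close connections, stop heartbeat, exit
        return "stopped"
    return "ignored"
```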


Orphan Detection & Recovery

The maintenance system runs every 60 seconds to detect and clean up inconsistencies between Redis state and ECS reality.

```mermaid
flowchart TB
    EB["EventBridge<br/>Every 60 seconds"] --> MT["λ maintenance<br/>(coordinator)"]

    MT -->|Scan| Redis["Valkey<br/>All worker:active:* keys"]
    MT -->|List| ECS["ECS<br/>All running tasks"]

    MT -->|Compare| Decision{Detect anomalies}

    Decision -->|Orphan marks<br/>Redis key, no ECS task| MW1["λ maintenance_worker<br/>Delete stale Redis keys"]
    Decision -->|Orphan tasks<br/>ECS task, no Redis key| MW2["λ maintenance_worker<br/>Stop ECS tasks"]
    Decision -->|Stale marks<br/>Heartbeat expired| MW3["λ maintenance_worker<br/>Clean up resources"]

    MW1 & MW2 & MW3 -->|Results| CW["CloudWatch Metrics<br/>Shioaji/Maintenance namespace"]
```

Anomaly Types

| Anomaly | Detection | Resolution | Cause |
|---|---|---|---|
| Orphan Mark | Redis key exists, no matching ECS task | Delete Redis key | Task crashed without cleanup |
| Orphan Task | ECS task running, no Redis key | Stop ECS task | Redis key expired (network partition) |
| Stale Mark | Redis key TTL expired but not deleted | Clean up associated keys | Worker froze (deadlock, OOM) |
| Duplicate Workers | Two tasks for same user_id | Stop older task | Race condition in claim flow |

Fan-Out Architecture

The maintenance function uses a fan-out pattern for parallel processing:

  1. Coordinator (maintenance) scans all Redis marks and ECS tasks in one pass
  2. Partitions anomalies into batches of 100
  3. Invokes maintenance_worker Lambda for each batch (asynchronous Invoke with InvocationType=Event)
  4. Each worker processes its batch independently
  5. Results published to CloudWatch custom metrics

This scales linearly: at 10,000 active workers, the coordinator invokes ~100 maintenance_worker Lambdas that process in parallel, completing the full scan in under 5 seconds.
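The partitioning step (step 2) is a standalone function; the boto3 invoke that would follow each batch is shown only as a comment, since it needs AWS credentials:

```python
# Fan-out partitioning: one batch per maintenance_worker invocation.
def partition(anomalies, batch_size=100):
    """Split the anomaly list into batches of at most batch_size."""
    return [anomalies[i:i + batch_size]
            for i in range(0, len(anomalies), batch_size)]

# for batch in partition(anomalies):
#     lambda_client.invoke(FunctionName="maintenance_worker",
#                          InvocationType="Event",   # asynchronous, fire-and-forget
#                          Payload=json.dumps({"anomalies": batch}))
```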


Resource Allocation

| Resource | Value | Notes |
|---|---|---|
| CPU | 64 units | 6.25% of one vCPU (1024 units). Sufficient — workers are I/O-bound (waiting on broker API responses). |
| Memory (soft limit) | 384 MB | ECS scheduling target. Normal working set is 150–250 MB. |
| Memory (hard limit) | 1024 MB | OOM kill threshold. Broker SDK memory leaks are caught here. |
| ECS task count per instance | 30 | Conservative cap (could theoretically fit 42 by soft limit). |
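These limits map directly onto ECS container-definition fields. The sketch below uses the standard boto3/ECS field names; the container name and the surrounding task definition are assumptions:

```python
# Resource limits from the table above as an ECS container-definition
# fragment (cpu in CPU units, memory values in MB).
WORKER_CONTAINER = {
    "name": "trading-worker",     # hypothetical container name
    "cpu": 64,                    # 64 of 1024 units = 6.25% of one vCPU
    "memoryReservation": 384,     # soft limit: ECS scheduling target
    "memory": 1024,               # hard limit: OOM kill threshold
}
```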

OOM Kill Behavior

When a container exceeds 1024 MB, the Linux kernel's OOM killer terminates the container process. The EC2 instance is unaffected — all other containers continue running. The ECS agent detects the stopped task, the maintenance Lambda detects the missing heartbeat within 60 seconds, and the orchestrator launches a replacement. The user experiences a brief interruption (< 2 minutes) but no data loss — all order state is in Redis and PostgreSQL.