
Worker Isolation System

The defining architectural decision of 4pass is per-user worker isolation: every active user gets a dedicated ECS container running their broker session. This is not a convenience — it's a requirement driven by four hard constraints.


Why Per-User Isolation

| Constraint | Explanation |
|---|---|
| Broker API Binding | Some broker SDKs (notably Shioaji) bind sessions to the originating IP or process. Sharing a process across users would break session affinity. |
| Credential Security | Each worker loads exactly one user's decrypted broker credentials. No worker ever has access to another user's secrets, even in the event of a memory dump or core dump. |
| Fault Isolation | A broker SDK crash, OOM, or hung connection kills only that user's worker. All other users are unaffected. The orchestrator restarts the failed worker within 60 seconds. |
| Resource Isolation | Memory-hungry broker sessions are capped at a 1024 MB hard limit. A runaway process triggers an OOM kill on the container only — the EC2 host and its other 29 workers continue uninterrupted. |

Non-Negotiable

Shared-worker architectures fail here. A single Shioaji session consuming 800 MB would starve other users. A broker API timeout in one session would block the event loop for everyone. Per-user isolation is the only viable model.


Worker Lifecycle

```mermaid
stateDiagram-v2
    [*] --> PoolIdle: Lambda launches task\n(USER_ID = -1)

    PoolIdle --> Claimed: Receives pool-claim\nmessage (~332ms)

    Claimed --> Connecting: Load credentials\nInitialize broker SDK

    Connecting --> Active: Broker session\nestablished
    Connecting --> Failed: Connection error

    Failed --> Connecting: Auto-retry\n(3 attempts)
    Failed --> Terminated: Max retries exceeded

    Active --> Processing: Order received\nfrom Redis queue

    Processing --> Active: Order complete\nResponse written

    Active --> HealthCheck: 60s health check
    HealthCheck --> Active: Healthy
    HealthCheck --> Reconnecting: Connection lost
    Reconnecting --> Active: Reconnected
    Reconnecting --> Terminated: Reconnect failed

    Active --> IdleTimeout: No activity\nfor 30 minutes

    IdleTimeout --> Terminated: Graceful shutdown\nClose broker session

    Terminated --> [*]: ECS task stopped\nRedis mark expires

    state Active {
        [*] --> Heartbeat
        Heartbeat --> Heartbeat: Refresh Redis mark\nevery 5s (TTL 30s)
    }
```

Lifecycle Timings

| State Transition | Duration | Mechanism |
|---|---|---|
| Pool idle → Claimed | ~332ms | SQS pool-claim message received |
| Claimed → Active | 1–5s | Credential decryption + broker handshake |
| Active → Processing | <10ms | Redis BLPOP on request queue |
| Processing → Active | 50–500ms | Broker API call + response write |
| Active → Idle timeout | 30 min | No orders or control messages |
| Health check interval | 60s | Broker session ping |
| Heartbeat interval | 5s | Redis SET with 30s TTL |

Pool Pre-Warming

The pool is the key to sub-second worker readiness. Pre-warmed workers sit in the ECS cluster with USER_ID=-1, running a stripped-down event loop that does nothing but wait for a claim message on the SQS pool-claim queue.

How It Works

  1. pool_manager Lambda runs every 5 minutes via EventBridge
  2. Counts current pool workers in Redis (key pattern: worker:active:-1:* or metadata scan)
  3. Compares to target pool size (Terraform variable)
  4. Launches or terminates workers to match target
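The reconciliation logic above reduces to a simple count-and-diff. A minimal sketch follows, with illustrative function names (the real Lambda would follow the decision with ECS RunTask / StopTask calls, omitted here):

```python
# Sketch of the pool_manager reconciliation cycle. Pure logic only;
# names are illustrative assumptions, not taken from the codebase.
from typing import Iterable, Tuple

POOL_KEY_PREFIX = "worker:active:-1"  # pre-warmed workers carry USER_ID=-1

def count_pool_workers(keys: Iterable[str]) -> int:
    """Count Redis keys that belong to pool (unclaimed) workers."""
    return sum(1 for k in keys if k.startswith(POOL_KEY_PREFIX))

def reconcile_pool(current: int, target: int) -> Tuple[str, int]:
    """Decide this cycle's action: launch the shortfall or terminate the surplus."""
    if current < target:
        return ("launch", target - current)
    if current > target:
        return ("terminate", current - target)
    return ("noop", 0)
```

Because the Lambda recomputes the delta from scratch every 5 minutes, a missed cycle self-corrects on the next run.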

Claim Flow

  1. API determines user needs a worker → sends to worker-control.fifo
  2. worker_control Lambda checks Redis for pool availability
  3. Lambda sends claim message to pool-claim queue: { user_id, encrypted_credentials }
  4. Pool worker's BLPOP picks up the message
  5. Worker transitions from USER_ID=-1 to USER_ID={claimed_user}
  6. Worker decrypts credentials, establishes broker connection
  7. Worker sets worker:active:{user_id} in Redis with 30s TTL
  8. Worker begins heartbeat loop (5s interval)
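Steps 3–5 of the claim flow can be sketched as a single message handler. The message fields match the `{ user_id, encrypted_credentials }` shape in step 3; everything else here is an illustrative assumption:

```python
# Sketch of the worker-side claim handler: parse the claim message and
# derive the worker's new Redis identity. Credential decryption and the
# broker handshake (steps 6-7) are out of scope here.
import json

def handle_claim(raw_message: str) -> dict:
    msg = json.loads(raw_message)
    user_id = msg["user_id"]
    return {
        "user_id": user_id,                        # worker leaves USER_ID=-1
        "active_key": f"worker:active:{user_id}",  # heartbeat mark, 30s TTL
        "encrypted_credentials": msg["encrypted_credentials"],
    }
```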

Benchmark Comparison

| Path | Step | Latency |
|---|---|---|
| Pool Claim | SQS → Lambda invocation | 565ms |
| | Lambda → pool claim → worker ready | 332ms |
| | **Total** | **897ms** |
| RunTask (cold) | SQS → Lambda invocation | 565ms |
| | Lambda → ECS RunTask → task running | 3,103ms |
| | **Total** | **3,659ms** |
| Cold EC2 Start | No instance available → ASG launch | 45,000–60,000ms |
| | + ECS task scheduling | +3,000ms |
| | **Total** | **48,000–63,000ms** |

4× Faster with Pool

Pool pre-warming reduces user-perceived latency from 3.6 seconds to under 1 second. For a trading platform where seconds matter, this is the difference between executing at the intended price and missing the move.


Redis Queue Communication

All communication between the API and workers flows through Redis lists using a request/response pattern. There is no direct network connection between the API process and any worker.

```mermaid
flowchart TB
    A["FastAPI"] -->|LPUSH| RQ["Request Queue"]
    RQ -->|BLPOP| W["Worker"]
    W -->|SET| RS["Response Key"]
    RS -->|GET| A2["FastAPI polls response"]
    W -->|"SET every 5s"| HB["Heartbeat mark"]
```

Key Patterns

| Key | Type | TTL | Operations |
|---|---|---|---|
| trading:user:{user_id}:requests | List | None | API: LPUSH / Worker: BLPOP (blocking, 30s timeout) |
| trading:response:{request_id} | String | 60s | Worker: SET / API: GET with polling |
| worker:active:{user_id} | String | 30s | Worker: SET every 5s / API & Lambda: GET |
| worker:metadata:{user_id} | Hash | 30s | Worker: HSET (task_arn, instance_id, launched_at) |
| worker:control:{user_id}:messages | List | None | API: LPUSH / Worker: LPOP (non-blocking check) |
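The full request/response round trip over these keys can be sketched end to end. This runs against an in-memory stand-in for Redis so it needs no server; the `FakeRedis` class and the uuid4 request id are illustrative assumptions, while the key names follow the table above:

```python
# Sketch of the Redis request/response pattern: API LPUSHes an order,
# the worker BLPOPs it, executes it, and SETs the response key (60s TTL);
# the API then GETs the response. FakeRedis ignores TTLs.
import json
import uuid
from collections import defaultdict, deque

class FakeRedis:
    """Just enough of the Redis API for this sketch."""
    def __init__(self):
        self.lists = defaultdict(deque)
        self.strings = {}
    def lpush(self, key, value):
        self.lists[key].appendleft(value)
    def blpop(self, key, timeout=30):
        lst = self.lists[key]
        return (key, lst.popleft()) if lst else None
    def set(self, key, value, ex=None):
        self.strings[key] = value
    def get(self, key):
        return self.strings.get(key)

def submit_order(r, user_id, order):
    """API side: LPUSH the order onto the user's request queue."""
    request_id = str(uuid.uuid4())
    payload = {"request_id": request_id, **order}
    r.lpush(f"trading:user:{user_id}:requests", json.dumps(payload))
    return request_id

def worker_step(r, user_id, execute):
    """Worker side: BLPOP one request, run it, SET the response key."""
    popped = r.blpop(f"trading:user:{user_id}:requests", timeout=30)
    if popped is None:
        return False
    request = json.loads(popped[1])
    result = execute(request)
    r.set(f"trading:response:{request['request_id']}", json.dumps(result), ex=60)
    return True

def poll_response(r, request_id):
    """API side: GET the response key; None until the worker writes it."""
    raw = r.get(f"trading:response:{request_id}")
    return json.loads(raw) if raw else None
```

Because the API only ever polls a per-request response key, it never needs to know which worker (or host) handled the order.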

Heartbeat Mechanism

The heartbeat is the foundation of the worker liveness protocol:

  1. Worker calls SET worker:active:{user_id} {timestamp} EX 30 every 5 seconds
  2. If the worker dies, the key expires in 30 seconds (6 missed heartbeats)
  3. API checks the key before routing orders — if absent, triggers worker start
  4. Maintenance Lambda scans all worker:active:* keys to detect orphans

The 30s TTL with 5s refresh gives a 25-second buffer — enough to survive brief Redis connectivity blips without falsely declaring a worker dead.
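The protocol reduces to one repeated command. This sketch builds the SET arguments as plain data so the timing math above is explicit; the real worker would hand them to a Redis client every interval (function names are illustrative):

```python
# The heartbeat as data: SET worker:active:{user_id} {timestamp} EX 30,
# refreshed every 5 seconds.
HEARTBEAT_INTERVAL = 5   # seconds between refreshes
HEARTBEAT_TTL = 30       # key lifetime in seconds

def heartbeat_command(user_id, timestamp):
    """Arguments for: SET worker:active:{user_id} {timestamp} EX 30."""
    return (f"worker:active:{user_id}", str(timestamp), HEARTBEAT_TTL)

def missed_beats_before_expiry():
    """How many consecutive refreshes can fail before the key expires."""
    return HEARTBEAT_TTL // HEARTBEAT_INTERVAL

def failure_buffer_seconds():
    """Slack between the last successful refresh and key expiry."""
    return HEARTBEAT_TTL - HEARTBEAT_INTERVAL
```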


Connection Management

Each worker maintains broker connections lazily — connections are created on first use and cached for the session lifetime.

Connection Key

Connections are keyed on a 4-tuple: (broker_name, broker_type, simulation, account_id). A user with both a Shioaji simulation account and a Shioaji live account gets two separate broker sessions.
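A minimal sketch of that lazy, 4-tuple-keyed cache, assuming a factory callable standing in for the actual broker SDK login (class and method names are illustrative):

```python
# Lazy connection cache keyed on (broker_name, broker_type, simulation,
# account_id). The factory is called once per distinct key.
class ConnectionCache:
    def __init__(self, factory):
        self._factory = factory
        self._conns = {}

    def get(self, broker_name, broker_type, simulation, account_id):
        """Return the cached session for this account, creating it on first use."""
        key = (broker_name, broker_type, simulation, account_id)
        if key not in self._conns:
            self._conns[key] = self._factory(key)
        return self._conns[key]

    def invalidate(self, broker_name, broker_type, simulation, account_id):
        """Drop a broken session so the next order triggers a reconnect."""
        self._conns.pop((broker_name, broker_type, simulation, account_id), None)
```

Note how a simulation and a live account for the same user hash to different keys, which is exactly the two-session behavior described above.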

Connection Lifecycle

| Event | Action |
|---|---|
| First order for account | Create connection, authenticate with broker |
| Subsequent orders | Reuse cached connection |
| Connection error | Mark invalid, auto-reconnect on next order |
| Health check (60s) | Ping broker API, verify session alive |
| Idle timeout (30m) | Close all connections, shut down worker |
| Credential reload | Close affected connection, reconnect with new credentials |

Health Check

Every 60 seconds during idle periods, the worker pings each active broker connection:

  • If healthy: no action
  • If unhealthy: close connection, mark for reconnect on next order
  • If broker API unreachable: log warning, retry on next check

This catches stale sessions before they cause order failures.
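The three outcomes above can be written as an explicit decision function. This is a sketch over boolean flags; the real worker derives them from a live broker session:

```python
# Health-check outcome dispatch, mirroring the three bullets above.
def health_check_action(reachable: bool, ping_ok: bool) -> str:
    if not reachable:
        return "log_warning"          # broker API down: retry on next check
    if ping_ok:
        return "none"                 # healthy: no action
    return "close_and_reconnect"      # stale session: reconnect on next order
```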


Control Messages

Workers monitor a control message queue in addition to the order queue:

worker:control:{user_id}:messages

Currently supported control messages:

| Message | Action |
|---|---|
| reload_credentials | Close broker connections, re-fetch and decrypt credentials from database, reconnect. No worker restart needed. |
| shutdown | Graceful shutdown — close all connections, stop heartbeat, exit. |

This enables credential rotation without service interruption. When a user updates their broker password in the dashboard, the API pushes a reload_credentials message, and the worker picks it up within seconds.
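A sketch of the worker-side dispatch for these messages. The message names come from the table above; the worker interface (`reload_credentials` / `shutdown` methods) is an assumption for illustration:

```python
# Control-message dispatch. Unknown messages are ignored rather than
# treated as errors, so new message types can be rolled out safely.
def handle_control(message: str, worker) -> str:
    if message == "reload_credentials":
        worker.reload_credentials()   # close, re-fetch, decrypt, reconnect
        return "reloaded"
    if message == "shutdown":
        worker.shutdown()             # close connections, stop heartbeat, exit
        return "stopped"
    return "ignored"
```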


Orphan Detection & Recovery

The maintenance system runs every 60 seconds to detect and clean up inconsistencies between Redis state and ECS reality.

```mermaid
flowchart TB
    EB["EventBridge<br/>Every 60 seconds"] --> MT["λ maintenance<br/>(coordinator)"]

    MT -->|Scan| Redis["Valkey<br/>All worker:active:* keys"]
    MT -->|List| ECS["ECS<br/>All running tasks"]

    MT -->|Compare| Decision{Detect anomalies}

    Decision -->|Orphan marks<br/>Redis key, no ECS task| MW1["λ maintenance_worker<br/>Delete stale Redis keys"]
    Decision -->|Orphan tasks<br/>ECS task, no Redis key| MW2["λ maintenance_worker<br/>Stop ECS tasks"]
    Decision -->|Stale marks<br/>Heartbeat expired| MW3["λ maintenance_worker<br/>Clean up resources"]

    MW1 & MW2 & MW3 -->|Results| CW["CloudWatch Metrics<br/>Shioaji/Maintenance namespace"]
```

Anomaly Types

| Anomaly | Detection | Resolution | Cause |
|---|---|---|---|
| Orphan Mark | Redis key exists, no matching ECS task | Delete Redis key | Task crashed without cleanup |
| Orphan Task | ECS task running, no Redis key | Stop ECS task | Redis key expired (network partition) |
| Stale Mark | Redis key TTL expired but not deleted | Clean up associated keys | Worker froze (deadlock, OOM) |
| Duplicate Workers | Two tasks for same user_id | Stop older task | Race condition in claim flow |

Fan-Out Architecture

The maintenance function uses a fan-out pattern for parallel processing:

  1. Coordinator (maintenance) scans all Redis marks and ECS tasks in one pass
  2. Partitions anomalies into batches of 100
  3. Invokes maintenance_worker Lambda for each batch (asynchronous Invoke with InvocationType=Event)
  4. Each worker processes its batch independently
  5. Results published to CloudWatch custom metrics

This scales linearly: at 10,000 active workers, the coordinator invokes ~100 maintenance_worker Lambdas that process in parallel, completing the full scan in under 5 seconds.
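The partitioning step (step 2) is a standalone function; the boto3 invoke that would follow each batch is shown only as a comment, since it needs AWS credentials:

```python
# Fan-out partitioning: one batch per maintenance_worker invocation.
def partition(anomalies, batch_size=100):
    """Split the anomaly list into batches of at most batch_size."""
    return [anomalies[i:i + batch_size]
            for i in range(0, len(anomalies), batch_size)]

# for batch in partition(anomalies):
#     lambda_client.invoke(FunctionName="maintenance_worker",
#                          InvocationType="Event",   # asynchronous, fire-and-forget
#                          Payload=json.dumps({"anomalies": batch}))
```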


Resource Allocation

| Resource | Value | Notes |
|---|---|---|
| CPU | 64 units | 6.25% of one vCPU (1024 units). Sufficient — workers are I/O-bound (waiting on broker API responses). |
| Memory (soft limit) | 384 MB | ECS scheduling target. Normal working set is 150–250 MB. |
| Memory (hard limit) | 1024 MB | OOM kill threshold. Broker SDK memory leaks are caught here. |
| ECS task count per instance | 30 | Conservative cap (could theoretically fit 42 by soft limit). |
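These limits map directly onto ECS container-definition fields. The sketch below uses the standard boto3/ECS field names; the container name and the surrounding task definition are assumptions:

```python
# Resource limits from the table above as an ECS container-definition
# fragment (cpu in CPU units, memory values in MB).
WORKER_CONTAINER = {
    "name": "trading-worker",     # hypothetical container name
    "cpu": 64,                    # 64 of 1024 units = 6.25% of one vCPU
    "memoryReservation": 384,     # soft limit: ECS scheduling target
    "memory": 1024,               # hard limit: OOM kill threshold
}
```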

OOM Kill Behavior

When a container exceeds 1024 MB, the Linux kernel's OOM killer terminates the container process. The EC2 instance is unaffected — all other containers continue running. The ECS agent detects the stopped task, the maintenance Lambda detects the missing heartbeat within 60 seconds, and the orchestrator launches a replacement. The user experiences a brief interruption (< 2 minutes) but no data loss — all order state is in Redis and PostgreSQL.