
Monitoring & Observability

Every component in the platform emits metrics, logs, and traces. The monitoring stack is built entirely on CloudWatch — no third-party agents, no additional infrastructure to manage. Alarms trigger SNS notifications for immediate attention, and a centralized dashboard provides a single-pane view of system health.


CloudWatch Metrics

Container Insights

ECS Container Insights is enabled on the cluster, providing per-task and per-service metrics without any application-level instrumentation:

| Metric | Granularity | What It Shows |
|---|---|---|
| CPU Utilization | Per task, per service | Processing load. API tasks should stay < 70%. Worker tasks typically < 5%. |
| Memory Utilization | Per task, per service | Memory pressure. Workers at > 80% of soft limit (384 MB) indicate broker SDK memory growth. |
| Network RX/TX | Per task | Traffic volume. Sudden spikes may indicate webhook floods or broker API retries. |
| Task Count | Per service | Running, pending, desired count. Divergence between desired and running indicates capacity issues. |
| Storage Read/Write | Per task | Disk I/O. Should be near-zero for workers (all state in Redis/RDS). |
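
Container Insights series like those above can also be pulled programmatically. A minimal sketch using boto3 `get_metric_data` against the standard `ECS/ContainerInsights` namespace; the cluster/service names and helper functions are illustrative assumptions, not part of the platform code:

```python
# Sketch: query per-service CPU from Container Insights.
# Cluster/service names below are placeholders.
from datetime import datetime, timedelta, timezone


def cpu_query(cluster: str, service: str, query_id: str = "svc_cpu") -> dict:
    """Build a MetricDataQuery for average per-service CPU, 1-minute resolution."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "ECS/ContainerInsights",
                "MetricName": "CpuUtilized",
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "ServiceName", "Value": service},
                ],
            },
            "Period": 60,       # 1-minute datapoints, matching the dashboard refresh
            "Stat": "Average",
        },
    }


def fetch_cpu_last_hour(cluster: str, service: str) -> list[float]:
    """Fetch the last hour of CPU datapoints for one service."""
    import boto3  # deferred import: only needed when actually calling AWS

    now = datetime.now(timezone.utc)
    resp = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=[cpu_query(cluster, service)],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
    )
    return resp["MetricDataResults"][0]["Values"]
```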

Custom Metrics — Shioaji/Maintenance Namespace

The maintenance Lambda publishes custom metrics after every scan cycle:

| Metric | Unit | Description |
|---|---|---|
| ActiveWorkerCount | Count | Total active workers across all users |
| OrphanMarksDetected | Count | Redis keys with no matching ECS task |
| OrphanTasksDetected | Count | ECS tasks with no matching Redis key |
| StaleMarksDetected | Count | Redis keys past TTL that weren't cleaned |
| WorkerLaunchLatency | Milliseconds | Time from start command to worker ready |
| PoolWorkerCount | Count | Pre-warmed workers available in pool |
| MaintenanceScanDuration | Milliseconds | Total time for full scan + fan-out |
| ClaimLatency | Milliseconds | Time from claim request to worker active |

These metrics feed the orchestrator dashboard and drive alarm thresholds for operational anomalies.
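A publish step like this can be sketched with boto3 `put_metric_data`. The metric names and units come from the table above; the helper names and the shape of the scan-stats dict are assumptions, not the actual Lambda code:

```python
# Sketch: publish scan-cycle results to the Shioaji/Maintenance namespace.
# Only metric names/units are from the docs; everything else is illustrative.

# Unit per metric, as documented in the custom-metrics table.
METRIC_UNITS = {
    "ActiveWorkerCount": "Count",
    "OrphanMarksDetected": "Count",
    "OrphanTasksDetected": "Count",
    "StaleMarksDetected": "Count",
    "WorkerLaunchLatency": "Milliseconds",
    "PoolWorkerCount": "Count",
    "MaintenanceScanDuration": "Milliseconds",
    "ClaimLatency": "Milliseconds",
}


def build_metric_payload(scan_stats: dict) -> list[dict]:
    """Translate a scan-cycle summary into CloudWatch MetricDatum entries."""
    return [
        {"MetricName": name, "Value": float(value), "Unit": METRIC_UNITS[name]}
        for name, value in scan_stats.items()
        if name in METRIC_UNITS  # ignore anything not in the documented set
    ]


def publish_scan_metrics(scan_stats: dict) -> None:
    """Push one scan cycle's metrics in a single PutMetricData call."""
    import boto3  # deferred import keeps the builder testable offline

    boto3.client("cloudwatch").put_metric_data(
        Namespace="Shioaji/Maintenance",
        MetricData=build_metric_payload(scan_stats),
    )
```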


Alarms

| Alarm | Metric | Threshold | Period | Action |
|---|---|---|---|---|
| DLQ Messages | ApproximateNumberOfMessagesVisible on DLQ | > 0 | 1 min | SNS → Email |
| Worker-Control Lambda Errors | Errors for worker_control function | > 5 in 10 min | 5 min × 2 eval | SNS → Email |
| Maintenance Lambda Errors | Errors for maintenance function | > 3 in 10 min | 5 min × 2 eval | SNS → Email |
| High Orphan Count | OrphanMarksDetected + OrphanTasksDetected | > 10 in 5 min | 5 min | SNS → Email |
| Scale-Down Tasks | RunningTaskCount for worker service | 0 for 30 min | 30 min | Auto-scaling (scale in) |
| Scale-Down CPU | CPUReservation for worker cluster | < 5% for 30 min | 5 min × 6 eval | Auto-scaling (scale in) |
| API High CPU | CPUUtilization for API service | > 80% for 5 min | 5 min | Auto-scaling (scale out) |
| API 5xx Rate | HTTPCode_Target_5XX_Count on ALB | > 10 in 5 min | 5 min | SNS → Email |
| RDS CPU | CPUUtilization on RDS | > 80% for 10 min | 10 min | SNS → Email |
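
The first row of this table, the DLQ alarm, can be expressed as a `put_metric_alarm` call. A sketch under the table's stated threshold and period; the queue name, alarm name, and topic ARN are placeholders:

```python
# Sketch: the "DLQ Messages" alarm (> 0 visible messages, 1-minute period,
# SNS action) as boto3 put_metric_alarm parameters. Names are placeholders.


def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    """Alarm whenever any message is visible in the dead-letter queue."""
    return {
        "AlarmName": f"{queue_name}-dlq-messages",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,                    # 1-minute period, per the table
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",   # i.e. > 0
        "TreatMissingData": "notBreaching",             # empty DLQ = healthy
        "AlarmActions": [sns_topic_arn],                # SNS → Email
    }


def create_dlq_alarm(queue_name: str, sns_topic_arn: str) -> None:
    import boto3  # deferred import keeps the builder testable offline

    boto3.client("cloudwatch").put_metric_alarm(
        **dlq_alarm_params(queue_name, sns_topic_arn)
    )
```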

DLQ Alarm is Critical

A message in a dead-letter queue means a worker command or order task failed 3 consecutive times. This could mean: a Lambda function is crashing, Redis is unreachable, ECS is out of capacity, or a broker API is down. Every DLQ message requires investigation within minutes during market hours.

Alarm Priority Matrix

| Priority | Alarms | Response Time |
|---|---|---|
| P0 — Immediate | DLQ Messages, Lambda Errors, API 5xx | < 5 minutes during market hours |
| P1 — Urgent | High Orphan Count, RDS CPU | < 15 minutes |
| P2 — Monitor | Scale-Down, API High CPU | Next business day |

Dashboard

The CloudWatch dashboard orchestrator provides a single view of system health, organized into 4 rows:

| Row | Widgets | Key Metrics |
|---|---|---|
| Lambda Health | Invocations, Errors, Duration | All 5 functions; p50/p95/p99 latency |
| Queue Health | Queue Depth, Message Age, DLQ Count | worker-control.fifo, order-tasks.fifo |
| ECS Health | Task Count, CPU, Memory | Running vs desired; API + Worker utilization |
| Maintenance | Active Workers, Orphans, Launch Latency | Pool claim vs RunTask timing |

Dashboard Widgets

| Widget | Source | Refresh | Purpose |
|---|---|---|---|
| Lambda Invocations | CloudWatch Logs | 1 min | Verify orchestrator is running. Zero invocations = EventBridge broken. |
| Lambda Errors | CloudWatch Logs | 1 min | Any non-zero value requires investigation. |
| Lambda Duration | CloudWatch Logs | 1 min | p95 > 30s on maintenance indicates scaling issues. |
| SQS Queue Depth | SQS metrics | 1 min | Growing depth = Lambda can't keep up. Scale concurrency. |
| SQS Message Age | SQS metrics | 1 min | Age > visibility timeout = messages being reprocessed. |
| DLQ Message Count | SQS metrics | 1 min | Must always be 0. Non-zero = broken processing. |
| ECS Running Tasks | ECS metrics | 1 min | Compare to desired. Divergence = capacity or scheduling issue. |
| Active Worker Count | Custom metric | 1 min | Tracks concurrent users. Correlates with business metrics. |
| Orphan Count | Custom metric | 1 min | Baseline should be < 2. Sustained > 5 indicates a systemic issue. |
| Worker Launch Latency | Custom metric | 1 min | Pool path should be < 1s. RunTask path < 5s. Degradation = pool empty. |

SNS Alerting

Alert Flow

```mermaid
flowchart TB
    Sources["ECS + Lambda + SQS + RDS + Custom Metrics"] --> CW["CloudWatch Metrics"]
    Logs["CloudWatch Logs"] --> MF["Metric Filters"] --> CW
    CW --> Alarms["Alarms"]
    Alarms -->|ALARM| SNS["SNS"] --> Email["Email"]
```

SNS Configuration

| Parameter | Value |
|---|---|
| Topic Name | shioaji-alerts |
| Protocol | Email |
| Subscribers | Operations team distribution list |
| Delivery Retry | 3 attempts with exponential backoff |
| Encryption | SSE enabled (AWS-managed key) |

Alert emails include:

  • Alarm name and description
  • Current metric value vs threshold
  • State change (OK → ALARM or ALARM → OK)
  • Timestamp (UTC)
  • Direct link to CloudWatch console
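
These fields correspond to the JSON document CloudWatch publishes to SNS on each state change. A minimal parser sketch, assuming the standard CloudWatch alarm notification shape (`AlarmName`, `OldStateValue`, `NewStateValue`, `StateChangeTime`, `NewStateReason`); any downstream consumer, such as a chat-ops relay, would work from the same structure:

```python
# Sketch: condense a CloudWatch alarm notification (as delivered via SNS)
# into a one-line summary. Field names are the standard alarm-message keys.
import json


def summarize_alarm(sns_message: str) -> str:
    """Return 'name: OK -> ALARM at <time> (<reason>)' from the raw JSON."""
    m = json.loads(sns_message)
    return (
        f"{m['AlarmName']}: {m['OldStateValue']} -> {m['NewStateValue']} "
        f"at {m['StateChangeTime']} ({m['NewStateReason']})"
    )
```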

Audit Logging

Application-level audit logging captures every authenticated action for compliance, security monitoring, and incident investigation.

What's Captured

| Field | Source | Example |
|---|---|---|
| user_id | Session / JWT | 42 |
| action | Application code | webhook.order_executed, auth.login, account.credentials_updated |
| ip_address | X-Forwarded-For header | 203.0.113.42 |
| user_agent | User-Agent header | Mozilla/5.0... |
| request_path | Request object | /api/v1/webhook/tradingview |
| request_method | Request object | POST |
| success | Application logic | true / false |
| error_message | Exception handler | null or "Invalid webhook token" |
| details | Context-specific JSON | {"order_id": "abc123", "symbol": "2330"} |
| created_at | Server timestamp | 2025-02-27T09:15:32.841Z |
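
An application-side helper along the lines of the `log_audit()` call shown in the data-flow diagram might assemble these fields like this. A sketch only: the field names mirror the table above, but the function signature and request object are assumptions:

```python
# Sketch: build one audit_logs row from the request context.
# Field names match the documented schema; the signature is illustrative.
import json
from datetime import datetime, timezone


def build_audit_row(user_id, action, request, success=True,
                    error_message=None, details=None) -> dict:
    """Assemble an audit row; `request` is any object with headers/path/method."""
    # First hop of X-Forwarded-For is the original client behind the ALB.
    forwarded = request.headers.get("X-Forwarded-For", "")
    client_ip = forwarded.split(",")[0].strip() or None
    return {
        "user_id": user_id,
        "action": action,                       # e.g. "webhook.order_executed"
        "ip_address": client_ip,
        "user_agent": request.headers.get("User-Agent"),
        "request_path": request.path,
        "request_method": request.method,
        "success": success,
        "error_message": error_message,
        "details": json.dumps(details) if details is not None else None,  # -> JSONB
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```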

Storage

Audit logs are stored in the audit_logs PostgreSQL table:

```sql
CREATE TABLE audit_logs (
    id BIGSERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    action VARCHAR(100) NOT NULL,
    ip_address INET,
    user_agent TEXT,
    request_path VARCHAR(500),
    request_method VARCHAR(10),
    success BOOLEAN DEFAULT true,
    error_message TEXT,
    details JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_audit_user_action ON audit_logs (user_id, action, created_at);
CREATE INDEX idx_audit_created ON audit_logs (created_at);
```

Audit Use Cases

| Use Case | Query Pattern |
|---|---|
| Security investigation | Filter by IP address + failed actions in a time window |
| Compliance reporting | All actions by user in a date range |
| Attack detection | Failed login attempts grouped by IP |
| Debugging | All webhook executions for a specific user + order |
| Usage analytics | Action counts by type over time |

Retention Policy

Audit logs are retained for 90 days in the primary database. At scale (10K+ users), older logs are archived to S3 in Parquet format for long-term retention and cost-effective querying via Athena.


Monitoring Data Flow

```mermaid
flowchart LR
    subgraph sources["Sources"]
        App["API + Workers + Lambda"]
    end

    subgraph cloudwatch["CloudWatch"]
        Logs["Logs"]
        Metrics["Metrics"]
        Alarms["Alarms"]
    end

    subgraph alerting["Alerting"]
        SNS["SNS"] --> Email["Email"]
    end

    App --> Logs --> Metrics --> Alarms --> SNS
    App -->|"log_audit()"| DB["audit_logs table"]
```

Metric Sources: Container Insights (CPU, memory, network), Lambda metrics (invocations, errors, duration), SQS metrics (depth, age, DLQ), RDS metrics (CPU, connections), Valkey metrics (ECPU, storage), plus custom Shioaji/Maintenance namespace.

Observability Principle

If it can fail, it has a metric. If the metric can breach a threshold, it has an alarm. If the alarm fires, someone gets an email. No silent failures.