Monitoring & Observability¶
Every component in the platform emits metrics, logs, and traces. The monitoring stack is built entirely on CloudWatch — no third-party agents, no additional infrastructure to manage. Alarms trigger SNS notifications for immediate attention, and a centralized dashboard provides a single-pane view of system health.
CloudWatch Metrics¶
Container Insights¶
ECS Container Insights is enabled on the cluster, providing per-task and per-service metrics without any application-level instrumentation:
| Metric | Granularity | What It Shows |
|---|---|---|
| CPU Utilization | Per task, per service | Processing load. API tasks should stay < 70%. Worker tasks typically < 5%. |
| Memory Utilization | Per task, per service | Memory pressure. Workers at > 80% of soft limit (384 MB) indicate broker SDK memory growth. |
| Network RX/TX | Per task | Traffic volume. Sudden spikes may indicate webhook floods or broker API retries. |
| Task Count | Per service | Running, pending, desired count. Divergence between desired and running indicates capacity issues. |
| Storage Read/Write | Per task | Disk I/O. Should be near-zero for workers (all state in Redis/RDS). |
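Container Insights metrics land in the `ECS/ContainerInsights` namespace and can be read back programmatically. The sketch below builds a `get_metric_data` query for per-service CPU; the helper names and the cluster/service values passed in are assumptions, while the namespace, metric, and dimension names follow standard Container Insights conventions:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reader for the Container Insights metrics above.
def cpu_query(cluster: str, service: str) -> dict:
    """Build one MetricDataQuery for average CPU over 60s periods."""
    return {
        "Id": "cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "ECS/ContainerInsights",
                "MetricName": "CpuUtilized",
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "ServiceName", "Value": service},
                ],
            },
            "Period": 60,
            "Stat": "Average",
        },
    }

def fetch_cpu(cluster: str, service: str, minutes: int = 15):
    import boto3  # deferred so the query builder stays testable offline
    end = datetime.now(timezone.utc)
    return boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=[cpu_query(cluster, service)],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
    )
```

The same query shape works for `MemoryUtilized`, `NetworkRxBytes`, and `RunningTaskCount` by swapping the metric name.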
Custom Metrics — Shioaji/Maintenance Namespace¶
The maintenance Lambda publishes custom metrics after every scan cycle:
| Metric | Unit | Description |
|---|---|---|
| `ActiveWorkerCount` | Count | Total active workers across all users |
| `OrphanMarksDetected` | Count | Redis keys with no matching ECS task |
| `OrphanTasksDetected` | Count | ECS tasks with no matching Redis key |
| `StaleMarksDetected` | Count | Redis keys past TTL that weren't cleaned |
| `WorkerLaunchLatency` | Milliseconds | Time from start command to worker ready |
| `PoolWorkerCount` | Count | Pre-warmed workers available in pool |
| `MaintenanceScanDuration` | Milliseconds | Total time for full scan + fan-out |
| `ClaimLatency` | Milliseconds | Time from claim request to worker active |
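As a concrete sketch of the publishing step, the Lambda's final `put_metric_data` call might look like the following. The helper names and the shape of the scan-results dict are assumptions; the namespace, metric names, and units match the table above:

```python
import datetime

# Units per metric, mirroring the Shioaji/Maintenance table above.
UNITS = {
    "ActiveWorkerCount": "Count",
    "OrphanMarksDetected": "Count",
    "OrphanTasksDetected": "Count",
    "StaleMarksDetected": "Count",
    "WorkerLaunchLatency": "Milliseconds",
    "PoolWorkerCount": "Count",
    "MaintenanceScanDuration": "Milliseconds",
    "ClaimLatency": "Milliseconds",
}

def build_metric_data(scan_results: dict) -> list[dict]:
    """Shape one scan cycle's results into CloudWatch MetricData entries."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return [
        {"MetricName": name, "Value": float(value),
         "Unit": UNITS[name], "Timestamp": now}
        for name, value in scan_results.items() if name in UNITS
    ]

def publish(scan_results: dict) -> None:
    import boto3  # deferred so the builder stays testable offline
    # put_metric_data accepts up to 1,000 entries per call
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Shioaji/Maintenance",
        MetricData=build_metric_data(scan_results),
    )
```

Publishing all metrics in a single call at the end of the scan keeps the Lambda's CloudWatch API usage to one request per cycle.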
These metrics feed the orchestrator dashboard and drive alarm thresholds for operational anomalies.
Alarms¶
| Alarm | Metric | Threshold | Period | Action |
|---|---|---|---|---|
| DLQ Messages | ApproximateNumberOfMessagesVisible on DLQ | > 0 | 1 min | SNS → Email |
| Worker-Control Lambda Errors | Errors for worker_control function | > 5 in 10 min | 5 min × 2 eval | SNS → Email |
| Maintenance Lambda Errors | Errors for maintenance function | > 3 in 10 min | 5 min × 2 eval | SNS → Email |
| High Orphan Count | OrphanMarksDetected + OrphanTasksDetected | > 10 in 5 min | 5 min | SNS → Email |
| Scale-Down Tasks | RunningTaskCount for worker service | 0 for 30 min | 30 min | Auto-scaling (scale in) |
| Scale-Down CPU | CPUReservation for worker cluster | < 5% for 30 min | 5 min × 6 eval | Auto-scaling (scale in) |
| API High CPU | CPUUtilization for API service | > 80% for 5 min | 5 min | Auto-scaling (scale out) |
| API 5xx Rate | HTTPCode_Target_5XX_Count on ALB | > 10 in 5 min | 5 min | SNS → Email |
| RDS CPU | CPUUtilization on RDS | > 80% for 10 min | 10 min | SNS → Email |
**DLQ Alarm is Critical**
A message in a dead-letter queue means a worker command or order task failed 3 consecutive times. This could mean: a Lambda function is crashing, Redis is unreachable, ECS is out of capacity, or a broker API is down. Every DLQ message requires investigation within minutes during market hours.
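The DLQ row above translates into a `put_metric_alarm` call along these lines. The alarm-name format is an assumption; the metric, threshold, and single one-minute evaluation period mirror the table:

```python
def dlq_alarm_params(queue_name: str, topic_arn: str) -> dict:
    """Hypothetical alarm definition matching the DLQ row above:
    any visible message (> 0) in one 60-second period fires the alarm."""
    return {
        "AlarmName": f"{queue_name}-dlq-messages",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
        # An empty DLQ may emit no datapoints at all; treat that as healthy.
        "TreatMissingData": "notBreaching",
    }
```

Applied with `boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params(...))`; `TreatMissingData="notBreaching"` matters because SQS stops emitting the metric when the queue is idle.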
Alarm Priority Matrix¶
| Priority | Alarms | Response Time |
|---|---|---|
| P0 — Immediate | DLQ Messages, Lambda Errors, API 5xx | < 5 minutes during market hours |
| P1 — Urgent | High Orphan Count, RDS CPU | < 15 minutes |
| P2 — Monitor | Scale-Down, API High CPU | Next business day |
Dashboard¶
The `orchestrator` CloudWatch dashboard provides a single view of system health, organized into four rows:
| Row | Widgets | Key Metrics |
|---|---|---|
| Lambda Health | Invocations, Errors, Duration | All 5 functions; p50/p95/p99 latency |
| Queue Health | Queue Depth, Message Age, DLQ Count | worker-control.fifo, order-tasks.fifo |
| ECS Health | Task Count, CPU, Memory | Running vs desired; API + Worker utilization |
| Maintenance | Active Workers, Orphans, Launch Latency | Pool claim vs RunTask timing |
Dashboard Widgets¶
| Widget | Source | Refresh | Purpose |
|---|---|---|---|
| Lambda Invocations | CloudWatch Logs | 1 min | Verify orchestrator is running. Zero invocations = EventBridge broken. |
| Lambda Errors | CloudWatch Logs | 1 min | Any non-zero value requires investigation. |
| Lambda Duration | CloudWatch Logs | 1 min | p95 > 30s on maintenance indicates scaling issues. |
| SQS Queue Depth | SQS metrics | 1 min | Growing depth = Lambda can't keep up. Scale concurrency. |
| SQS Message Age | SQS metrics | 1 min | Age > visibility timeout = messages being reprocessed. |
| DLQ Message Count | SQS metrics | 1 min | Must always be 0. Non-zero = broken processing. |
| ECS Running Tasks | ECS metrics | 1 min | Compare to desired. Divergence = capacity or scheduling issue. |
| Active Worker Count | Custom metric | 1 min | Tracks concurrent users. Correlates with business metrics. |
| Orphan Count | Custom metric | 1 min | Baseline should be < 2. Sustained > 5 indicates a systemic issue. |
| Worker Launch Latency | Custom metric | 1 min | Pool path should be < 1s. RunTask path < 5s. Degradation = pool empty. |
SNS Alerting¶
Alert Flow¶
```mermaid
flowchart TB
    Sources["ECS + Lambda + SQS + RDS + Custom Metrics"] --> CW["CloudWatch Metrics"]
    Logs["CloudWatch Logs"] --> MF["Metric Filters"] --> CW
    CW --> Alarms["Alarms"]
    Alarms -->|ALARM| SNS["SNS"] --> Email["Email"]
```
SNS Configuration¶
| Parameter | Value |
|---|---|
| Topic Name | `shioaji-alerts` |
| Protocol | Email |
| Subscribers | Operations team distribution list |
| Delivery Retry | 3 attempts with exponential backoff |
| Encryption | SSE enabled (AWS-managed key) |
Alert emails include:
- Alarm name and description
- Current metric value vs threshold
- State change (OK → ALARM or ALARM → OK)
- Timestamp (UTC)
- Direct link to CloudWatch console
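Subscribing the operations list to the topic is a one-time step; a minimal sketch of the `sns.subscribe` parameters (the helper name and email address are placeholders) follows:

```python
def alert_subscription(topic_arn: str, address: str) -> dict:
    """Hypothetical kwargs for sns.subscribe(): email protocol
    to the operations distribution list, per the table above."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "email",
        "Endpoint": address,
        "ReturnSubscriptionArn": True,
    }
```

Applied with `boto3.client("sns").subscribe(**alert_subscription(arn, "ops@example.com"))`; note that email subscribers must click the confirmation link SNS sends before deliveries begin.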
Audit Logging¶
Application-level audit logging captures every authenticated action for compliance, security monitoring, and incident investigation.
What's Captured¶
| Field | Source | Example |
|---|---|---|
| `user_id` | Session / JWT | `42` |
| `action` | Application code | `webhook.order_executed`, `auth.login`, `account.credentials_updated` |
| `ip_address` | `X-Forwarded-For` header | `203.0.113.42` |
| `user_agent` | `User-Agent` header | `Mozilla/5.0...` |
| `request_path` | Request object | `/api/v1/webhook/tradingview` |
| `request_method` | Request object | `POST` |
| `success` | Application logic | `true` / `false` |
| `error_message` | Exception handler | `null` or `"Invalid webhook token"` |
| `details` | Context-specific JSON | `{"order_id": "abc123", "symbol": "2330"}` |
| `created_at` | Server timestamp | `2025-02-27T09:15:32.841Z` |
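A `log_audit()` helper might assemble these fields as follows. This is a sketch: the function name matches the one in the data-flow diagram later in this page, but its signature and the `request` object's interface (anything exposing `headers`, `path`, `method`) are assumptions:

```python
import json

def build_audit_record(user_id, action, request, success=True,
                       error_message=None, details=None) -> dict:
    """Hypothetical shape of a log_audit() payload; field names
    mirror the audit table above."""
    # X-Forwarded-For may carry a chain of proxies; the client IP is first.
    forwarded = request.headers.get("X-Forwarded-For", "")
    return {
        "user_id": user_id,
        "action": action,
        "ip_address": forwarded.split(",")[0].strip() or None,
        "user_agent": request.headers.get("User-Agent"),
        "request_path": request.path,
        "request_method": request.method,
        "success": success,
        "error_message": error_message,
        "details": json.dumps(details) if details is not None else None,
        # created_at is left to the database's NOW() default
    }
```

Taking only the first `X-Forwarded-For` entry avoids logging the ALB's own address instead of the client's.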
Storage¶
Audit logs are stored in the `audit_logs` PostgreSQL table:

```sql
CREATE TABLE audit_logs (
    id BIGSERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    action VARCHAR(100) NOT NULL,
    ip_address INET,
    user_agent TEXT,
    request_path VARCHAR(500),
    request_method VARCHAR(10),
    success BOOLEAN DEFAULT true,
    error_message TEXT,
    details JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_audit_user_action ON audit_logs (user_id, action, created_at);
CREATE INDEX idx_audit_created ON audit_logs (created_at);
```
Audit Use Cases¶
| Use Case | Query Pattern |
|---|---|
| Security investigation | Filter by IP address + failed actions in a time window |
| Compliance reporting | All actions by user in a date range |
| Attack detection | Failed login attempts grouped by IP |
| Debugging | All webhook executions for a specific user + order |
| Usage analytics | Action counts by type over time |
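The "attack detection" row above might be implemented as a query like the following, assuming a DB-API connection with `pyformat`-style parameters (e.g. psycopg2); the helper name and the one-hour/five-attempt thresholds are illustrative:

```python
# Failed logins grouped by source IP over a recent window.
# Table and column names come from the audit_logs schema above.
FAILED_LOGINS_BY_IP = """
    SELECT ip_address, COUNT(*) AS attempts
    FROM audit_logs
    WHERE action = 'auth.login'
      AND success = false
      AND created_at > NOW() - INTERVAL '1 hour'
    GROUP BY ip_address
    HAVING COUNT(*) >= %(min_attempts)s
    ORDER BY attempts DESC;
"""

def suspicious_ips(conn, min_attempts: int = 5) -> list[tuple]:
    """Return (ip_address, attempts) rows exceeding the threshold."""
    with conn.cursor() as cur:
        cur.execute(FAILED_LOGINS_BY_IP, {"min_attempts": min_attempts})
        return cur.fetchall()
```

The `idx_audit_user_action` index does not help here; a query filtering on `action` alone scans `idx_audit_created`'s time range, which is adequate at the stated 90-day retention.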
**Retention Policy**
Audit logs are retained for 90 days in the primary database. At scale (10K+ users), older logs are archived to S3 in Parquet format for long-term retention and cost-effective querying via Athena.
Monitoring Data Flow¶
```mermaid
flowchart LR
    subgraph sources["Sources"]
        App["API + Workers + Lambda"]
    end
    subgraph cloudwatch["CloudWatch"]
        Logs["Logs"]
        Metrics["Metrics"]
        Alarms["Alarms"]
    end
    subgraph alerting["Alerting"]
        SNS["SNS"] --> Email["Email"]
    end
    App --> Logs --> Metrics --> Alarms --> SNS
    App -->|"log_audit()"| DB["audit_logs table"]
```
Metric Sources: Container Insights (CPU, memory, network), Lambda metrics (invocations, errors, duration), SQS metrics (depth, age, DLQ), RDS metrics (CPU, connections), Valkey metrics (ECPU, storage), plus custom Shioaji/Maintenance namespace.
**Observability Principle**
If it can fail, it has a metric. If the metric can breach a threshold, it has an alarm. If the alarm fires, someone gets an email. No silent failures.