Monitoring & Observability¶
Every component in the platform emits metrics, logs, and traces. The monitoring stack is built entirely on CloudWatch — no third-party agents, no additional infrastructure to manage. Alarms trigger SNS notifications for immediate attention, and a centralized dashboard provides a single-pane view of system health.
CloudWatch Metrics¶
Container Insights¶
ECS Container Insights is enabled on the cluster, providing per-task and per-service metrics without any application-level instrumentation:
| Metric | Granularity | What It Shows |
|---|---|---|
| CPU Utilization | Per task, per service | Processing load. API tasks should stay < 70%. Worker tasks typically < 5%. |
| Memory Utilization | Per task, per service | Memory pressure. Workers at > 80% of soft limit (384 MB) indicate broker SDK memory growth. |
| Network RX/TX | Per task | Traffic volume. Sudden spikes may indicate webhook floods or broker API retries. |
| Task Count | Per service | Running, pending, desired count. Divergence between desired and running indicates capacity issues. |
| Storage Read/Write | Per task | Disk I/O. Should be near-zero for workers (all state in Redis/RDS). |
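Container Insights metrics land in the `ECS/ContainerInsights` namespace and can be read back programmatically. The sketch below builds a `get_metric_data` query for per-service CPU; the helper names and the cluster/service values passed in are assumptions, while the namespace, metric, and dimension names follow standard Container Insights conventions:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reader for the Container Insights metrics above.
def cpu_query(cluster: str, service: str) -> dict:
    """Build one MetricDataQuery for average CPU over 60s periods."""
    return {
        "Id": "cpu",
        "MetricStat": {
            "Metric": {
                "Namespace": "ECS/ContainerInsights",
                "MetricName": "CpuUtilized",
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "ServiceName", "Value": service},
                ],
            },
            "Period": 60,
            "Stat": "Average",
        },
    }

def fetch_cpu(cluster: str, service: str, minutes: int = 15):
    import boto3  # deferred so the query builder stays testable offline
    end = datetime.now(timezone.utc)
    return boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=[cpu_query(cluster, service)],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
    )
```

The same query shape works for `MemoryUtilized`, `NetworkRxBytes`, and `RunningTaskCount` by swapping the metric name.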
Custom Metrics — Shioaji/Maintenance Namespace¶
The maintenance Lambda publishes custom metrics after every scan cycle:
| Metric | Unit | Description |
|---|---|---|
| `ActiveWorkerCount` | Count | Total active workers across all users |
| `OrphanMarksDetected` | Count | Redis keys with no matching ECS task |
| `OrphanTasksDetected` | Count | ECS tasks with no matching Redis key |
| `StaleMarksDetected` | Count | Redis keys past TTL that weren't cleaned |
| `WorkerLaunchLatency` | Milliseconds | Time from start command to worker ready |
| `PoolWorkerCount` | Count | Pre-warmed workers available in pool |
| `MaintenanceScanDuration` | Milliseconds | Total time for full scan + fan-out |
| `ClaimLatency` | Milliseconds | Time from claim request to worker active |
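As a concrete sketch of the publishing step, the Lambda's final `put_metric_data` call might look like the following. The helper names and the shape of the scan-results dict are assumptions; the namespace, metric names, and units match the table above:

```python
import datetime

# Units per metric, mirroring the Shioaji/Maintenance table above.
UNITS = {
    "ActiveWorkerCount": "Count",
    "OrphanMarksDetected": "Count",
    "OrphanTasksDetected": "Count",
    "StaleMarksDetected": "Count",
    "WorkerLaunchLatency": "Milliseconds",
    "PoolWorkerCount": "Count",
    "MaintenanceScanDuration": "Milliseconds",
    "ClaimLatency": "Milliseconds",
}

def build_metric_data(scan_results: dict) -> list[dict]:
    """Shape one scan cycle's results into CloudWatch MetricData entries."""
    now = datetime.datetime.now(datetime.timezone.utc)
    return [
        {"MetricName": name, "Value": float(value),
         "Unit": UNITS[name], "Timestamp": now}
        for name, value in scan_results.items() if name in UNITS
    ]

def publish(scan_results: dict) -> None:
    import boto3  # deferred so the builder stays testable offline
    # put_metric_data accepts up to 1,000 entries per call
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Shioaji/Maintenance",
        MetricData=build_metric_data(scan_results),
    )
```

Publishing all metrics in a single call at the end of the scan keeps the Lambda's CloudWatch API usage to one request per cycle.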
These metrics feed the orchestrator dashboard and drive alarm thresholds for operational anomalies.
Alarms¶
| Alarm | Metric | Threshold | Period | Action |
|---|---|---|---|---|
| DLQ Messages | ApproximateNumberOfMessagesVisible on DLQ | > 0 | 1 min | SNS → Email |
| Worker-Control Lambda Errors | Errors for worker_control function | > 5 in 10 min | 5 min × 2 eval | SNS → Email |
| Maintenance Lambda Errors | Errors for maintenance function | > 3 in 10 min | 5 min × 2 eval | SNS → Email |
| High Orphan Count | OrphanMarksDetected + OrphanTasksDetected | > 10 in 5 min | 5 min | SNS → Email |
| Scale-Down Tasks | RunningTaskCount for worker service | 0 for 30 min | 30 min | Auto-scaling (scale in) |
| Scale-Down CPU | CPUReservation for worker cluster | < 5% for 30 min | 5 min × 6 eval | Auto-scaling (scale in) |
| API High CPU | CPUUtilization for API service | > 80% for 5 min | 5 min | Auto-scaling (scale out) |
| API 5xx Rate | HTTPCode_Target_5XX_Count on ALB | > 10 in 5 min | 5 min | SNS → Email |
| RDS CPU | CPUUtilization on RDS | > 80% for 10 min | 10 min | SNS → Email |
**DLQ Alarm is Critical**
A message in a dead-letter queue means a worker command or order task failed 3 consecutive times. This could mean: a Lambda function is crashing, Redis is unreachable, ECS is out of capacity, or a broker API is down. Every DLQ message requires investigation within minutes during market hours.
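The DLQ row above translates into a `put_metric_alarm` call along these lines. The alarm-name format is an assumption; the metric, threshold, and single one-minute evaluation period mirror the table:

```python
def dlq_alarm_params(queue_name: str, topic_arn: str) -> dict:
    """Hypothetical alarm definition matching the DLQ row above:
    any visible message (> 0) in one 60-second period fires the alarm."""
    return {
        "AlarmName": f"{queue_name}-dlq-messages",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
        # An empty DLQ may emit no datapoints at all; treat that as healthy.
        "TreatMissingData": "notBreaching",
    }
```

Applied with `boto3.client("cloudwatch").put_metric_alarm(**dlq_alarm_params(...))`; `TreatMissingData="notBreaching"` matters because SQS stops emitting the metric when the queue is idle.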
Alarm Priority Matrix¶
| Priority | Alarms | Response Time |
|---|---|---|
| P0 — Immediate | DLQ Messages, Lambda Errors, API 5xx | < 5 minutes during market hours |
| P1 — Urgent | High Orphan Count, RDS CPU | < 15 minutes |
| P2 — Monitor | Scale-Down, API High CPU | Next business day |
Dashboard¶
The `orchestrator` CloudWatch dashboard provides a single view of system health, organized into four rows:
| Row | Widgets | Key Metrics |
|---|---|---|
| Lambda Health | Invocations, Errors, Duration | All 5 functions; p50/p95/p99 latency |
| Queue Health | Queue Depth, Message Age, DLQ Count | worker-control.fifo, order-tasks.fifo |
| ECS Health | Task Count, CPU, Memory | Running vs desired; API + Worker utilization |
| Maintenance | Active Workers, Orphans, Launch Latency | Pool claim vs RunTask timing |
Dashboard Widgets¶
| Widget | Source | Refresh | Purpose |
|---|---|---|---|
| Lambda Invocations | CloudWatch Logs | 1 min | Verify orchestrator is running. Zero invocations = EventBridge broken. |
| Lambda Errors | CloudWatch Logs | 1 min | Any non-zero value requires investigation. |
| Lambda Duration | CloudWatch Logs | 1 min | p95 > 30s on maintenance indicates scaling issues. |
| SQS Queue Depth | SQS metrics | 1 min | Growing depth = Lambda can't keep up. Scale concurrency. |
| SQS Message Age | SQS metrics | 1 min | Age > visibility timeout = messages being reprocessed. |
| DLQ Message Count | SQS metrics | 1 min | Must always be 0. Non-zero = broken processing. |
| ECS Running Tasks | ECS metrics | 1 min | Compare to desired. Divergence = capacity or scheduling issue. |
| Active Worker Count | Custom metric | 1 min | Tracks concurrent users. Correlates with business metrics. |
| Orphan Count | Custom metric | 1 min | Baseline should be < 2. Sustained > 5 indicates a systemic issue. |
| Worker Launch Latency | Custom metric | 1 min | Pool path should be < 1s. RunTask path < 5s. Degradation = pool empty. |
SNS Alerting¶
Alert Flow¶
```mermaid
flowchart TB
    Sources["ECS + Lambda + SQS + RDS + Custom Metrics"] --> CW["CloudWatch Metrics"]
    Logs["CloudWatch Logs"] --> MF["Metric Filters"] --> CW
    CW --> Alarms["Alarms"]
    Alarms -->|ALARM| SNS["SNS"] --> Email["Email"]
```
SNS Configuration¶
| Parameter | Value |
|---|---|
| Topic Name | `shioaji-alerts` |
| Protocol | Email |
| Subscribers | Operations team distribution list |
| Delivery Retry | 3 attempts with exponential backoff |
| Encryption | SSE enabled (AWS-managed key) |
Alert emails include:
- Alarm name and description
- Current metric value vs threshold
- State change (OK → ALARM or ALARM → OK)
- Timestamp (UTC)
- Direct link to CloudWatch console
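Subscribing the operations list to the topic is a one-time step; a minimal sketch of the `sns.subscribe` parameters (the helper name and email address are placeholders) follows:

```python
def alert_subscription(topic_arn: str, address: str) -> dict:
    """Hypothetical kwargs for sns.subscribe(): email protocol
    to the operations distribution list, per the table above."""
    return {
        "TopicArn": topic_arn,
        "Protocol": "email",
        "Endpoint": address,
        "ReturnSubscriptionArn": True,
    }
```

Applied with `boto3.client("sns").subscribe(**alert_subscription(arn, "ops@example.com"))`; note that email subscribers must click the confirmation link SNS sends before deliveries begin.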
Audit Logging¶
Application-level audit logging captures every authenticated action for compliance, security monitoring, and incident investigation.
What's Captured¶
| Field | Source | Example |
|---|---|---|
| `user_id` | Session / JWT | `42` |
| `action` | Application code | `webhook.order_executed`, `auth.login`, `account.credentials_updated` |
| `ip_address` | `X-Forwarded-For` header | `203.0.113.42` |
| `user_agent` | `User-Agent` header | `Mozilla/5.0...` |
| `request_path` | Request object | `/api/v1/webhook/tradingview` |
| `request_method` | Request object | `POST` |
| `success` | Application logic | `true` / `false` |
| `error_message` | Exception handler | `null` or `"Invalid webhook token"` |
| `details` | Context-specific JSON | `{"order_id": "abc123", "symbol": "2330"}` |
| `created_at` | Server timestamp | `2025-02-27T09:15:32.841Z` |
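A `log_audit()` helper might assemble these fields as follows. This is a sketch: the function name matches the one in the data-flow diagram later in this page, but its signature and the `request` object's interface (anything exposing `headers`, `path`, `method`) are assumptions:

```python
import json

def build_audit_record(user_id, action, request, success=True,
                       error_message=None, details=None) -> dict:
    """Hypothetical shape of a log_audit() payload; field names
    mirror the audit table above."""
    # X-Forwarded-For may carry a chain of proxies; the client IP is first.
    forwarded = request.headers.get("X-Forwarded-For", "")
    return {
        "user_id": user_id,
        "action": action,
        "ip_address": forwarded.split(",")[0].strip() or None,
        "user_agent": request.headers.get("User-Agent"),
        "request_path": request.path,
        "request_method": request.method,
        "success": success,
        "error_message": error_message,
        "details": json.dumps(details) if details is not None else None,
        # created_at is left to the database's NOW() default
    }
```

Taking only the first `X-Forwarded-For` entry avoids logging the ALB's own address instead of the client's.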
Storage¶
Audit logs are stored in the `audit_logs` PostgreSQL table:

```sql
CREATE TABLE audit_logs (
    id BIGSERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    action VARCHAR(100) NOT NULL,
    ip_address INET,
    user_agent TEXT,
    request_path VARCHAR(500),
    request_method VARCHAR(10),
    success BOOLEAN DEFAULT true,
    error_message TEXT,
    details JSONB,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX idx_audit_user_action ON audit_logs (user_id, action, created_at);
CREATE INDEX idx_audit_created ON audit_logs (created_at);
```
Audit Use Cases¶
| Use Case | Query Pattern |
|---|---|
| Security investigation | Filter by IP address + failed actions in a time window |
| Compliance reporting | All actions by user in a date range |
| Attack detection | Failed login attempts grouped by IP |
| Debugging | All webhook executions for a specific user + order |
| Usage analytics | Action counts by type over time |
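The "attack detection" row above might be implemented as a query like the following, assuming a DB-API connection with `pyformat`-style parameters (e.g. psycopg2); the helper name and the one-hour/five-attempt thresholds are illustrative:

```python
# Failed logins grouped by source IP over a recent window.
# Table and column names come from the audit_logs schema above.
FAILED_LOGINS_BY_IP = """
    SELECT ip_address, COUNT(*) AS attempts
    FROM audit_logs
    WHERE action = 'auth.login'
      AND success = false
      AND created_at > NOW() - INTERVAL '1 hour'
    GROUP BY ip_address
    HAVING COUNT(*) >= %(min_attempts)s
    ORDER BY attempts DESC;
"""

def suspicious_ips(conn, min_attempts: int = 5) -> list[tuple]:
    """Return (ip_address, attempts) rows exceeding the threshold."""
    with conn.cursor() as cur:
        cur.execute(FAILED_LOGINS_BY_IP, {"min_attempts": min_attempts})
        return cur.fetchall()
```

The `idx_audit_user_action` index does not help here; a query filtering on `action` alone scans `idx_audit_created`'s time range, which is adequate at the stated 90-day retention.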
**Retention Policy**
Audit logs are retained for 90 days in the primary database. At scale (10K+ users), older logs are archived to S3 in Parquet format for long-term retention and cost-effective querying via Athena.
Monitoring Data Flow¶
```mermaid
flowchart LR
    subgraph sources["Sources"]
        App["API + Workers + Lambda"]
    end
    subgraph cloudwatch["CloudWatch"]
        Logs["Logs"]
        Metrics["Metrics"]
        Alarms["Alarms"]
    end
    subgraph alerting["Alerting"]
        SNS["SNS"] --> Email["Email"]
    end
    App --> Logs --> Metrics --> Alarms --> SNS
    App -->|"log_audit()"| DB["audit_logs table"]
```
Metric Sources: Container Insights (CPU, memory, network), Lambda metrics (invocations, errors, duration), SQS metrics (depth, age, DLQ), RDS metrics (CPU, connections), Valkey metrics (ECPU, storage), plus custom Shioaji/Maintenance namespace.
**Observability Principle**
If it can fail, it has a metric. If the metric can breach a threshold, it has an alarm. If the alarm fires, someone gets an email. No silent failures.