Scaling to 100K Users¶
Every scaling tier is a terraform.tfvars change. Auto-scaling does the rest.
The platform was designed so that no component requires replacement to reach 100K users. Instance types get larger, replica counts increase, concurrency limits go up — but the architecture, the code, and the deployment pipeline remain the same. This page documents exactly what changes at each order of magnitude.
Scaling Math¶
The empirical model for peak concurrent workers:
| Variable | Value | Rationale |
|---|---|---|
| DAU Rate | 50% | Half of registered users are active on any given trading day |
| Peak Concurrency | 70% | Of daily actives, 70% are online simultaneously during market hours (09:00–10:30 peak) |
| Peak Workers | 0.35 × N | 0.50 DAU × 0.70 concurrency; each concurrent user requires one dedicated worker |
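The model above reduces to a one-line formula. A minimal sketch:

```python
import math

# Scaling model from the table above:
#   50% of registered users are active on a given trading day (DAU rate)
#   70% of daily actives are online simultaneously during the peak window
DAU_RATE = 0.50
PEAK_CONCURRENCY = 0.70

def peak_workers(registered_users: int) -> int:
    """Peak concurrent workers = 0.35 x N (one dedicated worker per concurrent user)."""
    return math.ceil(registered_users * DAU_RATE * PEAK_CONCURRENCY)
```

For example, `peak_workers(5000)` gives the 1,750 figure used in the tier table below.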
Workers per Instance¶
| Instance Type | vCPU | RAM | $/hr (SGP) | Workers @ 384 MB soft limit | Workers @ 64 CPU units | Conservative | $/worker/hr |
|---|---|---|---|---|---|---|---|
| r6i.large | 2 | 16 GB | $0.152 | 42 | 32 | 30 | $0.00507 |
| r6i.xlarge | 4 | 32 GB | $0.304 | 85 | 64 | 60 | $0.00507 |
| r6i.2xlarge | 8 | 64 GB | $0.608 | 170 | 128 | 120 | $0.00507 |
| r6i.4xlarge | 16 | 128 GB | $1.216 | 340 | 256 | 240 | $0.00507 |
| r6i.8xlarge | 32 | 256 GB | $2.432 | 680 | 512 | 480 | $0.00507 |
Conservative Targets
The "conservative" column accounts for OS overhead (~512 MB), ECS agent memory, and headroom for memory spikes during broker API calls. Production targets are set 25–30% below the memory-bound maximum.
Cost per worker is identical across all sizes
r6i pricing is perfectly linear — doubling the instance size doubles the price. The advantage of larger instances is operational: fewer instances to manage, fewer ECS agents, fewer maintenance Lambda iterations, and better headroom for memory spikes. The strategy is to scale up instance sizes as user count grows, keeping the fleet at 15–40 instances regardless of tier.
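A quick check of the linearity claim, using the on-demand prices and conservative worker counts from the table above:

```python
# r6i on-demand pricing (SGP) and conservative worker capacity, from the table above.
R6I = {
    "r6i.large":   (0.152, 30),
    "r6i.xlarge":  (0.304, 60),
    "r6i.2xlarge": (0.608, 120),
    "r6i.4xlarge": (1.216, 240),
    "r6i.8xlarge": (2.432, 480),
}

def cost_per_worker(hourly_usd: float, workers: int) -> float:
    return hourly_usd / workers

# Linear pricing means $/worker/hr is constant across every size.
rates = {t: round(cost_per_worker(p, w), 5) for t, (p, w) in R6I.items()}
```

Every entry in `rates` comes out to the same $0.00507/worker/hr, which is why instance size can be chosen purely on operational grounds.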
Tier Table¶
| Tier | Peak Workers | Worker EC2 | Worker Type | API EC2 | API Type | Lambda Concurrency | ElastiCache | RDS |
|---|---|---|---|---|---|---|---|---|
| 100 | 35 | 2 | r6i.large | 3 | m6i.large | 50 | Valkey Serverless 1 GB | db.t3.large |
| 500 | 175 | 6 | r6i.large | 5 | m6i.large | 50 | Valkey Serverless 1 GB | db.t3.large |
| 1K | 350 | 12 | r6i.large | 10 | m6i.large | 100 | Valkey 5 GB / 10K ECPU | db.r6g.large Multi-AZ |
| 5K | 1,750 | 15 | r6i.2xlarge | 20 | m6i.xlarge | 200 | Valkey 5 GB / 50K ECPU | db.r6g.large + read replica |
| 10K | 3,500 | 15 | r6i.4xlarge | 30 | m6i.xlarge | 500 | Valkey 10 GB / 100K ECPU | db.r6g.xlarge + read replica |
| 50K | 17,500 | 37 | r6i.8xlarge | 50 | m6i.2xlarge | 500 | Valkey cluster 3 shards | db.r6g.2xlarge + 2 replicas |
| 100K | 35,000 | 73 | r6i.8xlarge | 50 | m6i.2xlarge | 500 | Valkey cluster 6 shards | db.r6g.4xlarge + 2 replicas |
Reading the Table
At the 5K tier: 5,000 users × 0.35 = 1,750 peak workers. Rather than 59 × r6i.large, we scale up to r6i.2xlarge (120 workers each) → ceil(1750/120) = 15 instances. This keeps the fleet small and manageable. Cost per worker is identical ($0.00507/hr) — the savings come from fewer ECS agents and simpler operations. API scales to 20 m6i.xlarge. Lambda concurrency increases to 200. Valkey ECPU lifts to 50K. RDS adds a read replica.
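The worked example above generalizes to any tier. A minimal sketch of the sizing arithmetic:

```python
import math

def instances_needed(registered_users: int, workers_per_instance: int) -> int:
    """Instances for a tier: peak workers (0.35 x N) divided by the
    conservative per-instance capacity, rounded up."""
    peak = math.ceil(registered_users * 0.35)
    return math.ceil(peak / workers_per_instance)

# 5K tier on r6i.2xlarge (120 conservative workers each):
# ceil(1750 / 120) = 15 instances, matching the tier table.
```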
What's Done vs What's Next¶
Deployed Today¶
Everything needed for the first 500 users is production-ready:
| Component | Status | Details |
|---|---|---|
| ECS EC2 Capacity Providers | :white_check_mark: Deployed | Dual ASG (API + Worker), auto-scaling policies |
| Lambda Orchestrator | :white_check_mark: Deployed | 5 functions, SQS + EventBridge triggers |
| SQS FIFO Queues | :white_check_mark: Deployed | worker-control, order-tasks, pool-claim + DLQs |
| Pool Pre-Warming | :white_check_mark: Deployed | Pool manager Lambda, claim flow, ~897ms readiness |
| ElastiCache Valkey Serverless | :white_check_mark: Deployed | Auto-scaling storage and ECPU |
| RDS Aurora + RDS Proxy | :white_check_mark: Deployed | Connection pooling, single instance |
| ALB + WAF | :white_check_mark: Deployed | TLS, rate limiting, managed rule groups |
| Terraform IaC | :white_check_mark: Deployed | ~80 resources, reproducible environments |
| CI/CD Pipelines | :white_check_mark: Deployed | GitHub Actions → ECR → ECS rolling deploy |
| CloudWatch Monitoring | :white_check_mark: Deployed | Container Insights, custom metrics, alarms, dashboard |
Future Scaling Work¶
| Change | Trigger | Effort | Impact |
|---|---|---|---|
| Multi-AZ RDS | 1K users | terraform.tfvars change | HA for database layer |
| Read replicas | 5K users | Add replica config to Terraform | Offload read queries |
| Valkey ECPU scaling | 10K users | terraform.tfvars change | Handle heartbeat volume |
| Graviton instances (r7g, m7g) | 50K users | AMI + instance type change | 20% cost reduction |
| Provisioned Valkey cluster | 50K users | Migrate from serverless to provisioned | Predictable pricing at high volume |
| Reserved Instances | 5K users | AWS console / Terraform | 30–40% savings on steady-state compute |
| SQS for all background tasks | 10K users | Application code + Terraform | Remove background tasks from API hot path |
| Redis sorted set for worker tracking | 10K users | Application code change | O(log N) worker lookup vs O(N) key scan |
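Several of the rows above are literally a terraform.tfvars edit. A hypothetical sketch of what the 5K-tier values could look like (variable names are illustrative, not the repository's actual schema):

```hcl
# Hypothetical 5K-tier values -- variable names are illustrative only.
worker_instance_type   = "r6i.2xlarge"
worker_asg_max_size    = 15
api_instance_type      = "m6i.xlarge"
api_asg_max_size       = 20
lambda_max_concurrency = 200
valkey_max_ecpu        = 50000
rds_instance_class     = "db.r6g.large"
rds_read_replica_count = 1
rds_multi_az           = true
```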
Architectural Changes by Tier¶
| Change | 100 | 500 | 1K | 5K | 10K | 50K | 100K |
|---|---|---|---|---|---|---|---|
| Single-AZ RDS | :white_check_mark: | :white_check_mark: | — | — | — | — | — |
| Multi-AZ RDS | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| RDS Read Replicas | — | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Rate Limiting Middleware | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| SQS Background Tasks | — | — | — | Partial | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Reserved Instances | — | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Graviton Instances | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Shield Advanced | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Cross-Region Backup | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Provisioned Valkey Cluster | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Redis Sorted Set Tracking | — | — | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Scaling Bottlenecks and Mitigations¶
Known Bottlenecks
These are the components that will feel pressure first as user count grows.
| Bottleneck | Threshold | Symptom | Mitigation |
|---|---|---|---|
| Redis key scan | 5K workers | Maintenance Lambda timeout increases | Switch to Redis sorted set for O(log N) lookups |
| ECS API rate limits | 10K tasks | RunTask throttling (10 TPS default) | Request limit increase from AWS, batch operations |
| RDS connections | 500 tasks | Connection pool exhaustion | RDS Proxy (already deployed), increase proxy pool |
| Lambda concurrency | 10K users | SQS queue depth increases | Request concurrency limit increase from AWS |
| ALB connection count | 50K concurrent | 5xx errors from ALB | ALB adds nodes automatically; increase idle timeout |
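The sorted-set mitigation in the first row can be illustrated with a pure-Python stand-in. The real change would use Valkey sorted-set commands (ZADD on heartbeat, ZRANGEBYSCORE to find stale workers); the class and method names here are illustrative only:

```python
import bisect

class WorkerHeartbeats:
    """In-memory stand-in for a Redis/Valkey sorted set scored by
    last-heartbeat timestamp: record() ~ ZADD, stale() ~ ZRANGEBYSCORE."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}   # worker_id -> last heartbeat ts
        self._sorted: list[tuple[float, str]] = []  # sorted (ts, worker_id)

    def record(self, worker_id: str, ts: float) -> None:
        old = self._scores.get(worker_id)
        if old is not None:
            # Drop the previous entry, found by binary search.
            i = bisect.bisect_left(self._sorted, (old, worker_id))
            del self._sorted[i]
        self._scores[worker_id] = ts
        bisect.insort(self._sorted, (ts, worker_id))

    def stale(self, cutoff: float) -> list[str]:
        # One O(log N) bisect finds every worker with ts < cutoff --
        # no full key scan across the whole fleet.
        i = bisect.bisect_left(self._sorted, (cutoff, ""))
        return [worker_id for _, worker_id in self._sorted[:i]]
```

The maintenance Lambda would then query only the stale range instead of scanning every worker key, which is why the lookup stays fast at 5K+ workers.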
Each bottleneck has a known mitigation that requires no architectural change — only configuration or an AWS support ticket.
The Scaling Philosophy
Optimize for the current tier, plan for the next tier, and ensure nothing blocks the tier after that. Over-engineering for 100K when you have 100 users wastes money and adds complexity. Under-engineering means a rewrite when growth comes.