
Scaling to 100K Users

Every scaling tier is a terraform.tfvars change. Auto-scaling does the rest.

The platform was designed so that no component requires replacement to reach 100K users. Instance types get larger, replica counts increase, concurrency limits go up — but the architecture, the code, and the deployment pipeline remain the same. This page documents exactly what changes at each order of magnitude.


Scaling Math

The empirical model for peak concurrent workers:

Peak Workers = Total Users × 0.50 (DAU rate) × 0.70 (peak concurrency) = 0.35 × N
| Variable | Value | Rationale |
| --- | --- | --- |
| DAU Rate | 50% | Half of registered users are active on any given trading day |
| Peak Concurrency | 70% | Of daily actives, 70% are online simultaneously during market hours (09:00–10:30 peak) |
| Peak Workers | 0.35 × N | Each concurrent user requires one dedicated worker |
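The formula can be checked with a short sketch; the 0.50 and 0.70 defaults are the DAU-rate and peak-concurrency assumptions from the table above:

```python
import math

def peak_workers(total_users: int, dau_rate: float = 0.50,
                 peak_concurrency: float = 0.70) -> int:
    """Peak concurrent workers: one dedicated worker per concurrent user."""
    return math.ceil(total_users * dau_rate * peak_concurrency)

print(peak_workers(5_000))    # 1750 peak workers at the 5K tier
print(peak_workers(100_000))  # 35000 at the 100K tier
```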

Workers per Instance

| Instance Type | vCPU | RAM | $/hr (SGP) | Workers @ 384 MB soft limit | Workers @ 64 CPU units | Conservative | $/worker/hr |
| --- | --- | --- | --- | --- | --- | --- | --- |
| r6i.large | 2 | 16 GB | $0.152 | 42 | 32 | 30 | $0.00507 |
| r6i.xlarge | 4 | 32 GB | $0.304 | 85 | 64 | 60 | $0.00507 |
| r6i.2xlarge | 8 | 64 GB | $0.608 | 170 | 128 | 120 | $0.00507 |
| r6i.4xlarge | 16 | 128 GB | $1.216 | 340 | 256 | 240 | $0.00507 |
| r6i.8xlarge | 32 | 256 GB | $2.432 | 680 | 512 | 480 | $0.00507 |

Conservative Targets

The "conservative" column accounts for OS overhead (~512 MB), ECS agent memory, and headroom for memory spikes during broker API calls. Production targets are set 25–30% below the theoretical maximum.

Cost per worker is identical across all sizes

r6i pricing is perfectly linear — doubling the instance size doubles the price. The advantage of larger instances is operational: fewer instances to manage, fewer ECS agents, fewer maintenance Lambda iterations, and better headroom for memory spikes. The strategy is to scale up instance sizes as user count grows, keeping the fleet at 15–40 instances regardless of tier.
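The linear-pricing claim is easy to verify from the table's own numbers (prices and conservative worker counts copied from above):

```python
# (hourly on-demand price, conservative worker count) from the table above
FLEET = {
    "r6i.large":   (0.152, 30),
    "r6i.xlarge":  (0.304, 60),
    "r6i.2xlarge": (0.608, 120),
    "r6i.4xlarge": (1.216, 240),
    "r6i.8xlarge": (2.432, 480),
}

for name, (price, workers) in FLEET.items():
    # Price doubles as worker capacity doubles, so $/worker/hr is constant
    print(f"{name}: ${price / workers:.5f}/worker/hr")
```

Every size prints $0.00507/worker/hr, which is why the tier table is free to pick instance size for operational convenience rather than cost.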


Tier Table

| Tier | Peak Workers | Worker EC2 | Worker Type | API EC2 | API Type | Lambda Concurrency | ElastiCache | RDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 35 | 2 | r6i.large | 3 | m6i.large | 50 | Valkey Serverless 1 GB | db.t3.large |
| 500 | 175 | 6 | r6i.large | 5 | m6i.large | 50 | Valkey Serverless 1 GB | db.t3.large |
| 1K | 350 | 12 | r6i.large | 10 | m6i.large | 100 | Valkey 5 GB / 10K ECPU | db.r6g.large Multi-AZ |
| 5K | 1,750 | 15 | r6i.2xlarge | 20 | m6i.xlarge | 200 | Valkey 5 GB / 50K ECPU | db.r6g.large + read replica |
| 10K | 3,500 | 15 | r6i.4xlarge | 30 | m6i.xlarge | 500 | Valkey 10 GB / 100K ECPU | db.r6g.xlarge + read replica |
| 50K | 17,500 | 37 | r6i.8xlarge | 50 | m6i.2xlarge | 500 | Valkey cluster, 3 shards | db.r6g.2xlarge + 2 replicas |
| 100K | 35,000 | 73 | r6i.8xlarge | 50 | m6i.2xlarge | 500 | Valkey cluster, 6 shards | db.r6g.4xlarge + 2 replicas |

Reading the Table

At the 5K tier: 5,000 users × 0.35 = 1,750 peak workers. Rather than 59 × r6i.large (30 workers each), we scale up to r6i.2xlarge (120 workers each): ceil(1750 / 120) = 15 instances, which keeps the fleet small and manageable. Cost per worker is identical ($0.00507/hr); the savings come from fewer ECS agents and simpler operations. The API tier scales to 20 m6i.xlarge, Lambda concurrency increases to 200, Valkey ECPU rises to 50K, and RDS adds a read replica.
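The same walkthrough generalizes to any tier. A minimal sketch, using the conservative worker counts from the table above:

```python
import math

# Conservative workers per instance, from the "Workers per Instance" table
CONSERVATIVE_WORKERS = {
    "r6i.large": 30,
    "r6i.2xlarge": 120,
    "r6i.4xlarge": 240,
    "r6i.8xlarge": 480,
}

def fleet_size(total_users: int, instance_type: str) -> int:
    """Instances needed to host a tier's peak workers (0.35 x N)."""
    peak = math.ceil(total_users * 0.35)
    return math.ceil(peak / CONSERVATIVE_WORKERS[instance_type])

print(fleet_size(5_000, "r6i.large"))    # 59 -- too many small instances
print(fleet_size(5_000, "r6i.2xlarge"))  # 15 -- matches the tier table
```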


What's Done vs What's Next

Deployed Today

Everything needed for the first 500 users is production-ready:

| Component | Status | Details |
| --- | --- | --- |
| ECS EC2 Capacity Providers | :white_check_mark: Deployed | Dual ASG (API + Worker), auto-scaling policies |
| Lambda Orchestrator | :white_check_mark: Deployed | 5 functions, SQS + EventBridge triggers |
| SQS FIFO Queues | :white_check_mark: Deployed | worker-control, order-tasks, pool-claim + DLQs |
| Pool Pre-Warming | :white_check_mark: Deployed | Pool manager Lambda, claim flow, ~897 ms readiness |
| ElastiCache Valkey Serverless | :white_check_mark: Deployed | Auto-scaling storage and ECPU |
| RDS Aurora + RDS Proxy | :white_check_mark: Deployed | Connection pooling, single instance |
| ALB + WAF | :white_check_mark: Deployed | TLS, rate limiting, managed rule groups |
| Terraform IaC | :white_check_mark: Deployed | ~80 resources, reproducible environments |
| CI/CD Pipelines | :white_check_mark: Deployed | GitHub Actions → ECR → ECS rolling deploy |
| CloudWatch Monitoring | :white_check_mark: Deployed | Container Insights, custom metrics, alarms, dashboard |

Future Scaling Work

| Change | Trigger | Effort | Impact |
| --- | --- | --- | --- |
| Multi-AZ RDS | 1K users | terraform.tfvars change | HA for database layer |
| Read replicas | 5K users | Add replica config to Terraform | Offload read queries |
| Reserved Instances | 5K users | AWS console / Terraform | 30–40% savings on steady-state compute |
| Valkey ECPU scaling | 10K users | terraform.tfvars change | Handle heartbeat volume |
| SQS for all background tasks | 10K users | Application code + Terraform | Remove background tasks from API hot path |
| Redis sorted set for worker tracking | 10K users | Application code change | O(log N) worker lookup vs O(N) key scan |
| Graviton instances (r7g, m7g) | 50K users | AMI + instance type change | 20% cost reduction |
| Provisioned Valkey cluster | 50K users | Migrate from serverless to provisioned | Predictable pricing at high volume |

Architectural Changes by Tier

| Change | 100 | 500 | 1K | 5K | 10K | 50K | 100K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Single-AZ RDS | :white_check_mark: | :white_check_mark: | | | | | |
| Multi-AZ RDS | | | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| RDS Read Replicas | | | | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Rate Limiting Middleware | | | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| SQS Background Tasks | | | | Partial | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Reserved Instances | | | | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Graviton Instances | | | | | | :white_check_mark: | :white_check_mark: |
| Shield Advanced | | | | | | :white_check_mark: | :white_check_mark: |
| Cross-Region Backup | | | | | | :white_check_mark: | :white_check_mark: |
| Provisioned Valkey Cluster | | | | | | :white_check_mark: | :white_check_mark: |
| Redis Sorted Set Tracking | | | | | :white_check_mark: | :white_check_mark: | :white_check_mark: |

Scaling Bottlenecks and Mitigations

Known Bottlenecks

These are the components that will feel pressure first as user count grows.

| Bottleneck | Threshold | Symptom | Mitigation |
| --- | --- | --- | --- |
| Redis key scan | 5K workers | Maintenance Lambda timeout increases | Switch to Redis sorted set for O(log N) lookups |
| ECS API rate limits | 10K tasks | RunTask throttling (10 TPS default) | Request limit increase from AWS; batch operations |
| RDS connections | 500 tasks | Connection pool exhaustion | RDS Proxy (already deployed); increase proxy pool size |
| Lambda concurrency | 10K users | SQS queue depth increases | Request concurrency limit increase from AWS |
| ALB connection count | 50K concurrent | 5xx errors from ALB | ALB scales out automatically; increase idle timeout |

Each bottleneck has a known mitigation that requires no architectural change, only a configuration change or an AWS support ticket.
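The sorted-set mitigation for the Redis key-scan bottleneck can be illustrated in memory. This is a hypothetical sketch, not the platform's code: in production the same idea maps onto a Redis sorted set (ZADD on each heartbeat, ZRANGEBYSCORE to find stale workers) instead of SCANning every worker key.

```python
import bisect

class WorkerHeartbeats:
    """Track worker heartbeats so stale workers are found without a full key scan.

    In-memory stand-in for a Redis sorted set keyed by last-heartbeat time
    (class and method names are illustrative only).
    """

    def __init__(self) -> None:
        self._last = {}    # worker_id -> last heartbeat timestamp
        self._by_ts = []   # (timestamp, worker_id) pairs, kept sorted

    def heartbeat(self, worker_id: str, ts: float) -> None:
        old = self._last.get(worker_id)
        if old is not None:
            # O(log N) locate + remove the old entry (ZADD does this internally)
            self._by_ts.pop(bisect.bisect_left(self._by_ts, (old, worker_id)))
        self._last[worker_id] = ts
        bisect.insort(self._by_ts, (ts, worker_id))

    def stale(self, cutoff: float) -> list[str]:
        # Workers whose last heartbeat precedes cutoff: O(log N + M), no key scan
        return [wid for _, wid in self._by_ts[: bisect.bisect_left(self._by_ts, (cutoff,))]]

hb = WorkerHeartbeats()
hb.heartbeat("w1", 100.0)
hb.heartbeat("w2", 100.0)
hb.heartbeat("w1", 160.0)      # w1 keeps beating; w2 goes quiet
print(hb.stale(cutoff=130.0))  # ['w2']
```

The maintenance Lambda then only touches the M stale workers it finds, rather than iterating every worker key on every run.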

The Scaling Philosophy

Optimize for the current tier, plan for the next tier, and ensure nothing blocks the tier after that. Over-engineering for 100K when you have 100 users wastes money and adds complexity. Under-engineering means a rewrite when growth comes.