Scaling to 100K Users¶
Every scaling tier is a terraform.tfvars change. Auto-scaling does the rest.
The platform was designed so that no component requires replacement to reach 100K users. Instance types get larger, replica counts increase, concurrency limits go up — but the architecture, the code, and the deployment pipeline remain the same. This page documents exactly what changes at each order of magnitude.
Scaling Math¶
The empirical model for peak concurrent workers:
| Variable | Value | Rationale |
|---|---|---|
| DAU Rate | 50% | Half of registered users are active on any given trading day |
| Peak Concurrency | 70% | Of daily actives, 70% are online simultaneously during market hours (09:00–10:30 peak) |
| Peak Workers | 0.35 × N | 0.50 DAU × 0.70 concurrency; each concurrent user requires one dedicated worker |
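The model above reduces to a one-line formula. A minimal sketch:

```python
import math

# Scaling model from the table above:
#   50% of registered users are active on a given trading day (DAU rate)
#   70% of daily actives are online simultaneously during the peak window
DAU_RATE = 0.50
PEAK_CONCURRENCY = 0.70

def peak_workers(registered_users: int) -> int:
    """Peak concurrent workers = 0.35 x N (one dedicated worker per concurrent user)."""
    return math.ceil(registered_users * DAU_RATE * PEAK_CONCURRENCY)
```

For example, `peak_workers(5000)` gives the 1,750 figure used in the tier table below.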
Workers per Instance¶
| Instance Type | vCPU | RAM | $/hr (SGP) | Workers @ 384 MB soft limit | Workers @ 64 CPU units | Conservative | $/worker/hr |
|---|---|---|---|---|---|---|---|
| r6i.large | 2 | 16 GB | $0.152 | 42 | 32 | 30 | $0.00507 |
| r6i.xlarge | 4 | 32 GB | $0.304 | 85 | 64 | 60 | $0.00507 |
| r6i.2xlarge | 8 | 64 GB | $0.608 | 170 | 128 | 120 | $0.00507 |
| r6i.4xlarge | 16 | 128 GB | $1.216 | 340 | 256 | 240 | $0.00507 |
| r6i.8xlarge | 32 | 256 GB | $2.432 | 680 | 512 | 480 | $0.00507 |
Conservative Targets
The "conservative" column accounts for OS overhead (~512 MB), ECS agent memory, and headroom for memory spikes during broker API calls. Production targets are set 25–30% below the memory-bound maximum.
Cost per worker is identical across all sizes
r6i pricing is perfectly linear — doubling the instance size doubles the price. The advantage of larger instances is operational: fewer instances to manage, fewer ECS agents, fewer maintenance Lambda iterations, and better headroom for memory spikes. The strategy is to scale up instance sizes as user count grows, keeping the fleet at 15–40 instances regardless of tier.
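A quick check of the linearity claim, using the on-demand prices and conservative worker counts from the table above:

```python
# r6i on-demand pricing (SGP) and conservative worker capacity, from the table above.
R6I = {
    "r6i.large":   (0.152, 30),
    "r6i.xlarge":  (0.304, 60),
    "r6i.2xlarge": (0.608, 120),
    "r6i.4xlarge": (1.216, 240),
    "r6i.8xlarge": (2.432, 480),
}

def cost_per_worker(hourly_usd: float, workers: int) -> float:
    return hourly_usd / workers

# Linear pricing means $/worker/hr is constant across every size.
rates = {t: round(cost_per_worker(p, w), 5) for t, (p, w) in R6I.items()}
```

Every entry in `rates` comes out to the same $0.00507/worker/hr, which is why instance size can be chosen purely on operational grounds.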
Tier Table¶
| Tier | Peak Workers | Worker EC2 | Worker Type | API EC2 | API Type | Lambda Concurrency | ElastiCache | RDS |
|---|---|---|---|---|---|---|---|---|
| 100 | 35 | 2 | r6i.large | 3 | m6i.large | 50 | Valkey Serverless 1 GB | db.t3.large |
| 500 | 175 | 6 | r6i.large | 5 | m6i.large | 50 | Valkey Serverless 1 GB | db.t3.large |
| 1K | 350 | 12 | r6i.large | 10 | m6i.large | 100 | Valkey 5 GB / 10K ECPU | db.r6g.large Multi-AZ |
| 5K | 1,750 | 15 | r6i.2xlarge | 20 | m6i.xlarge | 200 | Valkey 5 GB / 50K ECPU | db.r6g.large + read replica |
| 10K | 3,500 | 15 | r6i.4xlarge | 30 | m6i.xlarge | 500 | Valkey 10 GB / 100K ECPU | db.r6g.xlarge + read replica |
| 50K | 17,500 | 37 | r6i.8xlarge | 50 | m6i.2xlarge | 500 | Valkey cluster 3 shards | db.r6g.2xlarge + 2 replicas |
| 100K | 35,000 | 73 | r6i.8xlarge | 50 | m6i.2xlarge | 500 | Valkey cluster 6 shards | db.r6g.4xlarge + 2 replicas |
Reading the Table
At the 5K tier: 5,000 users × 0.35 = 1,750 peak workers. Rather than 59 × r6i.large, we scale up to r6i.2xlarge (120 workers each) → ceil(1750/120) = 15 instances. This keeps the fleet small and manageable. Cost per worker is identical ($0.00507/hr) — the savings come from fewer ECS agents and simpler operations. API scales to 20 m6i.xlarge. Lambda concurrency increases to 200. Valkey ECPU lifts to 50K. RDS adds a read replica.
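The worked example above generalizes to any tier. A minimal sketch of the sizing arithmetic:

```python
import math

def instances_needed(registered_users: int, workers_per_instance: int) -> int:
    """Instances for a tier: peak workers (0.35 x N) divided by the
    conservative per-instance capacity, rounded up."""
    peak = math.ceil(registered_users * 0.35)
    return math.ceil(peak / workers_per_instance)

# 5K tier on r6i.2xlarge (120 conservative workers each):
# ceil(1750 / 120) = 15 instances, matching the tier table.
```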
What's Done vs What's Next¶
Deployed Today¶
Everything needed for the first 500 users is production-ready:
| Component | Status | Details |
|---|---|---|
| ECS EC2 Capacity Providers | :white_check_mark: Deployed | Dual ASG (API + Worker), auto-scaling policies |
| Lambda Orchestrator | :white_check_mark: Deployed | 5 functions, SQS + EventBridge triggers |
| SQS FIFO Queues | :white_check_mark: Deployed | worker-control, order-tasks, pool-claim + DLQs |
| Pool Pre-Warming | :white_check_mark: Deployed | Pool manager Lambda, claim flow, ~897ms readiness |
| ElastiCache Valkey Serverless | :white_check_mark: Deployed | Auto-scaling storage and ECPU |
| RDS Aurora + RDS Proxy | :white_check_mark: Deployed | Connection pooling, single instance |
| ALB + WAF | :white_check_mark: Deployed | TLS, rate limiting, managed rule groups |
| Terraform IaC | :white_check_mark: Deployed | ~80 resources, reproducible environments |
| CI/CD Pipelines | :white_check_mark: Deployed | GitHub Actions → ECR → ECS rolling deploy |
| CloudWatch Monitoring | :white_check_mark: Deployed | Container Insights, custom metrics, alarms, dashboard |
Future Scaling Work¶
| Change | Trigger | Effort | Impact |
|---|---|---|---|
| Multi-AZ RDS | 1K users | terraform.tfvars change | HA for database layer |
| Read replicas | 5K users | Add replica config to Terraform | Offload read queries |
| Valkey ECPU scaling | 10K users | terraform.tfvars change | Handle heartbeat volume |
| Graviton instances (r7g, m7g) | 50K users | AMI + instance type change | 20% cost reduction |
| Provisioned Valkey cluster | 50K users | Migrate from serverless to provisioned | Predictable pricing at high volume |
| Reserved Instances | 5K users | AWS console / Terraform | 30–40% savings on steady-state compute |
| SQS for all background tasks | 10K users | Application code + Terraform | Remove background tasks from API hot path |
| Redis sorted set for worker tracking | 10K users | Application code change | O(log N) worker lookup vs O(N) key scan |
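Several of the rows above are literally a terraform.tfvars edit. A hypothetical sketch of what the 5K-tier values could look like (variable names are illustrative, not the repository's actual schema):

```hcl
# Hypothetical 5K-tier values -- variable names are illustrative only.
worker_instance_type   = "r6i.2xlarge"
worker_asg_max_size    = 15
api_instance_type      = "m6i.xlarge"
api_asg_max_size       = 20
lambda_max_concurrency = 200
valkey_max_ecpu        = 50000
rds_instance_class     = "db.r6g.large"
rds_read_replica_count = 1
rds_multi_az           = true
```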
Architectural Changes by Tier¶
| Change | 100 | 500 | 1K | 5K | 10K | 50K | 100K |
|---|---|---|---|---|---|---|---|
| Single-AZ RDS | :white_check_mark: | :white_check_mark: | — | — | — | — | — |
| Multi-AZ RDS | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| RDS Read Replicas | — | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Rate Limiting Middleware | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| SQS Background Tasks | — | — | — | Partial | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Reserved Instances | — | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: |
| Graviton Instances | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Shield Advanced | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Cross-Region Backup | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Provisioned Valkey Cluster | — | — | — | — | — | :white_check_mark: | :white_check_mark: |
| Redis Sorted Set Tracking | — | — | — | — | :white_check_mark: | :white_check_mark: | :white_check_mark: |
Scaling Bottlenecks and Mitigations¶
Known Bottlenecks
These are the components that will feel pressure first as user count grows.
| Bottleneck | Threshold | Symptom | Mitigation |
|---|---|---|---|
| Redis key scan | 5K workers | Maintenance Lambda timeout increases | Switch to Redis sorted set for O(log N) lookups |
| ECS API rate limits | 10K tasks | RunTask throttling (10 TPS default) | Request limit increase from AWS, batch operations |
| RDS connections | 500 tasks | Connection pool exhaustion | RDS Proxy (already deployed), increase proxy pool |
| Lambda concurrency | 10K users | SQS queue depth increases | Request concurrency limit increase from AWS |
| ALB connection count | 50K concurrent | 5xx errors from ALB | ALB adds nodes automatically; increase idle timeout |
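The sorted-set mitigation in the first row can be illustrated with a pure-Python stand-in. The real change would use Valkey sorted-set commands (ZADD on heartbeat, ZRANGEBYSCORE to find stale workers); the class and method names here are illustrative only:

```python
import bisect

class WorkerHeartbeats:
    """In-memory stand-in for a Redis/Valkey sorted set scored by
    last-heartbeat timestamp: record() ~ ZADD, stale() ~ ZRANGEBYSCORE."""

    def __init__(self) -> None:
        self._scores: dict[str, float] = {}   # worker_id -> last heartbeat ts
        self._sorted: list[tuple[float, str]] = []  # sorted (ts, worker_id)

    def record(self, worker_id: str, ts: float) -> None:
        old = self._scores.get(worker_id)
        if old is not None:
            # Drop the previous entry, found by binary search.
            i = bisect.bisect_left(self._sorted, (old, worker_id))
            del self._sorted[i]
        self._scores[worker_id] = ts
        bisect.insort(self._sorted, (ts, worker_id))

    def stale(self, cutoff: float) -> list[str]:
        # One O(log N) bisect finds every worker with ts < cutoff --
        # no full key scan across the whole fleet.
        i = bisect.bisect_left(self._sorted, (cutoff, ""))
        return [worker_id for _, worker_id in self._sorted[:i]]
```

The maintenance Lambda would then query only the stale range instead of scanning every worker key, which is why the lookup stays fast at 5K+ workers.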
Each bottleneck has a known mitigation that requires no architectural change — only configuration or an AWS support ticket.
The Scaling Philosophy
Optimize for the current tier, plan for the next tier, and ensure nothing blocks the tier after that. Over-engineering for 100K when you have 100 users wastes money and adds complexity. Under-engineering means a rewrite when growth comes.