Solving the stateful scaling crisis: High availability for session-based apps
I recently worked on a project where a large platform faced a familiar migration headache: moving a legacy monolithic application to AWS. The app relied heavily on server-side sessions stored in local memory to track login state and multi-step transaction wizards. This worked fine on a single server, but Monday-morning traffic spikes would push the CPU to its limit.
The obvious fix was to add an Auto Scaling Group. However, as soon as we did, users started reporting random logouts. Their session lived on Instance A. Their next request hit Instance B. Instance B had no idea who they were. This is the classic "Stateful Scaling" crisis.
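The failure mode above can be reduced to a toy sketch: each instance keeps its own session dictionary, so a follow-up request routed to a different node finds nothing. The instance and session names here are illustrative only.

```python
# Toy illustration: each app instance keeps sessions in its own local memory.
instance_a_sessions = {}
instance_b_sessions = {}

def login(sessions, user):
    """Simulate a login handled by one instance: state lives only on that node."""
    sessions["session-123"] = {"user": user, "step": 1}

def handle_request(sessions, session_id):
    """Simulate a follow-up request routed to a (possibly different) instance."""
    return sessions.get(session_id)  # None if this node never saw the session

login(instance_a_sessions, "alice")
print(handle_request(instance_a_sessions, "session-123"))  # session found
print(handle_request(instance_b_sessions, "session-123"))  # None: random logout
```

The second lookup returning `None` is exactly the "random logout" users reported once the Auto Scaling Group spread requests across nodes.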
Why sticky sessions are a trap
The first temptation is usually to enable [Application Load Balancer (ALB) stickiness, also known as session affinity](https://docs.aws.amazon.com/prescriptive-guidance/latest/load-balancer-stickiness/welcome.html). It seems like an easy win: you "pin" the user to a specific instance for the duration of their session. This solves the immediate logout problem, but it creates three significant issues.
- First, it breaks true elasticity. If one instance becomes overloaded, the load balancer continues to send "stuck" users to it. It cannot shift that load to a fresh instance.
- Second, it ruins fault tolerance. If Instance A fails, every user pinned to it loses their progress. In a multi-step financial wizard, that is a terrible user experience.
- Finally, it makes rolling deployments stressful. You have to wait for sessions to drain before you can safely shut down an old node.
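The fault-tolerance problem in particular is easy to demonstrate. The sketch below stands in for an LB cookie with a deterministic hash that pins each user to one node; when that node dies, every session pinned to it is gone. All names are illustrative.

```python
# Toy sketch: sticky routing pins each user to one instance (stand-in for an
# ALB cookie). Sessions still live only in each node's local memory.
import hashlib

instances = ["i-a", "i-b", "i-c"]
sessions = {inst: {} for inst in instances}  # each node's local session store

def route(user):
    """Deterministically pin a user to one instance, like a sticky cookie."""
    idx = int(hashlib.sha256(user.encode()).hexdigest(), 16) % len(instances)
    return instances[idx]

# Users log in; their wizard progress lives only on their pinned node.
for user in ["alice", "bob", "carol"]:
    sessions[route(user)][user] = {"wizard_step": 3}

# Instance i-a fails: every user pinned to it loses all progress.
failed = "i-a"
lost = list(sessions[failed])
print(f"users who lost progress: {lost}")
```

Note also that `route()` never consults load: an overloaded node keeps receiving its pinned users, which is the elasticity problem from the first bullet.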
The solution: Externalizing state with Redis
To scale properly, we have to decouple state from compute. I moved the session storage from local RAM to a centralized Amazon ElastiCache for Redis cluster. This effectively makes the application nodes stateless.
With this architecture, the ALB can distribute traffic using the least-outstanding-requests algorithm. Any instance in the fleet can pick up any request because they all fetch session data from the same shared cache.
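A minimal session-store sketch of this pattern is below. The class and key format are my own illustration, not a library API; the client is injected so that in production it would be a `redis.Redis(...)` pointed at the ElastiCache endpoint, while the included `FakeRedis` stub lets the sketch run without a live cluster.

```python
import json
import uuid

class RedisSessionStore:
    """Session store backed by any client exposing Redis-style setex/get.

    In production the client would be redis.Redis(host=<elasticache-endpoint>,
    port=6379, ssl=True); any instance in the fleet then shares the same state.
    """

    def __init__(self, client, ttl_seconds=1800):
        self.client = client
        self.ttl = ttl_seconds

    def create(self, data):
        session_id = str(uuid.uuid4())
        # SETEX stores the value with a TTL, so idle sessions expire on their own.
        self.client.setex(f"session:{session_id}", self.ttl, json.dumps(data))
        return session_id

    def load(self, session_id):
        raw = self.client.get(f"session:{session_id}")
        return json.loads(raw) if raw else None

class FakeRedis:
    """Minimal in-memory stand-in for running the sketch without a cluster."""
    def __init__(self):
        self.data = {}
    def setex(self, key, ttl, value):
        self.data[key] = value  # TTL ignored in this stub
    def get(self, key):
        return self.data.get(key)

store = RedisSessionStore(FakeRedis())
sid = store.create({"user": "alice", "wizard_step": 2})
print(store.load(sid))  # {'user': 'alice', 'wizard_step': 2}
```

Because the session ID is the only thing the client carries (in a cookie), any node that receives the next request can call `load()` and resume the wizard exactly where the user left off.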
Architectural best practices
I recommend placing the ElastiCache cluster in a private subnet. It should only be accessible by the application security group via port 6379. For a production environment, you must enable Multi-AZ with Automatic Failover. This ensures that even if a primary cache node or an entire Availability Zone goes down, the standby replica is promoted automatically, typically in under a minute.
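As a sketch of what that looks like in code, the boto3 call below provisions a two-node replication group with Multi-AZ failover. The replication group name, subnet group, and security group ID are hypothetical placeholders; this is a configuration outline, not a drop-in script.

```python
import boto3

# Sketch (assumed names/IDs): a two-node Redis replication group with
# Multi-AZ automatic failover, reachable only from the app security group.
elasticache = boto3.client("elasticache", region_name="us-east-1")

elasticache.create_replication_group(
    ReplicationGroupId="app-sessions",             # hypothetical name
    ReplicationGroupDescription="Shared session store",
    Engine="redis",
    CacheNodeType="cache.t4g.medium",
    NumCacheClusters=2,                            # primary + replica
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,
    CacheSubnetGroupName="private-cache-subnets",  # hypothetical, private subnets
    SecurityGroupIds=["sg-0123456789abcdef0"],     # allows 6379 from app SG only
    TransitEncryptionEnabled=True,
)
```

The security group referenced here should permit inbound 6379 only from the application tier's security group, matching the private-subnet guidance above.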
I also suggest using IAM authentication for Redis (supported on Redis OSS 7.0 and later with in-transit encryption enabled). It is a more secure alternative to static passwords, integrates with your existing IAM policies, and simplifies secret management.
Sizing and cost considerations
When calculating costs, don't just look at the instance price. For session storage, latency is everything. Redis provides the sub-millisecond response times needed to keep the UI snappy.
Using a cache.t4g.medium instance in us-east-1 is usually a great starting point for mid-sized applications. It offers 2 vCPUs and about 3 GiB of RAM. At current rates, this costs roughly $0.052 per hour. For a high-availability setup with two nodes (Primary and Replica), you are looking at approximately $75 per month.
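The monthly figure follows directly from the stated hourly rate, using AWS's usual 730-hours-per-month convention:

```python
# Back-of-envelope cost check for a two-node (primary + replica) setup.
hourly_rate = 0.052      # USD per node-hour, as quoted above for us-east-1
nodes = 2                # primary + Multi-AZ replica
hours_per_month = 730    # AWS's standard monthly-hour convention

monthly_cost = hourly_rate * nodes * hours_per_month
print(f"${monthly_cost:.2f} per month")  # → $75.92 per month
```

Reserved nodes can cut this further if the workload is steady, but even the on-demand price is small relative to the compute fleet it protects.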
I find this cost is easily justified by the reduction in support tickets from frustrated users. It also lets the team sleep better knowing that a single server failure won't turn into a user-facing outage.
A simple heuristic for state
If your application cannot survive the sudden termination of a backend node without affecting user sessions, it is not cloud-ready. Externalizing state is the single most important step in moving from a fragile "pet" server to a resilient "cattle" architecture.
References and resources
- Amazon ElastiCache for Redis Documentation
- Best Practices for Valkey and Redis OSS Clients and Amazon ElastiCache
