Achieving High Availability in AWS
Using Multiple Availability Zones
Security and availability go hand-in-hand. One of the cornerstones of AWS security is architecting for high availability, and to do so you must use multiple Availability Zones. By using multiple Availability Zones, AWS explains, you can “architect applications that automatically fail-over between Availability Zones without interruption. These Availability Zones offer AWS customers an easier and more effective way to design and operate applications and databases, making them more highly available, fault tolerant, and scalable than traditional single data center infrastructures or multi-data center infrastructures.”
There are many methods for achieving high availability in AWS, but in this demo, AWS expert Randy Bartels will walk through two of the most direct ways:
- Using an Application Load Balancer (ALB) to determine which targets are healthy and which should be removed from the pool
- Using Route 53 DNS with a weighted routing policy using health checks
For a visual guide to achieving high availability in AWS, watch the full demo. To learn more about AWS’ fault-tolerant infrastructure, see the AWS documentation.
For our first environment, let’s take a quick look at the actual layout, here. Route 53 will resolve our webserver’s DNS domain name to a load balancer. That load balancer will only be listening on HTTPS. Then, it will route the traffic to either of our webservers, one located in us-east-1a and the other located in us-east-1b. Let’s take a look at the Amazon Web Services configuration for this.
We can see, here, that we have two EC2 instances: Webserver-1 and Webserver-2. One is located in the us-east-1a Availability Zone and the other is in us-east-1b. If we take a look at our load balancer, we see that we have just the port 443 (HTTPS) listener and that is being targeted to our webservers-targetgroup. Taking a look at our target group, we see that we have both of our webservers listed in there. We’re routing the traffic from the load balancer to the webservers on TCP port 80. We can see that we have our “Health Checks,” as well, and each one is just making a simple request for /index.html. If we take a look at our targets, they are both healthy. We should be able to demonstrate this by going to our webpage and refreshing until we see both Webserver-1 and Webserver-2 coming up. Our load balancing is working correctly there and both webservers are in the pool.
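The target group’s health check boils down to a simple HTTP GET for /index.html: a 200 response means healthy, anything else means unhealthy. Here is a minimal sketch of that idea in Python, with a throwaway local webserver standing in for Webserver-1; the function name and settings are illustrative, not the demo’s actual configuration.

```python
# Sketch of a target-group-style health check: GET /index.html,
# healthy if and only if the response is HTTP 200.
import functools
import http.server
import os
import tempfile
import threading
import urllib.error
import urllib.request

def is_healthy(base_url, path="/index.html", timeout=5):
    """Return True if GET <base_url><path> answers 200, else False."""
    try:
        with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as unhealthy.
        return False

# Stand-in webserver: serve a temp directory containing index.html on an ephemeral port.
docroot = tempfile.mkdtemp()
with open(os.path.join(docroot, "index.html"), "w") as f:
    f.write("ok")
handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=docroot)
server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

base = f"http://127.0.0.1:{server.server_address[1]}"
healthy = is_healthy(base)             # index.html exists -> True
missing = is_healthy(base, "/missing")  # 404 -> False
print(healthy, missing)
server.shutdown()
```

The same check works whether it is the load balancer or Route 53 asking: a plain request to a known path, with any failure treated as unhealthy.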
Now, let’s take one of our webservers down. So, here this is Webserver-1. Let’s actually just pull the status first with apache2 status. We see that it is, indeed, running. We knew that. Then, let’s issue a stop command and bring the webserver down. If we come back over here and we refresh enough times, quickly, there we go, we get a 502 Bad Gateway error. That’s because the health check has not yet caught up. If we take a look at our target group, it should take no more than 10 seconds for this to catch up. We refresh here, check our targets, and sure enough, Webserver-1 is now unhealthy. If we refresh here enough times, we’re only going to get Webserver-2. Webserver-1 is no longer in the pool. We’ve demonstrated the loss of one of our data centers, maybe it was taken down for maintenance, or maybe Amazon had one of their problems that they have from time to time, certainly not very often. That demonstrates the load balancer method of high availability.
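What the load balancer does once a target fails its health check is simple: it drops that target from the rotation. A minimal sketch of that behavior, assuming illustrative health-check settings of a 5-second interval and a 2-failure threshold (which would explain the roughly 10-second catch-up seen in the demo):

```python
# Sketch: a load balancer only rotates across targets that are currently healthy.
from itertools import cycle

targets = {"Webserver-1": "healthy", "Webserver-2": "healthy"}

def healthy_pool(targets):
    """Targets eligible to receive traffic."""
    return [name for name, state in targets.items() if state == "healthy"]

# Both servers healthy: refreshing the page alternates between them.
rotation = cycle(healthy_pool(targets))
print([next(rotation) for _ in range(4)])

# apache2 is stopped on Webserver-1; once its health check fails, it leaves the pool.
targets["Webserver-1"] = "unhealthy"
print(healthy_pool(targets))  # only Webserver-2 remains

# Worst-case detection time under the assumed settings:
interval_s, unhealthy_threshold = 5, 2
print(interval_s * unhealthy_threshold)  # 10 seconds
```

In between the failure and the health check catching up, requests routed to the dead target surface as 502 Bad Gateway, exactly as in the demo.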
Let’s move on to the DNS method. Going back to our diagram, in this case, we’ll take the load balancer out of it and we’ll use a weighted routing policy with a health check through Route 53. We will use a different domain name, just to save time and demonstrate the two concepts side-by-side. Again, we still have the same two webservers serving up traffic, this time directly on HTTP based on this Route 53 routing policy. Let’s take a look at this configuration.

Taking a look at Route 53, we have, here, two entries for our high availability domain name. We see that one routes to Webserver-1 and one routes to Webserver-2. They are each routed with a 50% weight, so they have an equal chance of serving up. If we take a look at even just one of these at a time, we see that it is associated with a “Health Check,” and the same is true for the second one. What’s in this “Health Check”? We see that Webserver-2 reports as healthy and Webserver-1 is, actually, on the fritz right now. It has started reporting unhealthy. If we look at these health checks, they make the same request for /index.html, but with a 30-second request interval. You’ve already lost some of the granularity that we had in the load balancer method: it can take up to 90 seconds for the Route 53 health check to catch up to the fact that a webserver is down.

If we take a look at a request made prior to bringing that webserver down, we see that Webserver-1 came up in response. Rather than try to troubleshoot all the DNS things, we know DNS troubleshooting can always be fun. In fact, it’s so fun that there’s even a haiku about it: “It’s not DNS. There’s no way it’s DNS. It was DNS.” We’ll not get caught up in that. Instead, we will bring up another terminal where we made that request while Webserver-1 was down. We can see, after doing a dig, that we only get Webserver-2 back in our response.
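Conceptually, Route 53’s weighted policy with health checks works in two steps: drop any record whose health check is failing, then make a weighted draw among what’s left. A small sketch of that logic, with made-up record names and IPs, and the demo’s 30-second interval combined with an assumed 3-failure threshold to account for the up-to-90-second catch-up:

```python
# Sketch: weighted DNS routing where unhealthy records are excluded
# before the weighted draw.
import random

records = [
    {"name": "Webserver-1", "value": "203.0.113.10", "weight": 50, "healthy": True},
    {"name": "Webserver-2", "value": "203.0.113.20", "weight": 50, "healthy": True},
]

def resolve(records):
    """Pick one healthy record, proportionally to its weight."""
    candidates = [r for r in records if r["healthy"]]
    weights = [r["weight"] for r in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]["name"]

# Both healthy at 50/50: each record answers roughly half the queries.
random.seed(0)
draws = [resolve(records) for _ in range(1000)]
print(draws.count("Webserver-1"), draws.count("Webserver-2"))

# Webserver-1 goes down; once its health check fails, only Webserver-2 answers,
# which is what the dig showed in the demo.
records[0]["healthy"] = False
print({resolve(records) for _ in range(100)})

# Assumed detection math: 30s request interval x 3-failure threshold.
print(30 * 3)  # up to 90 seconds before the record is pulled
```

Compared with the ALB’s roughly 10-second reaction, this 90-second window (plus DNS caching on the client side) is the granularity trade-off mentioned above.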
That demonstrates the Route 53 DNS-based fail-over method and now we’ve demonstrated both of them.