Increasing the stability of your enterprise systems running in the AWS cloud


We at M-Way Solutions are huge fans of Amazon’s EC2 (Elastic Compute Cloud) web services. They provide a very fast way to create and run servers, be it for development or production systems. If you need a new server, you can start one with a few clicks. As a developer, not having to wait for your admin to provision a new server VM will make you very happy. And when you’re done, just terminate it and it stops generating costs. Now that will make your CEO happy!

But the Amazon EC2 infrastructure shows its real power when it comes to clustering. In EC2 terms, the feature is called auto scaling groups. An auto scaling group defines the minimum and maximum number of running instances of your cluster. You can increase and decrease the current number of instances within this range manually, on a schedule, or dynamically (by demand). While manual and schedule-based scaling are quite self-explanatory and straightforward, dynamic scaling by demand is particularly interesting in enterprise scenarios where load is hard to forecast or you simply don’t know exactly when high server usage will occur.

For this case, EC2 offers some predefined metrics, e.g. for CPU utilization. There you can configure how long and above which percentage the CPU utilization may rise before an alarm is triggered. This alarm can then be used inside the auto scaling group to configure a scaling policy. To make it a bit more concrete: we could create an alarm for a CPU utilization of 80% over 5 minutes and use it in a scaling policy that adds another instance.
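With the AWS CLI, such a scale-out setup could look roughly like the following sketch. The group name my-proxy-asg and the alarm name are placeholders, and the PolicyARN returned by the first command has to be plugged into the second:

# Scaling policy: add one instance to the group (returns a PolicyARN)
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-proxy-asg \
    --policy-name scale-out \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 1

# Alarm: average CPU utilization of the group above 80% for 5 minutes
aws cloudwatch put-metric-alarm \
    --alarm-name proxy-cpu-high \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions Name=AutoScalingGroupName,Value=my-proxy-asg \
    --statistic Average \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 80 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions <PolicyARN from the first command>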

Of course you should also have an alarm and a scaling policy for the opposite case: an alarm that fires when CPU utilization falls below a certain value for a specific time, and a scaling policy that terminates one instance once this alarm is triggered. Together with the minimum and maximum number of instances (and of course a load balancer in front of your cluster nodes) you can very easily create a highly scalable enterprise system.
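The scale-in direction is symmetric: a policy with a negative adjustment, triggered by an alarm that uses LessThanThreshold and a lower threshold. A minimal sketch, again with placeholder names:

# Scaling policy: remove one instance; wire it to an alarm like the one above,
# but with --comparison-operator LessThanThreshold and e.g. --threshold 20
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name my-proxy-asg \
    --policy-name scale-in \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment -1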

So far so good. But what if you want your cluster to be used not only for scalability but also for increasing stability? EC2 has something for that, too. You can create an alarm if the health check for an instance fails. These health checks are automatically performed by Amazon and include for example loss of network connectivity or system power.
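These built-in checks are exposed as EC2 status check metrics, so an alarm on them can be created in the same way as the CPU alarm above. A sketch, with the instance ID as a placeholder:

# Alarm on the built-in EC2 status checks for a single instance
aws cloudwatch put-metric-alarm \
    --alarm-name proxy-status-check-failed \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed \
    --dimensions Name=InstanceId,Value=i-12345678 \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold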

But in most cases this is not enough. Often you not only want to check that your server runs fine from an operating system point of view, but also that your server’s main application is working correctly. In our example, we have a server providing some proxy functionality. The core responsibility here is of course to proxy a request to another server.

So what do we have to do to detect an error and start a new instance? Amazon AWS provides a powerful API to get lots of information about our system and to interact with it. The first use of the API is to find our instances. Because our cluster is dynamically scaled up and down depending on the system load, we don’t have any fixed IP addresses that we could simply loop over.

The load balancer can tell us this: the API call describe-load-balancers returns a list of all of our nodes. We can then get specific information for a single node (or instance, as it’s called in AWS terms), including its (private) IP address, with describe-instances. We now have one specific node and its internal IP address. The next thing to do is to check that the server is still able to serve its core responsibility. That’s the part you know best – so this should be easy. In our case, we do a curl through the proxy and check the response.
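In their simplest form, the two calls look like this (a sketch; the load balancer name and instance ID are placeholders):

# Instance IDs of all nodes registered with the load balancer
aws elb describe-load-balancers \
    --load-balancer-names my-proxy-elb \
    --query 'LoadBalancerDescriptions[0].Instances[*].InstanceId' \
    --output text

# Private IP address of one of these instances
aws ec2 describe-instances \
    --instance-ids i-12345678 \
    --query 'Reservations[0].Instances[0].PrivateIpAddress' \
    --output text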

If the test passes, we’re done with that instance and continue with the loop. If not… well, in our example we should at least make one more try, since a single request over the world wide web can always fail; that alone is not really a concern. But what if consecutive tests fail?

We just have to terminate that instance. Since the instance is part of an auto scaling group, the auto scaling group will detect that

a) the current number of instances is below the desired capacity (or the minimum number of instances defined), or

b) the CPU utilization is above the configured threshold (or whatever your alarm looks like).

In either case, this will prompt the auto scaling group to start an additional instance.

Some additional notes:

1. To avoid any concurrency problems, it may be a good idea to check that a server has completely started before the tests inside your script start checking its health. Therefore, in our example all instances with a lifetime of less than 10 minutes are ignored.

2. Another solution could be to not terminate the instances directly but to set the instance’s health status to unhealthy. If the auto scaling group is configured to act on the health status, the termination is then done by the auto scaling group itself. In our example we don’t do that, because we might lose some additional time with that solution and we want our proxy recovered as fast as possible.
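That alternative would look roughly like this (a sketch; the instance ID is a placeholder, and how quickly the replacement happens depends on the auto scaling group’s health check configuration):

# Mark the instance as unhealthy and let the auto scaling group replace it
aws autoscaling set-instance-health \
    --instance-id i-12345678 \
    --health-status Unhealthy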

3. Of course the Amazon AWS API is secured and can’t just be called from anywhere by anyone.
So you have to make sure that your script runs with sufficient privileges.
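One way to do that is to run the script on an instance with an IAM role (or with the credentials of an IAM user) that is allowed to perform exactly the calls used below. A minimal sketch of such an inline policy; the role and policy names are placeholders:

# Attach a minimal inline policy to the role the watchdog script runs under
cat > watchdog-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "elasticloadbalancing:DescribeLoadBalancers",
        "ec2:DescribeInstances",
        "ec2:TerminateInstances"
      ],
      "Resource": "*"
    }
  ]
}
EOF
aws iam put-role-policy \
    --role-name proxy-watchdog-role \
    --policy-name proxy-watchdog \
    --policy-document file://watchdog-policy.json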

Below you’ll find an example bash script for our proxy server that uses the AWS CLI.
Aside from that, there are AWS SDKs for Java, JavaScript, Python, Ruby, etc…


#!/bin/bash -x
# Watchdog: check every instance behind the proxy ELB and terminate instances
# whose proxy functionality is broken; the auto scaling group will then start
# a replacement.

AWS=/usr/local/bin/aws

# Instance IDs of all nodes registered with our load balancer. With
# --output text, the hash key of the query ("id") is printed as the label
# "ID", which is skipped in the case statement below.
for i in $( $AWS elb describe-load-balancers --query 'LoadBalancerDescriptions[?DNSName==`my-proxy-elb..elb.amazonaws.com`].{id:Instances[*].InstanceId}' --output text ); do
    case $i in
        'ID')
            # query label, not an instance ID
            continue
            ;;
        *)
            iid=$i
            # Ignore instances younger than 10 minutes, they may still be booting.
            # LaunchTime is reported as ISO 8601 in UTC.
            ltime=$( date -u +%Y-%m-%dT%H:%M --date '10 minutes ago' )
            iip=$( $AWS ec2 describe-instances --filters "Name=instance-id,Values=${iid}" --query 'Reservations[*].Instances[?LaunchTime<=`'${ltime}'`].[PrivateIpAddress]' --output text )
            if [[ -n $iip ]]; then
                # Core responsibility check: proxy a request through this node.
                # Retry up to three times; a single failed request over the
                # internet is not yet a reason to terminate.
                failed=1
                for x in $( seq 1 3 ); do
                    if http_proxy="http://${iip}:80/" curl -s http://www.google.com >/dev/null 2>&1; then
                        failed=0
                        break
                    fi
                    sleep 3
                done
                if [[ $failed -eq 1 ]]; then
                    logger -n 127.0.0.1 -P 5140 "AWS WATCH: killing instance $iid"
                    $AWS ec2 terminate-instances --instance-ids $iid
                fi
            fi
            ;;
    esac
done
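To run the check continuously, the script can be triggered by cron, for example every five minutes. A hypothetical deployment; path and script name are placeholders:

# Run the watchdog every 5 minutes
echo '*/5 * * * * root /usr/local/bin/check-proxy-instances.sh' > /etc/cron.d/proxy-watchdog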
