Zero downtime updates with AWS


In a previous post we explained how the stability of an enterprise system running in the AWS Cloud can be increased.

Planned downtimes are another important factor affecting the overall availability of your services.

These can have several reasons:

  1. Hardware migration or upgrades (e.g. increasing the server’s memory or switching to a more powerful server)
  2. System migration or upgrades (e.g. changing or upgrading the operating system)
  3. Database updates
  4. Application updates

If you run your service in a non-cloud infrastructure, it might be work-intensive and time-consuming to deal with all these points (if it is even possible to make a zero downtime update at all). But it can be quite easy to achieve zero downtime updates when using Amazon’s EC2 infrastructure. Let me show you how.

Prerequisites
Our sample service used here consists of a cluster of several nodes in an auto scaling group. In front of them, there is a load balancer that distributes the incoming requests to the connected nodes. To make the updates less complicated, the individual instances should follow the most important rule for cloud instances: Don’t persist any kind of state on a single node.

This means:

  1. Use a shared database for your business data
  2. Don’t use files on the local file system
  3. Fire events to all nodes (e.g. using JGroups) instead of changing the data only in the memory of one node

Hardware migration or upgrades
This can be handled by switching the instance type of our nodes to a bigger one (e.g. changing from Amazon’s c3.large to c3.xlarge). Just create a new ‘launch configuration’ using another instance type and assign it to your auto scaling group. Then terminate the running instances one by one, waiting each time until the replacement instance has been initialized completely, and you’re done.

System migration or upgrades
This part depends on how the instances are initialized and managed.

The preferred way is to initialize the nodes during the first boot cycle. This can be done with ‘cloud-init’, which configures the instance during setup, for example by adding your public key so that you can access the instance via ssh. But this can go much further: you can install a server with your latest or a specific software version loaded from Amazon S3 and start it. If you have configured the instances to always use the latest version of the operating system, all you have to do is, again, terminate the running instances one by one, waiting each time until the replacement has been initialized completely.
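The "terminate one by one and wait" procedure can be scripted. The following is a minimal sketch, not a definitive implementation. It assumes `asg_client` behaves like a boto3 Auto Scaling client (`boto3.client("autoscaling")`), and it treats an instance reported as `InService` and `Healthy` as "initialized completely", which only holds if the group's health check really reflects the state of your application. The function names, polling interval and timeout are hypothetical.

```python
import time


def in_service_ids(asg_client, group_name):
    """Return the IDs of all instances the group reports as InService and Healthy."""
    response = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    instances = response["AutoScalingGroups"][0]["Instances"]
    return {
        i["InstanceId"]
        for i in instances
        if i["LifecycleState"] == "InService" and i["HealthStatus"] == "Healthy"
    }


def rolling_replace(asg_client, group_name, poll_seconds=30, max_polls=40,
                    sleep=time.sleep):
    """Terminate the current instances one by one, waiting for each replacement."""
    old_ids = sorted(in_service_ids(asg_client, group_name))
    target = len(old_ids)
    for instance_id in old_ids:
        # Keep the desired capacity so the group launches a replacement
        # from its current launch configuration.
        asg_client.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id, ShouldDecrementDesiredCapacity=False
        )
        for _ in range(max_polls):
            current = in_service_ids(asg_client, group_name)
            # Continue only when the old node is gone and the group is
            # back at full strength.
            if instance_id not in current and len(current) >= target:
                break
            sleep(poll_seconds)
        else:
            raise TimeoutError(f"no healthy replacement for {instance_id}")
    return old_ids
```

With real credentials this would be called as `rolling_replace(boto3.client("autoscaling"), "my-group")`, where `my-group` stands for the name of your auto scaling group.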

If you haven’t configured your instances this way but manage the system yourself (see the famous cloud analogy of ‘pets’ vs. ‘cattle’), zero downtime updates will be more work, but are still possible. The update process is very similar to the steps described below for updating the application. The main difference is that in step 2 you don’t update the application itself but the underlying system.

Database updates
When talking about database updates here, we mean updates of the database schema or its data. Updating the software version of the database itself would be one variant of a system update, which is covered in the previous section.

Please note: database updates are a tricky topic.
For one thing, you have to make sure you don’t lose data when, for example, you mirror your database to a new instance, update this new instance and afterwards switch to it. For another, when changing the schema by adding new columns or removing existing ones, you have to make sure both your old and your new application code can handle the change. One approach would be to use NoSQL databases instead of traditional relational databases.
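To illustrate what "old and new code can both handle the change" can mean for a relational schema, here is a small sketch using Python's built-in `sqlite3` module. It shows a purely additive change: a new nullable column is added while old code keeps naming its columns explicitly. This is one common way to keep both versions working, not something prescribed by the approach described here. The `customer` table, the `email` column and all function names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def old_insert(conn, name):
    # Old application version: names its columns explicitly and
    # knows nothing about the new `email` column.
    conn.execute("INSERT INTO customer (name) VALUES (?)", (name,))


def old_read(conn):
    return conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()


def new_insert(conn, name, email):
    # New application version: writes the additional column.
    conn.execute(
        "INSERT INTO customer (name, email) VALUES (?, ?)", (name, email)
    )


def new_read(conn):
    # New code must tolerate NULL in `email` for rows written by old nodes.
    return conn.execute(
        "SELECT id, name, email FROM customer ORDER BY id"
    ).fetchall()


old_insert(conn, "Alice")

# The schema change itself: additive and nullable, so running old nodes
# are not broken by it.
conn.execute("ALTER TABLE customer ADD COLUMN email TEXT")

# During the rolling update, old and new nodes write side by side.
old_insert(conn, "Bob")
new_insert(conn, "Carol", "carol@example.com")
```

Removing a column is the harder direction: the column can only be dropped safely once no running node reads or writes it any more.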

Application updates
Application updates were usually disliked in the past because of the downtime that had to be coordinated and communicated to all affected customers. And with the move towards continuous deployment, these updates happen even more often nowadays. This increases the need for zero downtime updates for applications.

If the application is running in an auto scaling group, this can be achieved quite simply, but the procedure differs a little depending on how the instances are managed (see part ‘system migration or upgrades’).

First, we describe how the update process works if the instance is managed manually (‘pet’ instance):

  1. Detach node from auto scaling group
    Pick one instance in your auto scaling group and run ‘Actions’ –> ‘Detach’. You may have to tick the checkbox to add a new instance so that the auto scaling group still meets its desired minimum number of instances.
  2. Update the node
    Since the chosen node is now detached from the auto scaling group, AWS’ load balancer will not forward traffic to this node anymore. This means you can now update your application without affecting the running service.
  3. Test the node
    After updating the instance you can run some smoke tests, or more extensive tests, to verify that the update is correct and everything is working as expected. Of course the tests have to be run directly against this node and not against the load balancer, which still serves the non-updated nodes.
  4. Update auto scaling group
    Create a new image of your instance and configure your auto scaling group to use this image from now on.
  5. Terminate the first node
    If you run a cluster with two or more nodes, you can terminate one node so that the auto scaling group detects the missing node, starts a new instance and adds it to the group. Your cluster is now running with a mix of updated and non-updated nodes. If necessary you can now run some more tests (this time against the load balancer).
  6. Terminate remaining nodes
    After you’ve confirmed that your service is running smoothly with one updated node, you can terminate the remaining nodes with the former software version one after another to get your cluster updated completely.
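Steps 1 and 4 of the list above can also be scripted instead of being clicked in the console. The following is a rough sketch under some assumptions: `asg_client` and `ec2_client` are expected to behave like `boto3.client("autoscaling")` and `boto3.client("ec2")`, and the function names as well as the `app-<version>` naming scheme are hypothetical. Note that creating an image reboots the instance by default.

```python
def detach_for_update(asg_client, group_name, instance_id):
    """Step 1: take one node out of the group so it can be updated in peace."""
    asg_client.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        # False keeps the desired capacity, so the group starts a
        # substitute node (the console checkbox mentioned in step 1).
        ShouldDecrementDesiredCapacity=False,
    )


def switch_group_to_updated_image(asg_client, ec2_client, group_name,
                                  instance_id, version, instance_type):
    """Step 4: create an image of the updated node and let the group use it."""
    name = f"app-{version}"  # hypothetical naming scheme
    image_id = ec2_client.create_image(InstanceId=instance_id, Name=name)["ImageId"]

    # The image must be available before the group can launch from it.
    ec2_client.get_waiter("image_available").wait(ImageIds=[image_id])

    # Launch configurations cannot be modified, so a new one is created
    # and assigned to the group.
    asg_client.create_launch_configuration(
        LaunchConfigurationName=name,
        ImageId=image_id,
        InstanceType=instance_type,
    )
    asg_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        LaunchConfigurationName=name,
    )
    return image_id
```

Steps 2 and 3 (updating and testing the detached node) happen between the two calls and depend entirely on your application.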

If the instances are set up automatically by scripts (‘cattle’ instances), only the first two steps differ:

  1. Change script
    Change the init script to use the new version, or script whatever else is necessary to start the instances with the desired version.
  2. Start detached node
    Start the node standalone at first, meaning without adding it to the auto scaling group.

The other steps stay the same.
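As an illustration of the "change script" step, the init script can be generated from the desired version number. The sketch below builds a shell script that cloud-init would run on first boot when it is passed as user data. Everything specific in it is an assumption made for the example: the bucket name, the archive layout, the install path and the `start.sh` script are hypothetical, and the instance is assumed to have the AWS CLI installed and permission to read from the bucket.

```python
def render_user_data(version, bucket="example-release-bucket", app_name="myapp"):
    """Build a user-data shell script that installs one specific app version."""
    # Hypothetical artifact layout: s3://<bucket>/<app>/<app>-<version>.tar.gz
    artifact = f"s3://{bucket}/{app_name}/{app_name}-{version}.tar.gz"
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        # Fetch the release archive for exactly this version.
        f"aws s3 cp {artifact} /tmp/{app_name}.tar.gz",
        f"mkdir -p /opt/{app_name}",
        f"tar -xzf /tmp/{app_name}.tar.gz -C /opt/{app_name}",
        # Hypothetical start script shipped inside the archive.
        f"/opt/{app_name}/bin/start.sh",
    ]
    return "\n".join(lines) + "\n"
```

Changing the version then means rendering the script with a new argument, e.g. `render_user_data("1.4.2")`, and using the result as user data for the standalone node and later for the new launch configuration.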
