Most teams begin using automation tools like Terraform, Puppet, Chef, and Ansible to provision new infrastructure, but don’t use them regularly to make changes and apply updates once the systems are running. Building and configuring a new system from scratch is fairly easy. But writing definition files and scripts that will run reliably across a large number of existing systems is hard.
Chances are, things will have changed on some of those servers. For example, if you have a dozen web servers serving various websites, some sites will have needed an additional plugin or two, configuration tweaks, or even system packages. Problems will have cropped up and been fixed on a few of the servers, but not others.
The little differences and inconsistencies that accumulate between servers over time are Configuration Drift. Configuration drift makes it unlikely that a Playbook, Cookbook, Manifest, etc. will run reliably over all of the servers, which leads to the Automation Fear Spiral. This prevents teams from using automated configuration as effectively as they could.
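As a rough illustration of how drift can be spotted, the sketch below compares each server's installed packages against a baseline. The server names, package versions, and the `find_drift` helper are all hypothetical; a real check would pull this data from the servers themselves.

```python
from typing import Dict, Optional, Tuple

# Hypothetical inventory format: package name -> installed version.
Inventory = Dict[str, str]
Difference = Tuple[Optional[str], Optional[str]]


def find_drift(baseline: Inventory,
               servers: Dict[str, Inventory]) -> Dict[str, Dict[str, Difference]]:
    """Return, per server, the packages that differ from the baseline.

    Each difference maps package -> (baseline_version, server_version),
    where None means "not installed".
    """
    report = {}
    for name, installed in servers.items():
        diffs = {}
        for package in sorted(set(baseline) | set(installed)):
            expected = baseline.get(package)
            actual = installed.get(package)
            if expected != actual:
                diffs[package] = (expected, actual)
        if diffs:
            report[name] = diffs
    return report


# Made-up example data for three web servers.
baseline = {"nginx": "1.24", "php": "8.2"}
servers = {
    "web01": {"nginx": "1.24", "php": "8.2"},
    "web02": {"nginx": "1.24", "php": "8.2", "php-imagick": "3.7"},  # extra plugin
    "web03": {"nginx": "1.22", "php": "8.2"},  # missed an update
}

drift = find_drift(baseline, servers)
```

Here `web01` matches the baseline, while `web02` and `web03` each show up in the report with the one package that sets them apart.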
Models for updating servers
So a key part of any team’s infrastructure management approach is how to make changes to existing servers. A good automated change process should be easy and reliable, so that making changes outside the process - logging in and installing a package, for example - just feels wrong.
I summarize four models for updating servers in chapter 4 of the infrastructure book, and use them throughout. As with any model, this is just a convenience. Many teams will do things that don’t quite fit any one of these models, which is fine if it works for them. The purpose of this is to give us ideas of what might work for us.
The models are:
- Ad-hoc change management
- Configuration synchronization
- Immutable infrastructure
- Containerized services
Each of these is explained in a bit more detail below.
Ad Hoc Change Management
Ad hoc change management makes changes to servers only when a specific change is needed. This is the traditional, pre-automation approach - log into a server, edit files, install packages, and create user accounts. It still seems to be the most common approach even for people using automation tools like Ansible, Chef, and Puppet. People write or modify a configuration definition and then manually run the tool to apply it to a subset of servers. They don’t run the configuration tool unless they have a specific change they want to make.
The problem with this is that it leads to configuration drift and the Automation Fear Spiral, exactly as described above. Often, some servers can go a while without having the automation tool run on them. When someone finally does try to run it, the number of changes made is so large that it’s almost guaranteed that something will break.
Configuration Synchronization
Configuration synchronization repeatedly applies configuration definitions to servers, for example, by running a Puppet or Chef agent on an hourly schedule. This happens on all servers, regardless of whether any changes have been made to the definitions.
Doing this ensures that any changes made outside of the automation are brought back into line with the definitions. This discourages ad-hoc changes. It also guarantees that every server is up to date, having had all of the current configuration definitions applied.
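The idea of repeatedly applying the same definitions can be sketched in a few lines. This is a simplified model, not how Puppet or Chef work internally: servers are plain dictionaries, and `converge` and `sync_all` are hypothetical names.

```python
from typing import Dict, List, Tuple

Change = Tuple[str, object, object]  # (setting, old value, new value)


def converge(server: dict, definition: dict) -> List[Change]:
    """Bring one server into line with the definition.

    Only settings named in the definition are touched. The function is
    idempotent: running it again straight away makes no further changes.
    """
    changes = []
    for key, desired in definition.items():
        current = server.get(key)
        if current != desired:
            changes.append((key, current, desired))
            server[key] = desired
    return changes


def sync_all(servers: Dict[str, dict], definition: dict) -> Dict[str, List[Change]]:
    """Apply the definition to every server, whether or not anything changed."""
    return {name: converge(state, definition) for name, state in servers.items()}


# Made-up example: app02 has had a manual change made outside the automation.
definition = {"ntp_server": "ntp.example.com", "ssh_root_login": "no"}
servers = {
    "app01": {"ntp_server": "ntp.example.com", "ssh_root_login": "no"},
    "app02": {"ntp_server": "ntp.example.com", "ssh_root_login": "yes", "motd": "hello"},
}

first_run = sync_all(servers, definition)   # reverts the manual change on app02
second_run = sync_all(servers, definition)  # nothing left to do
```

The first run reverts the out-of-band change on `app02`; the second run reports no changes at all. Settings that the definition doesn't mention, like `motd` here, are left exactly as they were.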
Regularly applying configuration to all servers also speeds up the feedback cycle for changes, and simplifies finding and fixing problems. When someone rolls out a new change, for example applying a security patch, they can be confident it is the only change being made to systems. If something goes wrong, this gives them a smaller area to search for the cause, and lowers the impact of fixing it or rolling it back.
Configuration synchronization is probably the most common approach for teams with a mature infrastructure as code approach. Most server configuration tools, including Chef and Puppet, are designed with this approach in mind.
It’s important to have good monitoring that detects issues quickly, so any problems with a definition can be flagged and fixed. A change management pipeline, similar to a Continuous Delivery pipeline, can be used to automatically deploy changes to a test environment and run tests before allowing it to be applied to production.
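A change management pipeline of this kind can be modeled as an ordered list of stages, where a change only reaches production if every earlier stage passes. The sketch below is a toy version under that assumption; the stage names and check functions are hypothetical placeholders for real validation and test runs.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[dict], bool]]


def run_pipeline(definition: dict, stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order, stopping at the first failure.

    Returns (succeeded, names of the stages that passed).
    """
    passed = []
    for name, check in stages:
        if not check(definition):
            return False, passed
        passed.append(name)
    return True, passed


# Hypothetical stage implementations.
def has_required_keys(definition: dict) -> bool:
    return "package" in definition and "version" in definition


def passes_in_test_environment(definition: dict) -> bool:
    # Stand-in for deploying to a test environment and running tests there.
    return definition["version"] != "broken"


applied_to_production = []


def apply_to_production(definition: dict) -> bool:
    applied_to_production.append(definition)
    return True


stages = [
    ("validate", has_required_keys),
    ("test-environment", passes_in_test_environment),
    ("production", apply_to_production),
]

good = run_pipeline({"package": "openssl", "version": "3.0.13"}, stages)
bad = run_pipeline({"package": "openssl", "version": "broken"}, stages)
```

The second change fails in the test environment, so it never reaches the production stage.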
The main limitation of configuration synchronization is that it’s not feasible to have configuration definitions covering a significant percentage of a server. So this leaves large parts of a server unmanaged, leaving it open to configuration drift.
Immutable Infrastructure
Teams using immutable infrastructure make configuration changes by completely replacing servers. A change is made by building a new version of a server template (such as an AMI), and then rebuilding whatever servers are based on that particular template. This increases predictability, since there is little variance between servers as tested, and servers in production.
Immutable infrastructure requires mature processes and tooling for building and managing server templates. Packer is the go-to tool for building server images. As with configuration synchronization, a pipeline to automatically test and roll out server images is useful.
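To make the replace-rather-than-modify idea concrete, here is a small sketch. It does not call Packer or any cloud API; `Server`, `launch`, and `roll_out` are hypothetical names, and the template versions are made up.

```python
from dataclasses import dataclass
from itertools import count
from typing import List

_server_ids = count(1)


@dataclass(frozen=True)  # frozen: a server can't be altered after it is built
class Server:
    server_id: int
    template_version: str


def launch(template_version: str) -> Server:
    """Stand-in for booting a new server from a template image."""
    return Server(next(_server_ids), template_version)


def roll_out(pool: List[Server], new_template: str) -> List[Server]:
    """Return a new pool in which every out-of-date server has been replaced.

    Servers already built from `new_template` are kept; the rest are
    discarded and replaced by freshly launched ones, never edited in place.
    """
    new_pool = []
    for server in pool:
        if server.template_version == new_template:
            new_pool.append(server)
        else:
            new_pool.append(launch(new_template))
    return new_pool


pool = [launch("web-v1"), launch("web-v1"), launch("web-v2")]
updated = roll_out(pool, "web-v2")
```

After the rollout, the two `web-v1` servers have been swapped for new servers with new IDs, while the server that was already on `web-v2` is untouched.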
Containerized Services
Containerized services works by packaging applications and services in lightweight containers (as popularized by Docker). This reduces coupling between server configuration and the things that run on the servers.
Container host servers still need to be managed using one of the other models. However, they tend to be very simple, because they don’t need to do much other than running containers.
Most of the team’s effort and attention goes into packaging, testing, distributing, and orchestrating the services and applications, but this follows something similar to the immutable infrastructure model, which again is simpler than managing the configuration of full-blown virtual machines and servers.
A dynamic infrastructure platform is a fundamental requirement for Infrastructure as Code. I define this as “a system that provides computing resources, particularly servers, storage, and networking, in a way that they can be programmatically allocated and managed.”
In practice, this most often means a public IaaS (Infrastructure as a Service) cloud like Amazon’s AWS, Google’s GCE, or Microsoft’s Azure. But it can also be a private cloud platform using something like OpenStack or VMware vCloud. A dynamic infrastructure platform can also be implemented with an API-driven virtualization system like VMware. These systems normally force your infrastructure management tools to explicitly decide where to allocate resources - which hypervisor instance to start a VM on, which storage pool to allocate a network share from, etc. But this is still compatible with Infrastructure as Code, because it’s all programmable.
Many organizations, including DevOps paragons like Etsy and Spotify, implement Infrastructure as Code on bare-metal, with no virtualization or cloud at all. Tools such as Cobbler or Foreman can be used to automatically provision physical servers, leveraging ILO (Integrated Lights Out) features of the server hardware.
The key characteristics needed from an infrastructure platform for Infrastructure as Code are:
- Programmable
- On-demand
- Self-service
A dynamic infrastructure platform must be programmable. An API makes it possible for scripts, software, and tools to interact with the platform. Even if you’re using an off-the-shelf tool like Terraform or Ansible to provision infrastructure, you’ll almost certainly need to write some custom scripting or tools here and there. So you should make sure the platform’s API has good support for scripting languages that your team is comfortable with. Keep in mind the difference between “good” support for the language, and just having a tickbox.
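The kind of custom scripting this refers to might look something like the sketch below. `FakePlatform` is an in-memory stand-in invented for this example, not any real provider's SDK; its method names are assumptions, and a real script would call your platform's own client library instead.

```python
from typing import List


class FakePlatform:
    """In-memory stand-in for a platform API (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._servers = {}
        self._next_id = 1

    def create_server(self, role: str, ram_gb: int) -> str:
        server_id = f"srv-{self._next_id}"
        self._next_id += 1
        self._servers[server_id] = {"role": role, "ram_gb": ram_gb}
        return server_id

    def destroy_server(self, server_id: str) -> None:
        del self._servers[server_id]

    def list_servers(self, role: str) -> List[str]:
        return sorted(sid for sid, s in self._servers.items() if s["role"] == role)


def scale_to(platform: FakePlatform, role: str, wanted: int, ram_gb: int = 4) -> List[str]:
    """Create or destroy servers until exactly `wanted` exist for the role."""
    current = platform.list_servers(role)
    for _ in range(wanted - len(current)):  # empty range if none are missing
        platform.create_server(role, ram_gb)
    for server_id in current[wanted:]:      # empty slice if none are surplus
        platform.destroy_server(server_id)
    return platform.list_servers(role)


platform = FakePlatform()
scaled_up = scale_to(platform, "web", 3)
scaled_down = scale_to(platform, "web", 1)
```

The point is less the logic than the shape: a few lines of script can grow or shrink a pool of servers because every operation is available through an API.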
The dynamic infrastructure platform needs to allow resources to be created and destroyed immediately. You would think this is obvious, but it’s not always the case. Some managed hosting providers, and internal IT departments, offer services they call “cloud”, but which require raising tickets to get someone else to make it happen. The hosting platform needs to be able to fulfill provisioning requests within minutes, if not seconds.
Billing and budgeting also need to be structured to support on-demand, incremental charging. If you need to sign a contract, or issue a purchase order, in order to create a new server, then it’s not going to work. If adding a new server requires a commitment of more than an hour, it’s not going to work.
Also, if your “cloud” hosting provider charges you for the hardware you’ll be using, and then charges you for each VM you run, then you’re being taken advantage of. That’s not how cloud works.
Self-service takes the on-demand requirement, and adds a bit more. It’s not enough to be able to get resources like servers quickly; you need to be able to customize and tailor them yourself. You shouldn’t need to get someone else to approve how much RAM and how many CPUs your server will have. You should be able to tweak and adjust these things on existing servers.
Specifying your environment’s details, and changing it, will actually be done in definition files (like a Terraform file), using the platform’s programmable API. So any arrangement where a central group does this for you isn’t going to work.
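One way to picture how a tool turns a definition file into changes is as a comparison between what the file asks for and what currently exists. The sketch below is a simplified model of that idea, not Terraform's actual algorithm; the `plan` function and the resource data are hypothetical.

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, dict]  # (operation, resource name, specification)


def plan(desired: Dict[str, dict], actual: Dict[str, dict]) -> List[Action]:
    """Work out which actions would make `actual` match `desired`."""
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name, desired[name]))
        elif actual[name] != desired[name]:
            actions.append(("update", name, desired[name]))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("destroy", name, actual[name]))
    return actions


# Made-up example: what the definition asks for vs. what is running now.
desired = {
    "web": {"ram_gb": 8, "cpus": 2},
    "db": {"ram_gb": 16, "cpus": 4},
}
actual = {
    "web": {"ram_gb": 4, "cpus": 2},
    "cache": {"ram_gb": 2, "cpus": 1},
}

actions = plan(desired, actual)
```

Changing the RAM for `web` is a one-line edit to the definition, which the tool then turns into an update through the platform's API. That only works if the team editing the file is allowed to make the change.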
I like the analogy of Lego bricks. A central IT group that manages your cloud for you is like buying a box of Lego bricks, but having the shop staff decide how to assemble them for you. It stops you from taking ownership of the infrastructure you use. You won’t be able to learn how to shape your infrastructure to your own needs and improve it over time.
Worse is when a central IT team offers you a catalog of pre-defined infrastructure. This is like only being able to buy a Lego set that has already been built for you and glued together. You’ve got no ability to adjust and improve it. You often can’t even request a change, such as a newer version of a JVM. Instead, you have to wait for the central group to build and test a new standard offering.
What you want
Ultimately, your infrastructure platform needs to give you the ability to define your infrastructure in files, and have your tools provision and update that infrastructure. This reduces your reliance on an overworked central team, and ensures you can continuously improve and adapt your infrastructure to support the application you run on it as effectively as possible.
The thumbnail definition that I trot out for Infrastructure as Code is using development practices and tools to manage infrastructure. This sounds like a natural thing to do, if you’re defining your infrastructure in definition files used by tools like Chef, Puppet, and Ansible. These files look like source code, and can be checked into Git or another VCS just like source code.
But what are the actual benefits of treating your infrastructure this way? Configuring infrastructure by editing files in a VCS is a dramatically different way of working than the old-school alternatives - clicking in a GUI-driven configuration, or logging into servers and editing configuration files. To make this shift, and to really get the benefits from it, you need to be pretty clear on what you’re trying to get out of it.
The headline benefit of Infrastructure as Code is being able to manage changes to infrastructure easily and responsibly. We’d like to be able to make changes rapidly, with low risk. And we’d like to keep doing this even as the size and complexity of the infrastructure grows, and as more teams are using our infrastructure.
The enemy of this goal is manually-driven processes. Manual steps to provision, configure, modify, update, and fix things are the most obvious things to eliminate. But manually-driven process and governance can be at least as big an obstacle to frequent, low-risk changes. This becomes especially difficult to handle as an organization grows.
So what kind of benefits should you see from a well-implemented Infrastructure as Code approach?
- Your IT infrastructure supports and enables change, rather than being an obstacle or a constraint for its users.
- Changes to the system are routine, without drama or stress for users or IT staff.
- IT staff spend their time on valuable things which engage their abilities, not on routine, repetitive tasks.
- Users are able to responsibly define, provision, and manage the resources they need, without needing IT staff to do it for them.
- Teams are able to easily and quickly recover from failures, rather than assuming failure can be completely prevented.
- Improvements are made continuously, rather than done through expensive and risky “big bang” projects.
- You find solutions to problems by implementing, testing, and measuring them, rather than by discussing them in meetings and documents.
I’ve delivered the text of my book to O’Reilly’s production team, which means we’re on the path to publication! The last “early access” release should be available soon to people who have bought it (which you can still do from the O’Reilly Shop), and then the final release will be out in stores.
The final early access push will be pretty much the final content, only missing copyediting (spelling, grammar, etc.) and professionally designed graphics.
My employer, ThoughtWorks, is sponsoring a free download of three chapters of my upcoming book, “Infrastructure as Code”. These chapters focus on software engineering, testing, and Continuous Delivery practices for infrastructure. ThoughtWorks has a deep history in all of these areas, so they seemed like an appropriate group of chapters for us to sponsor as a company.
I’ve now completed the full draft of the book. We’re getting technical reviews of the book, and have started getting the diagrams professionally designed. It’s awesome to see my crude attempts at diagrams turned into clean, slick images!