Using Pipelines to Manage Environments

Tools like Terraform, AWS CloudFormation, Azure Resource Manager Templates, Google Cloud Deployment Manager Templates and OpenStack Heat are a great way to define server infrastructure for deploying software. The configuration to provision, modify, and rebuild an environment is captured in a way that is transparent, repeatable, and testable. Used right, these tools give us confidence to tweak, change, and refactor our infrastructure easily and comfortably.

But as most of us discover after using these tools for a while, there are pitfalls. Any automation tool that makes it easy to roll out a fix across a sprawling infrastructure also makes it easy to roll out a cock-up across a sprawling infrastructure. Have you ever corrupted the /etc/hosts files on every server in your non-production estate, making it impossible to ssh into any of them — or to run the tool again to fix the error? I have.

What we need is a way to make and test changes safely, before applying them to environments that we care about. This is software delivery 101: always test a new build in a staging environment before deploying it to live. But the best way to structure your infrastructure code to do this isn’t necessarily obvious.

I’m going to describe a few different ways people do this:

  • Put all of the environments into a single stack
  • Define each environment in a separate stack
  • Create a single stack definition and promote it through a pipeline

In a nutshell: the first way is bad; the second works well for simple setups (two or three environments, not many people working on them); and the third has more moving parts, but works well for larger and more complex groups.

Before diving in, here are a couple of definitions: I use the term stack (or stack instance) to refer to a set of infrastructure that is defined and managed as a unit. This maps directly to an AWS CloudFormation stack, and also to the set of infrastructure corresponding to a Terraform state file. I also talk about a stack definition as a file, or set of files, used by a tool to create a stack. This could be a folder with a bunch of *.tf files for Terraform, or a folder of CloudFormation template files.

I’ll show some code examples using Terraform and AWS, but the concepts apply to pretty much any declarative, “Infrastructure as Code” tool, and any automated, dynamic infrastructure platform.

I’ll also assume two environments — staging and production, for the sake of simplicity. Many teams end up with more environments — development, QA, UAT, performance testing, etc. — but again, the concepts are the same.

One stack with all the environments

This is the most straightforward approach, the one that most people start out using, and the most problematic. All of the environments, from development to production, are defined in a single stack definition, and they are all created and managed as a single stack instance.

Multiple environments managed as a single stack

The code example below shows a single Terraform configuration for both staging and production environments:

# STAGING ENVIRONMENT
resource "aws_vpc" "staging_vpc" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "staging_subnet" {
  vpc_id     = "${aws_vpc.staging_vpc.id}"
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "staging_access" {
  name   = "staging_access"
  vpc_id = "${aws_vpc.staging_vpc.id}"
}

resource "aws_instance" "staging_server" {
  instance_type          = "t2.micro"
  ami                    = "ami-ac772edf"
  vpc_security_group_ids = ["${aws_security_group.staging_access.id}"]
  subnet_id              = "${aws_subnet.staging_subnet.id}"
}

# PRODUCTION ENVIRONMENT
resource "aws_vpc" "production_vpc" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "production_subnet" {
  vpc_id     = "${aws_vpc.production_vpc.id}"
  cidr_block = "10.0.1.0/24"
}

resource "aws_security_group" "production_access" {
  name   = "production_access"
  vpc_id = "${aws_vpc.production_vpc.id}"
}

resource "aws_instance" "production_server" {
  instance_type          = "t2.micro"
  ami                    = "ami-ac772edf"
  vpc_security_group_ids = ["${aws_security_group.production_access.id}"]
  subnet_id              = "${aws_subnet.production_subnet.id}"
}

This is a simple approach that makes everything visible in one place, but it doesn’t isolate the environments from one another. Making a change to the staging environment risks breaking production. And resources may leak or become confused across environments, making it even easier to accidentally affect an environment you don’t mean to.

Charity Majors shared problems she ran into with this approach using Terraform. The blast radius (great term!) of a change is everything included in the stack. And note that this is still true even with tools that don’t use state files as Terraform does. Defining multiple environments in a single CloudFormation stack is asking for trouble.

A separate stack definition for each environment

Charity (and others) suggests splitting your environments into separate stack definitions. Each environment would have its own directory with its own Terraform configuration:

./our-project/staging/main.tf
./our-project/production/main.tf

These two different configurations should be identical, or nearly so. By running your infrastructure tool against each of these separately, you isolate the environments from one another (at least when it comes to using the tool, although obviously they may or may not be isolated in terms of networking, cloud account permissions, etc.). And because each environment has its own set of definition files, the intended state of each environment is very clear.

Each environment managed as its own stack

When you need to make a change, you edit the files for the staging environment stack, apply them, and test them. Iterate until it works the way you want. Then make the same changes to the files for the production environment and apply them to the production stack instance.

A drawback with this approach is that it’s easy for differences to creep into environments. Some of these differences may be accidental, as when someone in a rush makes a fix directly to the production environment configuration, and forgets to go back to the other environments to make the same change.

But often, as with statically defined environments, people try things out in different environments, and leave them in when they’re distracted by other tasks. Over time, each environment becomes a snowflake. Even though each environment is defined as code, differences between the environments mean there is no trust that a given change can be quickly and safely applied across them all.

So maintaining a separate definition file per environment requires vigilance and discipline to keep them consistent.

Modules can be used to share code across environments, which can help with consistency. But modules can also create tight coupling. Making a change to a shared module requires ensuring the change won’t break other environments. The ability to test a change to a module in one environment before applying it to production adds complexity for versioning and release management.

One stack definition managed with a pipeline

An alternative is to use a continuous delivery pipeline to promote a stack definition file across environments. Each environment has its own stack instance, so the blast radius for a change is contained to the environment.

But a single stack definition is re-used to create and update each environment. The definition can be parameterized, to capture differences between instances, such as cluster sizing. The definition file is versioned, so we have visibility of what code was used for any environment, at any point in time.
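As a rough sketch of what parameterizing can look like, a pipeline stage might pass per-environment values to the one shared definition on the command line. The variable names environment and cluster_size, the sizes, and the run helper are all hypothetical; -var and -auto-approve are standard Terraform flags. Each environment also needs its own state, which is not shown here.

```shell
# Sketch: one stack definition, different parameters per environment.
# Variable names and sizes are assumptions for illustration; the
# definition itself would have to declare matching input variables.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Look up the (hypothetical) cluster size for an environment.
cluster_size_for() {
  case "$1" in
    production) echo 6 ;;
    staging)    echo 2 ;;
    *)          echo 1 ;;   # sandboxes and anything else stay small
  esac
}

# Apply the shared definition with environment-specific values.
apply_stack() {
  env_name="$1"
  run terraform apply -auto-approve \
    -var "environment=${env_name}" \
    -var "cluster_size=$(cluster_size_for "$env_name")"
}
```

Setting DRY_RUN=1 before calling apply_stack staging prints the command rather than running it, which is a cheap way to check what a stage would do.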

One definition used for multiple environments

A single definition file used to create multiple stack instances in a pipeline

The moving parts for implementing this are: a source repository, such as a git repository; an artefact repository; and a CI or CD server such as GoCD, Jenkins, etc.

A simple example workflow is:

  1. Someone commits a change to the source repository.
  2. The CD server detects the change and puts a copy of the definition files into the artefact repository, with a version number.
  3. The CD server applies the definition version to the first environment, then runs automated tests to check it.
  4. Someone triggers the CD server to apply the definition version to production.

Basic flow of a stack definition through a pipeline
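The ordering of those steps can be sketched in a few lines of shell. This is only an outline under stated assumptions: every stage here is a placeholder function with a made-up name, and a real CD server would run its own jobs in their place.

```shell
# Sketch of the workflow's stage ordering. Each stage is a placeholder;
# a real CD server (GoCD, Jenkins, etc.) would run its own jobs here.
publish_definition() { echo "publish definition version $1"; }
apply_definition()   { echo "apply version $1 to $2"; }
run_tests()          { echo "run automated tests against $1"; }

# Automated part of the pipeline for one committed change.
# The && chain stops at the first failing stage, so a version that
# fails its tests is never offered for promotion to production.
run_pipeline() {
  version="$1"
  publish_definition "$version" &&
    apply_definition "$version" staging &&
    run_tests staging &&
    echo "version $version is ready to be promoted to production"
}

# Step 4 is triggered by a person, not by the commit.
promote_to_production() {
  apply_definition "$1" production
}
```

The point of the chain is that promotion to production is only ever offered for a version that has already been applied and tested in the earlier environment.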

Benefits

Teams I work with use pipelines for infrastructure for the same reasons that development teams use pipelines for their application code. A pipeline guarantees that every change has been applied to each environment, and that automated tests have passed. We know that all the environments are defined and created consistently.

With the previous “one stack definition per environment” approach, creating a new environment requires creating a new folder with its own copy of the files. These files then need to be maintained and kept up to date with changes made to other environments.

But the pipeline approach is more flexible. New environment instances can be spun up on demand, which has a number of benefits:

  • Developers can create their own sandbox instances, so they can deploy and test cloud-based applications, or work on changes to the environment definitions, without conflicting with other team members.
  • Changes and deployments can be handled using a blue-green approach — create a new instance of the stack, test it, then swap traffic over and destroy the old instance.
  • Testers, reviewers, and others can create environments as needed, tearing them down when they’re not being used.

Artefact repository, promotion, and versioning

The pipeline treats a stack definition as a versioned artefact, promoting it from one stage to the next. As per Continuous Delivery, an artefact is never changed. This is why I suggest a separate artefact repository in addition to the version control repository. Version control is used to manage changes to the definition; the artefact repository is used to preserve immutable, versioned artefacts. However, I’ve seen people use their code repository for both, which works fine as long as you stick to the principle of never changing a definition version once it’s been “published” and used for any environment.
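One way to enforce that rule is for the publishing step to refuse to overwrite a version that already exists. Here is a minimal sketch, with a local directory standing in for the artefact repository; the function name and paths are hypothetical.

```shell
# Sketch: publish a definition version once, and never overwrite it.
# A local directory stands in for the artefact repository here.
publish_version() {
  src="$1"       # folder holding the stack definition files
  repo="$2"      # root of the stand-in artefact repository
  version="$3"   # version number assigned by the CD server
  dest="${repo}/${version}"

  # An already published version is immutable: fail instead of replacing it.
  if [ -e "$dest" ]; then
    echo "version ${version} already published; refusing to overwrite" >&2
    return 1
  fi

  mkdir -p "$repo" && cp -R "$src" "$dest"
}
```

A second attempt to publish the same version number fails, so any fix has to go back through version control and come out as a new version.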

For stack definitions, I like to use an S3 bucket (or an equivalent) as a repository. I have a pipeline stage that “publishes” the definitions by creating a folder in the bucket with a version number in the name, and copying the files into it. This code is executed by a CI/CD server agent:

aws s3 sync ./our-project/ s3://our-project-repository/1.0.123/

Promoting a version of a stack definition can be done in different ways. With the S3 bucket, I’ll sometimes have a folder named for the environment, and copy the files for the relevant version into that folder. Again, this code runs from a CI/CD agent:

aws s3 sync --delete \
  s3://our-project-repository/1.0.123/ \
  s3://our-project-repository/staging/

The pipeline stage for any environment simply runs Terraform (or CloudFormation), grabbing the definitions from the relevant folder.
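A stage like that might look roughly like the following. The bucket name follows the examples above; the working directory, the state key layout, the use of an S3 backend, and the run helper are assumptions. The -chdir option needs Terraform 0.14 or later.

```shell
# Sketch of a pipeline stage that applies the promoted definition
# to one environment. Runs on a CI/CD agent.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

apply_environment() {
  env_name="$1"
  workdir="./work/${env_name}"   # hypothetical working directory on the agent

  # Fetch whichever definition version was promoted to this environment.
  run mkdir -p "$workdir" || return 1
  run aws s3 sync --delete \
    "s3://our-project-repository/${env_name}/" "$workdir" || return 1

  # Keep a separate state per environment. This assumes the definition
  # declares an S3 backend and leaves the key to be supplied here.
  run terraform -chdir="$workdir" init -input=false \
    -backend-config="key=${env_name}/terraform.tfstate" || return 1

  run terraform -chdir="$workdir" apply -input=false -auto-approve
}
```

The same function serves every environment; only the name passed in changes, which is what keeps the environments consistent.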

Running the tool

The commands are pretty much all run by the CI or CD server on an agent. This does a few good things. First, it ensures that the command is fully automated. There’s no relying on “just one” manual step or tweak, which expands with “just one more, for now”. You don’t end up with different people doing things in their own way. You don’t end up with undocumented steps. And you don’t have to worry, in an emergency situation, that someone will forget that crucial step and bork the environment. If there’s a step that can’t easily be automated, then find a way.

Another good thing is that you don’t worry about different people applying changes to the same environment. Nobody applies changes to a common environment by running the tool from their own machine. No need to worry about locking and unlocking. No need to worry about locally edited changes. The only changes made to an environment you care about are the ones that have been fed into the pipeline.

This also helps with consistency. Nobody can make a change that requires a special plugin or utility they’ve installed on their own laptop. There is no tweak that someone made when applying a change to staging, that might be forgotten for production. The tool is always run from a CD agent, which is built and configured consistently, always using the same script, no matter which environment it is applied to.

Developer workflow

People working on the infrastructure probably need to make and test changes to stack definitions rapidly, before committing into the pipeline. This is the one place you may want to run the tool locally. But when you do, you’re doing it with your own instance of the environment stack. Pull the latest copy of the definitions from the source repo, run the tool to create the environment, then start making changes. (Write some automated tests, of course, which should also be in the source repo and automatically run in relevant environments in the pipeline.)

When you’re happy, make sure you’ve pulled the latest updates from master, and that your tests pass. Then commit the changes and destroy your environment.

No need to worry about breaking a common environment like staging. If someone else is working on changes that might conflict with yours, you should see them when you pull from master. If the changes merge OK, but cause issues, they should be caught in the first test environment in the pipeline. The build goes red, everyone stops and looks to see what needs to be done to fix it.

When is this appropriate?

I’ve been using pipelines to manage infrastructure for years. It seemed like an obvious way to do it, since ThoughtWorks teams use pipelines for releasing software as a matter of course. But recently I’ve been running across experienced people who maintain a separate definition file for each environment.

In some cases I think people simply haven’t been exposed to the idea of using a pipeline for infrastructure. Hence this article. But I’m also aware that these two different approaches (and maybe others I’m not aware of) are appropriate for different contexts.

The pipeline approach has more moving parts than simply having a separate stack definition for each environment. For smaller teams, with a simple release process and not many environments, a pipeline may be overkill. But when you have more people working on a system, and especially when there are multiple teams and roles, a pipeline creates a consistent, reliable process for rolling out infrastructure changes. It may also make sense even for smaller teams who are already using a pipeline for their software to add their infrastructure files into it, rather than having a separate process for environments and applications.

There are some caveats. It takes time to get comfortable with making infrastructure changes using a pipeline. You may not get as much value without at least some degree of automated testing, which is another discipline to build. Running infrastructure management tools automatically from a CD server agent requires a secure approach to managing secrets, to prevent your automation system from becoming an easy attack vector.

Beyond pipelines

The pipeline approach I’ve described answers the question of how to manage multiple instances of a particular environment, in a “path to production”. There are other challenges with managing a complex infrastructure codebase. How do you share code, infrastructure, and/or services between teams? What’s the best way to structure and release modules of shared infrastructure code?

I’ve also spent some time thinking about patterns I’ve seen for splitting a single environment into multiple stacks, which I described in a talk. If this article is about applying principles and practices from Continuous Delivery to infrastructure, that talk was about applying principles from Microservices to infrastructure.

I hope this article gives people food for thought, and would like to hear what other approaches people have tried and found useful.

Acknowledgements

Aside from inspiration from Charity Majors and Yevgeniy Brikman, I received feedback from my ThoughtWorks colleagues Nassos Antoniou, Andrew Langhorn, Pat Downey, Rafael Gomes, Kevin Yeung, and Joe Ray, and also from Antonio Terreno.

This article was originally posted on Medium and on ThoughtWorks Insights.


Infrastructure as Code has been published

Infrastructure as Code is officially published! The “Pre-order” buttons on Amazon.com and the O’Reilly shop have flipped to “Add to Cart” for both print and e-book formats. I’m told that “boxes and boxes” of the book have arrived in the ThoughtWorks office, although I’m not in today so I don’t have my greasy mitts on a copy yet.

Links to various sites to order the book are over on the right of this page.

Screenshot of the order page on Amazon

This book has been a long haul. It’s hard to express how important the support of so many people has been. ThoughtWorks has been incredibly supportive, on so many levels. I doubt I would have done it without the examples and inspiration of my amazing colleagues across the globe. Martin Fowler has been a tremendous mentor. Rong Tang did the artwork, showing infinite patience as she turned inconsistent, muddled, and incoherent scribbles into great looking imagery.

I owe a great debt to the people of the DevOpsDays community, who are collectively responsible for the ideas of DevOps and Infrastructure as Code. I took a stab at collating what I’ve learned from many people’s ideas and experiences into something that hopefully will help other people as much as this community has helped me.


Different models for updating servers

Most teams begin using automation tools like Terraform, Puppet, Chef, and Ansible to provision new infrastructure, but don’t use them regularly to make changes and apply updates once the systems are running. Building and configuring a new system from scratch is fairly easy. But writing definition files and scripts that will run reliably across a large number of existing systems is hard.

Chances are, things will have changed on some of those servers. For example, if you have a dozen web servers serving various websites, some sites will have needed an additional plugin or two, configuration tweaks, or even system packages. Problems will have cropped up and been fixed on a few of the servers, but not others.

The little differences and inconsistencies that accumulate between servers over time are Configuration Drift. Configuration drift makes it unlikely that a Playbook, Cookbook, Manifest, etc. will run reliably over all of the servers, which leads to the Automation Fear Spiral. This prevents teams from using automated configuration as effectively as they could.

Models for updating servers

So a key part of any team’s infrastructure management approach is how to make changes to existing servers. A good automated change process should be easy and reliable, so that making changes outside the process - logging in and installing a package, for example - just feels wrong.

I summarize four models for updating servers in chapter 4 of the infrastructure book, and use them throughout. As with any model, this is just a convenience. Many teams will do things that don’t quite fit any one of these models, which is fine if it works for them. The purpose of this is to give us ideas of what might work for us.

The models are:

  • Ad-hoc change management
  • Configuration synchronization
  • Immutable infrastructure
  • Containerized services

Each of these is explained in a bit more detail below.

Ad Hoc Change Management

Ad hoc change management makes changes to servers only when a specific change is needed. This is the traditional, pre-automation approach - log into a server, edit files, install packages, and create user accounts. It still seems to be the most common approach even for people using automation tools like Ansible, Chef, and Puppet. People write or modify a configuration definition and then manually run the tool to apply it to a subset of servers. They don’t run the configuration tool unless they have a specific change they want to make.

The problem with this is that it leads to configuration drift and the automation fear spiral, exactly as described above. Often, some servers can go a while without having the automation tool run on them. When someone finally does try to run it, the number of changes made is so large that it’s almost guaranteed that something will break.

Configuration Synchronization

Configuration synchronization repeatedly applies configuration definitions to servers, for example, by running a Puppet or Chef agent on an hourly schedule. This happens on all servers, regardless of whether any changes have been made to the definitions.

Doing this ensures that any changes made outside of the automation are brought back into line with the definitions. This discourages ad-hoc changes. It also guarantees that every server is up to date, having had all of the current configuration definitions applied.
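The core of that behaviour can be illustrated with a toy example. This is not how Puppet or Chef work internally, just a sketch of the idea for a single file; the function name is made up, and a real tool manages packages, services, users, and much more.

```shell
# Toy sketch of configuration synchronization for one file:
# every run compares the server's copy with the definition and
# puts it back in line if anything has drifted.
sync_file() {
  definition="$1"   # the file as defined in version control
  target="$2"       # the live file on the server

  if cmp -s "$definition" "$target" 2>/dev/null; then
    echo "unchanged"
  else
    cp "$definition" "$target" && echo "corrected"
  fi
}
```

Run on a schedule, a function like this is safe to repeat. It does nothing when the server already matches, and it quietly reverts a manual edit the next time it runs.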

Regularly applying configuration to all servers also speeds up the feedback cycle for changes, and simplifies finding and fixing problems. When someone rolls out a new change, for example applying a security patch, they can be confident it is the only change being made to systems. If something goes wrong, this gives them a smaller area to search for the issue, and lower impact for fixing it or rolling it back.

Configuration synchronization is probably the most common approach for teams with a mature infrastructure as code approach. Most server configuration tools, including Chef and Puppet, are designed with this approach in mind.

It’s important to have good monitoring that detects issues quickly, so any problems with a definition can be flagged and fixed. A change management pipeline, similar to a Continuous Delivery pipeline, can be used to automatically deploy changes to a test environment and run tests before allowing it to be applied to production.

The main limitation of configuration synchronization is that it’s not feasible to have configuration definitions covering a significant percentage of a server. So this leaves large parts of a server unmanaged, leaving it open to configuration drift.

Immutable Infrastructure

Teams using immutable infrastructure make configuration changes by completely replacing servers. A change is made by building a new version of a server template (such as an AMI), and then rebuilding whatever servers are based on that particular template. This increases predictability, since there is little variance between servers as tested, and servers in production.

Immutable infrastructure requires mature processes and tooling for building and managing server templates. Packer is the go-to tool for building server images. As with configuration synchronization, a pipeline to automatically test and roll out server images is useful.

Containerized Services

The containerized services model works by packaging applications and services in lightweight containers (as popularized by Docker). This reduces coupling between server configuration and the things that run on the servers.

Container host servers still need to be managed using one of the other models. However, they tend to be very simple, because they don’t need to do much other than running containers.

Most of the team’s effort and attention goes into packaging, testing, distributing, and orchestrating the services and applications, but this follows something similar to the immutable infrastructure model, which again is simpler than managing the configuration of full-blown virtual machines and servers.


Dynamic Infrastructure Platforms

A dynamic infrastructure platform is a fundamental requirement for Infrastructure as Code. I define this as “a system that provides computing resources, particularly servers, storage, and networking, in a way that they can be programmatically allocated and managed.”

In practice, this most often means a public IaaS (Infrastructure as a Service) cloud like Amazon’s AWS, Google’s GCE, or Microsoft’s Azure. But it can also be a private cloud platform using something like OpenStack or VMware vCloud. A dynamic infrastructure platform can also be implemented with an API-driven virtualization system like VMware. These systems normally force your infrastructure management tools to explicitly decide where to allocate resources - which hypervisor instance to start a VM on, which storage pool to allocate a network share from, etc. But this is still compatible with Infrastructure as Code, because it’s all programmable.

Many organizations, including DevOps paragons like Etsy and Spotify, implement Infrastructure as Code on bare-metal, with no virtualization or cloud at all. Tools such as Cobbler or Foreman can be used to automatically provision physical servers, leveraging ILO (Integrated Lights Out) features of the server hardware.

The key characteristics needed from an infrastructure platform for Infrastructure as Code are:

  • Programmable
  • On-demand
  • Self-service

Programmable

A dynamic infrastructure platform must be programmable. An API makes it possible for scripts, software, and tools to interact with the platform. Even if you’re using an off-the-shelf tool like Terraform or Ansible to provision infrastructure, you’ll almost certainly need to write some custom scripting or tools here and there. So you should make sure the platform’s API has good support for scripting languages that your team is comfortable with. Keep in mind the difference between “good” support for the language, and just having a tickbox.
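As an illustration of the kind of small custom script this means, here is a sketch that asks the AWS API, through the AWS CLI, which instances belong to an environment. The Environment tag is an assumed tagging convention and the run helper is hypothetical; the CLI options themselves are standard.

```shell
# Sketch: a small custom script that talks to the platform's API
# through the AWS CLI.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# List the IDs of instances tagged for one environment.
# The Environment tag is an assumed tagging convention.
list_instances() {
  run aws ec2 describe-instances \
    --filters "Name=tag:Environment,Values=$1" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text
}
```

A script like this is only practical because the platform exposes everything through an API; the same request through a ticketing system would defeat the purpose.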

On-Demand

The dynamic infrastructure platform needs to allow resources to be created and destroyed immediately. You would think this is obvious, but it’s not always the case. Some managed hosting providers, and internal IT departments, offer services they call “cloud”, but which require raising tickets to get someone else to make it happen. The hosting platform needs to be able to fulfill provisioning requests within minutes, if not seconds.

Billing and budgeting also need to be structured to support on-demand, incremental charging. If you need to sign a contract, or issue a purchase order, in order to create a new server, then it’s not going to work. If adding a new server requires a commitment of more than an hour, it’s not going to work.

Also, if your “cloud” hosting provider charges you for the hardware you’ll be using, and then charges you for each VM you run, then you’re being taken advantage of. That’s not how cloud works.

Self-Service

Self-service takes the on-demand requirement, and adds a bit more. It’s not enough to be able to get resources like servers quickly; you need to be able to customize and tailor them yourself. You shouldn’t need to get someone else to approve how much RAM and how many CPUs your server will have. You should be able to tweak and adjust these things on existing servers.

Specifying your environment’s details, and changing them, will actually be done in definition files (like a Terraform file), using the platform’s programmable API. So any arrangement where a central group does this for you isn’t going to work.

I like the analogy of Lego bricks. A central IT group that manages your cloud for you is like buying a box of Lego bricks, but having the shop staff decide how to assemble them for you. It stops you from taking ownership of the infrastructure you use. You won’t be able to learn how to shape your infrastructure to your own needs and improve it over time.

Worse is when a central IT team offers you a catalog of pre-defined infrastructure. This is like only being able to buy a Lego set that has already been built for you and glued together. You’ve got no ability to adjust and improve it. You often can’t even request a change, such as a newer version of a JVM. Instead, you have to wait for the central group to build and test a new standard offering.

What you want

Ultimately, your infrastructure platform needs to give you the ability to define your infrastructure in files, and have your tools provision and update that infrastructure. This reduces your reliance on an overworked central team, and ensures you can continuously improve and adapt your infrastructure to support the application you run on it as effectively as possible.


Why Infrastructure as Code

The thumbnail definition that I trot out for Infrastructure as Code is using development practices and tools to manage infrastructure. This sounds like a natural thing to do, if you’re defining your infrastructure in definition files used by tools like Chef, Puppet, and Ansible. These files look like source code, and can be checked into Git or another version control system (VCS) like source code.

But what are the actual benefits of treating your infrastructure this way? Configuring infrastructure by editing files in a VCS is a dramatically different way of working than the old-school alternatives - clicking in a GUI-driven configuration, or logging into servers and editing configuration files. To make this shift, and to really get the benefits from it, you need to be pretty clear on what you’re trying to get out of it.

The headline benefits of Infrastructure as Code are to be able to easily and responsibly manage changes to infrastructure. We’d like to be able to make changes rapidly, with low risk. And we’d like to keep doing this even as the size and complexity of the infrastructure grows, and as more teams are using our infrastructure.

The enemy of this goal is manually-driven processes. Manual steps to provision, configure, modify, update, and fix things are the most obvious things to eliminate. But manually-driven process and governance can be at least as big an obstacle to frequent, low-risk changes. This becomes especially difficult to handle as an organization grows.

So what kind of benefits should you see from a well-implemented Infrastructure as Code approach?

  • Your IT infrastructure supports and enables change, rather than being an obstacle or a constraint for its users.
  • Changes to the system are routine, without drama or stress for users or IT staff.
  • IT staff spend their time on valuable things which engage their abilities, not on routine, repetitive tasks.
  • Users are able to responsibly define, provision, and manage the resources they need, without needing IT staff to do it for them.
  • Teams are able to easily and quickly recover from failures, rather than assuming failure can be completely prevented.
  • Improvements are made continuously, rather than done through expensive and risky “big bang” projects.
  • You find solutions to problems by implementing, testing, and measuring them, rather than by discussing them in meetings and documents.

(Photo by Sebastien Wiertz)