Announcing the second edition of Infrastructure as Code

TL;DR: Go read the early release of the second edition of Infrastructure as Code on O’Reilly! It contains the first eight chapters of what will probably be eighteen or so.

Cover of Infrastructure as Code 2nd edition

Four years ago I was close to finishing my book Infrastructure as Code. I felt like I was racing against the industry’s ability to innovate and improve infrastructure technology. At the time, the action was in server configuration - the book’s subtitle was Managing Servers in the Cloud. Docker, Kubernetes, and AWS Lambda were still new, and few people were using them in production.

Since we published the book, I’m often asked whether infrastructure as code is relevant in the cloud-native world of serverless and service mesh. You might not be shocked to hear me say, “yes.” Even if you’re no longer worried about configuring packages and file permissions on virtual servers, you’re still better off using code to build your clusters and environments than building them by hand.

Looking over the first edition of the book takes me back to a different time. Most clients I worked with were building infrastructure on the cloud for the first time, and my ThoughtWorks colleagues and I were introducing them to automation as code. There is a lot of text in the first edition to help you explain to your skittish management why they shouldn’t fear public cloud.

The world now is different. The technical ecosystem is still in flux. But even the most risk-averse organizations - financial institutions, governments, healthcare organizations - are using public cloud to one degree or another. The question isn’t whether to use cloud and infrastructure as code, but how.

My typical client today already has an existing infrastructure codebase. Their challenge is that their system has sprawled into a complex morass of code. Tools have come a long way, but reusing, sharing, and organizing code is still not easy. People write complicated build scripts that are actually more fragile and confusing than the infrastructure code they apply. And automated testing for infrastructure is still a challenge.

Enter the second edition of Infrastructure as Code. What I first thought would be a gentle refresh has turned into an aggressive rewrite. Things are different, as I described above. But I’ve also spent a huge amount of time with many teams and people who are working with infrastructure projects. I’ve learned a lot since I wrote the first edition. I’ve learned about challenges, practices, and ideas for dealing with dynamic infrastructure. And I’ve learned better ways to communicate these.

So check out the early release, and please let me know what you think!

Platform service models

Many of our clients at ThoughtWorks are building internal services for development teams to use in developing, delivering, and running applications. I’m often asked what the model for this should be in terms of developer experience. Should developers be able to write their own infrastructure code? Should they package up and release their own Docker images on infrastructure that’s built for them?

I believe a layered approach is best. Teams should be able to reach for the level of tooling that suits their needs and capabilities. If there are solid services available within the organization, then they should be able to re-use those. If not, they should be able to use lower-level services provided for them, and follow examples and templates from other teams. If none of these exist, they should be able to build their own, following shared principles, policies, and governance.

Developer experience models

A good model for developer experience is:

  1. I prefer to just write my application code and push it to something that will run it. This is the serverless model, as well as the “build-pack” model used by Heroku and some PaaS platforms.
  2. If that’s not available or appropriate for my needs, then I prefer to package my code and runtime and push it. This is the container model, with a container orchestration service like Kubernetes or Nomad already available and running.
  3. If that’s not appropriate, I prefer to push my package with standardized Infrastructure as Code. That is, someone else has written infrastructure modules and/or code that I can grab and use to provision the infrastructure my application needs.
  4. If that’s not appropriate, then I prefer to write custom Infrastructure as Code in a standard tool (like Terraform, CloudFormation, etc.).
  5. If that’s not appropriate, I want to have an API (e.g. REST-based) so that I can write custom scripts.

A given application might rely on multiple platform services, some of which may be provided in different ways. For example, maybe I can write application code and push it to the platform to package and deploy it for me (#1), but I may need to write infrastructure code to provision and configure a database using a DBaaS model (#3), that my application connects to.

Clearly, #1 is the slickest thing to offer to your development teams. But it’s not practical to think your organization can provide this level of experience for everything every team will need, unless your needs are very simple.

In practice, centralized teams are best off first providing services in the simplest way possible, which will tend to be APIs, hopefully ones supported by standard tools. Typically this means installing packaged software, or opening up public cloud and SaaS accounts, and then giving teams access to configure and use these according to their needs.

Then, central teams can focus on incrementally building shared services which have the most commonality and value across teams.

This incremental approach is essential to avoid the all-too-common Big Platform Programme. These involve spending piles of money, and months and years of time, followed by the release of an underwhelming “MVP”, typically heavily restricted and limited. The result of this kind of platform strategy is that teams working on high profile projects persuade their executive sponsors to approve them going directly to public cloud. Meanwhile teams on lower profile projects remain on legacy platforms.

Service sharing models

Another dimension of platform services is the service model. Again, this is a hierarchy:

  1. Shared managed service. As a product team, I can just interact with something that someone else keeps running. This is essentially SaaS, the true cloud model. There are two variations. In a shared, multi-tenant service, all the customers use a single instance of the service. In a self-provisioned service, each customer provisions their own instance, as with a public cloud DBaaS where you can provision your own database instance. Either way, you will interact with it using one of the developer experience models described above.
  2. Shared service package. As a product team, I can spin up and run an instance of the service for myself, using a package provided for me. I am responsible for keeping it running, upgrading it, etc. This is essentially the packaged software model, although within an organization the package is probably more customized. An example is a server OS build that has been hardened and has standard agents for monitoring, etc., pre-installed.
  3. Shared code. As a product team, I can use code, templates, libraries, etc. that have been provided for me. For each shared code project, a central team provides a certain standard of support, manages versions, keeps it updated, etc.
  4. Example code. As a product team, I can copy code, templates, etc. from other teams. Those teams don’t commit to support me, won’t worry about releasing updates that will be easy for me to apply to my things, and may not even keep the code maintained.
  5. Principles, policies, and patterns. As a product team, I can write my own code to provide a service I need. I should follow shared principles (e.g. “Define infrastructure as code”), must follow policies and governance processes, and may follow patterns.

As with developer experience patterns, these can be built up incrementally from the bottom levels up. First, define common principles, policies, etc. that anyone who is building services should or must follow. Then, as teams begin building their own services, make it easy for them to share their code with each other. Identify services which are most commonly used, and build more maintained code libraries for those. Consider packaging them up so that teams can use them more easily. For some things which are very common, create managed services.

Rather than trying to define all of this up front, an evolutionary strategy allows services to emerge organically based on what is most important. Building shared services incrementally is more pragmatic, and creates value and feedback more quickly, than attempting to build a Big Platform Up Front.

Platform team models

This is a fairly big and complex topic. But briefly, I usually encourage organizations to avoid having a single big platform team, preferring multiple teams organized around concerns that hang together. Companies like Netflix, Spotify, and Etsy, who are all pretty good at this stuff, tend to have teams such as compute, observability, traffic, etc. This encourages building a collection of loosely coupled services and capabilities, as opposed to monolithic “platforms” which quickly become legacy.

These teams should also focus on building (and in some cases running) things for other teams to use, rather than using them for the other teams. For example, the conversation should not be, “Hey monitoring team, can you add some monitoring checks for my application?” Instead it should be, “Hey monitoring team, how do I add monitoring checks for my application?” The monitoring team keeps the monitoring service running, and provides support and help to teams who need it.

See also

The above is not official ThoughtWorks gospel, although it’s probably fairly well aligned. A bunch of smart ThoughtWorkers have defined and offer a more structured engagement model for clients working on platforms, the Digital Platform Strategy.


Thanks to current and former ThoughtWorks colleagues Peter Gillard-Moss, Karl Stoney, Moritz Heiber, and Bill Codding, for the email thread that spurred me to describe my thoughts on this. Particular thanks to Peter for encouraging me to blog it.

Term: Infrastructure Stack

The term Infrastructure Stack is something I’ve found useful to explain different patterns for organizing infrastructure code. An infrastructure stack is a collection of infrastructure elements defined and changed as a unit. Stacks are typically managed by tools such as Hashicorp Terraform, AWS CloudFormation, Azure Resource Manager Templates, Google Cloud Deployment Manager Templates and OpenStack Heat.

An infrastructure stack is a collection of infrastructure elements managed as a unit

AWS CloudFormation uses the term “stack” formally, but it maps to other tools as well. When using Terraform, which I tend to use for code examples, a stack corresponds to a Terraform project and its statefile.

Stack definitions

A stack definition is the code that declares what a stack should be. It is a Terraform project, CloudFormation template, and so on. A stack definition may use shared infrastructure code - for example, CloudFormation nested stacks or Terraform modules.

A stack definition is code used to provision stack instances

Below is an example stack definition, a Terraform project:

   ├── src/
   │   ├──
   │   ├──
   │   ├──
   │   ├──
   │   ├──
   │   ├──
   │   ├──
   │   ├──
   │   └──
   └── test/

This stack is part of my standalone service stack template project on GitHub.

Stack instances

A stack definition can be used to provision, and update, one or more stack instances. This is an important point, because many people tend to use a separate stack definition for each of their stack instances - what I called Separate stack definition for each environment in my post on environment pipelines.

But as I explained in that post, this approach gets messy when you grow beyond very simple setups, particularly for teams. I prefer to use a single, parameterized stack definition template, to provision multiple environment instances.

Multiple stack instances can be provisioned from a single stack definition

A properly implemented stack definition can be used to create multiple stack instances. This helps to keep multiple environments configured consistently, without having to copy code between projects, with potential for mistakes and mess. Nicki Watt has a great talk on Evolving Your Infrastructure with Terraform which very clearly explains these concepts (and more).

Another benefit of parameterized stacks is that you can very easily create ad-hoc environments. Developers working on infrastructure code locally can spin up their own stack instances by running Terraform with appropriate parameters. Environments can be spun up on demand for testing, demos, showcases, etc., without creating bottlenecks or “freezing” environments. I never want to hear, “nobody touch QA for the next 3 days, we’re getting ready for a big presentation!”
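As a rough sketch of how one parameterized definition can drive several instances, the Python below builds the terraform command line for each instance. The variable names mirror the ones used in the Makefile example in this post; the helper name apply_command and the instance names are hypothetical, and the sketch only prints the commands rather than running them.

```python
def apply_command(service, component, deployment_id):
    """Build the terraform command for one stack instance (hypothetical helper)."""
    variables = {
        "service": service,
        "component": component,
        "deployment_id": deployment_id,
    }
    command = ["terraform", "apply", "-auto-approve"]
    for name, value in variables.items():
        command += ["-var", f"{name}={value}"]
    return command


# One definition, several instances: only deployment_id differs.
for instance in ["staging", "prod", "sandbox-alice"]:
    print(" ".join(apply_command("myservice", "webserver", instance)))
```

A real run would also need each instance to use its own state, which this sketch leaves out.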

A note on state

One of the pain points of using Terraform is dealing with statefiles. All stack management tools, including CloudFormation, etc., maintain data structures that reflect which infrastructure elements belong to a given stack instance.

Stack state

CloudFormation and similar tools provided by cloud platform vendors have the advantage of being able to manage instance state transparently - they keep these data structures on their servers. Terraform and other third party tools need to do this themselves.

Arguably, the explicit state management of Terraform gives you more control and transparency. When your CloudFormation stack gets wedged, you can’t examine the state data structures to see what’s happening. And you (and third parties) have the option of writing tools that use the state data for various purposes. But it does require you to put a bit more work into keeping track of statefiles and making sure they’re available when running the stack management tool.
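One way to keep track of statefiles is to give every stack instance its own predictable location. The sketch below assumes remote state in a backend that accepts a key setting (such as Terraform's S3 backend) and builds the matching terraform init arguments. The path layout and the function names are assumptions for illustration, not something taken from my example project.

```python
def state_key(service, component, deployment_id):
    """Return a per-instance state path (this layout is an assumption)."""
    return f"{service}/{component}/{deployment_id}/terraform.tfstate"


def init_command(service, component, deployment_id):
    """Build a terraform init command that points at this instance's state."""
    key = state_key(service, component, deployment_id)
    return ["terraform", "init", f"-backend-config=key={key}"]


print(init_command("myservice", "webserver", "staging"))
```

Because the key is derived from the same parameters that identify the instance, two instances of the same definition never share a statefile.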

Example of parameterized infrastructure

I have an example of a parameterized infrastructure stack project on GitHub, using Terraform and AWS. Below is a (trimmed) snippet of code for a webserver.

resource "aws_instance" "webserver" {
  # (other arguments trimmed)
  tags {
    Name                  = "webserver-${var.service}-${var.component}-${var.deployment_id}"
    DeploymentIdentifier  = "${var.deployment_id}"
    Service               = "${var.service}"
    Component             = "${var.component}"
  }
}

This shows how a number of variables are used to set tags, including a Name tag, to distinguish this server instance from other instances of the same server in other stack instances.

These variables are passed to the terraform command by a Makefile:

TERRAFORM_VARS=\
  -var "deployment_id=$(DEPLOYMENT_ID)" \
  -var "component=$(COMPONENT)" \
  -var "service=$(SERVICE)"

up: init ## Create or update the stack
  terraform apply $(TERRAFORM_VARS) -auto-approve

The SERVICE and COMPONENT variables are set by a file in the stack definition project, and are always the same. The DEPLOYMENT_ID variable is passed to the Makefile when make is run. The pipeline stage configuration sets this, so for example, the production stage configuration (using AWS CodeBuild in this case) includes the following:

resource "aws_codebuild_project" "prodapply-project" {
  name = "${var.service}-${var.component}-${var.estate_id}-ApplyToProdEnvironment"

  # (other arguments trimmed)
  environment {
    environment_variable {
      "name"  = "DEPLOYMENT_ID"
      "value" = "prod"
    }
  }
}

The CodeBuild project simply runs make up, and our stack definition creates our webserver instance accordingly.

There’s more

The examples I’ve given imply each environment is a single stack. In practice, as environments grow, it’s useful to split them into multiple, independently managed stacks. I’ll elaborate on this in future posts.

Using Pipelines to Manage Environments

Tools like Terraform, AWS CloudFormation, Azure Resource Manager Templates, Google Cloud Deployment Manager Templates and OpenStack Heat are a great way to define server infrastructure for deploying software. The configuration to provision, modify, and rebuild an environment is captured in a way that is transparent, repeatable, and testable. Used right, these tools give us confidence to tweak, change, and refactor our infrastructure easily and comfortably.

But as most of us discover after using these tools for a while, there are pitfalls. Any automation tool that makes it easy to roll out a fix across a sprawling infrastructure also makes it easy to roll out a cock-up across a sprawling infrastructure. Have you ever corrupted the /etc/hosts files on every server in your non-production estate, making it impossible to ssh into any of them — or to run the tool again to fix the error? I have.

What we need is a way to make and test changes safely, before applying them to environments that we care about. This is software delivery 101: always test a new build in a staging environment before deploying it to live. But the best way to structure your infrastructure code to do this isn’t necessarily obvious.

I’m going to describe a few different ways people do this:

  • Put all of the environments into a single stack
  • Define each environment in a separate stack
  • Create a single stack definition and promote it through a pipeline

In a nutshell, the first way is bad; the second way works well for simple setups (two or three environments, not many people working on them); and the third has more moving parts, but works well for larger and more complex groups.

Before diving in, here are a couple of definitions: I use the term stack (or stack instance) to refer to a set of infrastructure that is defined and managed as a unit. This maps directly to an AWS CloudFormation stack, and also to the set of infrastructure corresponding to a Terraform state file. I also talk about a stack definition as a file, or set of files, used by a tool to create a stack. This could be a folder with a bunch of *.tf files for Terraform, or a folder of CloudFormation template files.

I’ll show some code examples using Terraform and AWS, but the concepts apply to pretty much any declarative, “Infrastructure as Code” tool, and any automated, dynamic infrastructure platform.

I’ll also assume two environments — staging and production, for the sake of simplicity. Many teams end up with more environments — development, QA, UAT, performance testing, etc. — but again, the concepts are the same.

One stack with all the environments

This is the most straightforward approach, the one that most people start out using, and the most problematic. All of the environments, from development to production, are defined in a single stack definition, and they are all created and managed as a single stack instance.

Multiple environments managed as a single stack

The code example below shows a single Terraform configuration for both staging and production environments:

resource "aws_vpc" "staging_vpc" {
  cidr_block = ""
}

resource "aws_subnet" "staging_subnet" {
  vpc_id     = "${}"
  cidr_block = ""
}

resource "aws_security_group" "staging_access" {
  name   = "staging_access"
  vpc_id = "${}"
}

resource "aws_instance" "staging_server" {
  instance_type          = "t2.micro"
  ami                    = "ami-ac772edf"
  vpc_security_group_ids = ["${}"]
  subnet_id              = "${}"
}

resource "aws_vpc" "production_vpc" {
  cidr_block = ""
}

resource "aws_subnet" "production_subnet" {
  vpc_id     = "${}"
  cidr_block = ""
}

resource "aws_security_group" "production_access" {
  name   = "production_access"
  vpc_id = "${}"
}

resource "aws_instance" "production_server" {
  instance_type          = "t2.micro"
  ami                    = "ami-ac772edf"
  vpc_security_group_ids = ["${}"]
  subnet_id              = "${}"
}

This is a simple approach that makes everything visible in one place, but it doesn’t isolate the environments from one another. Making a change to the staging environment risks breaking production. And resources may leak or become confused across environments, making it even easier to accidentally affect an environment you don’t mean to.

Charity Majors shared problems she ran into with this approach using Terraform. The blast radius (great term!) of a change is everything included in the stack. And note that this is still true even with tools that don’t use statefiles as Terraform does. Defining multiple environments in a single CloudFormation stack is asking for trouble.

A Separate stack definition for each environment

Charity (and others) suggests splitting your environments into separate stack definitions. Each environment would have its own directory with its own Terraform configuration, for example one directory for staging and one for production.


These two different configurations should be identical, or nearly so. By running your infrastructure tool against each of these separately, you isolate the environments from one another (at least when it comes to using the tool, although obviously they may or may not be isolated in terms of networking, cloud account permissions, etc.). And because each environment has its own set of definition files, the intended state of each environment is very clear.

Each environment managed as its own stack

When you need to make a change, you edit the files for the staging environment stack, apply them, and test them. Iterate until it works the way you want. Then make the same changes to the files for the production environment and apply them to the production stack instance.

A drawback with this approach is that it’s easy for differences to creep into environments. Some of these differences may be accidental, as when someone in a rush makes a fix directly to the production environment configuration, and forgets to go back to the other environments to make the same change.

But often, as with statically defined environments, people try things out in different environments, and leave them in when they’re distracted by other tasks. Over time, each environment becomes a snowflake. Even though each environment is defined as code, differences between the environments mean there is no trust that a given change can be quickly and safely applied across them all.

So maintaining a separate definition file per environment requires vigilance and discipline to keep them consistent.

Modules can be used to share code across environments, which can help with consistency. But modules can also create tight coupling. Making a change to a shared module requires ensuring the change won’t break other environments. The ability to test a change to a module in one environment before applying it to production adds complexity for versioning and release management.

One stack definition managed with a pipeline

An alternative is to use a continuous delivery pipeline to promote a stack definition file across environments. Each environment has its own stack instance, so the blast radius for a change is contained to the environment.

But a single stack definition is re-used to create and update each environment. The definition can be parameterized, to capture differences between instances, such as cluster sizing. The definition file is versioned, so we have visibility of what code was used for any environment, at any point in time.

One definition used for multiple environments

A single definition file used to create multiple stack instances in a pipeline

The moving parts for implementing this are: a source repository, such as a git repository; an artefact repository; and a CI or CD server such as GoCD, Jenkins, etc.

A simple example workflow is:

  1. Someone commits a change to the source repository.
  2. The CD server detects the change and puts a copy of the definition files into the artefact repository, with a version number.
  3. The CD server applies the definition version to the first environment, then runs automated tests to check it.
  4. Someone triggers the CD server to apply the definition version to production.
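
The workflow above can be sketched as a small simulation. Nothing here talks to a real CD server: the apply and run_tests callables are hypothetical stand-ins, and "staging" stands for the first environment. The point is only the ordering, with one immutable version moving through the stages and production gated on a passing test stage plus a manual trigger.

```python
class Pipeline:
    """Toy model of promoting one versioned stack definition (illustrative only)."""

    def __init__(self, apply, run_tests):
        self.apply = apply          # callable(environment, version)
        self.run_tests = run_tests  # callable(environment) -> bool
        self.artefacts = {}         # version -> frozen copy of the definition
        self.passed = set()         # versions that passed the test stage

    def on_commit(self, version, definition_files):
        # Step 2: store an immutable copy under a version number.
        if version in self.artefacts:
            raise ValueError(f"version {version} already published")
        self.artefacts[version] = dict(definition_files)
        # Step 3: apply to the first environment and test it.
        self.apply("staging", version)
        if self.run_tests("staging"):
            self.passed.add(version)

    def promote_to_production(self, version):
        # Step 4: a person triggers this; untested versions are refused.
        if version not in self.passed:
            raise RuntimeError(f"version {version} has not passed staging")
        self.apply("production", version)


applied = []
pipeline = Pipeline(
    apply=lambda env, version: applied.append((env, version)),
    run_tests=lambda env: True,
)
pipeline.on_commit("1.0.123", {"main.tf": "# stack definition"})
pipeline.promote_to_production("1.0.123")
print(applied)
```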

Basic flow of a stack definition through a pipeline


Teams I work with use pipelines for infrastructure for the same reasons that development teams use pipelines for their application code. It guarantees that every change has been applied to each environment, and that automated tests have passed. We know that all the environments are defined and created consistently.

With the previous “one stack definition per environment” approach, creating a new environment requires creating a new folder with its own copy of the files. These files then need to be maintained and kept up to date with changes made to other environments.

But the pipeline approach is more flexible. New environment instances can be spun up on demand, which has a number of benefits:

  • Developers can create their own sandbox instances, so they can deploy and test cloud-based applications, or work on changes to the environment definitions, without conflicting with other team members.
  • Changes and deployments can be handled using a blue-green approach — create a new instance of the stack, test it, then swap traffic over and destroy the old instance.
  • Testers, reviewers, and others can create environments as needed, tearing them down when they’re not being used.

Artefact repository, promotion, and versioning

The pipeline treats a stack definition as a versioned artefact, promoting it from one stage to the next. As per Continuous Delivery, an artefact is never changed. This is why I suggest a separate artefact repository in addition to the version control repository. Version control is used to manage changes to the definition; the artefact repository is used to preserve immutable, versioned artefacts. However, I’ve seen people use their code repository for both, which works fine as long as you stick to the principle of not changing a definition version once it’s been “published” and used for any environment.

For stack definitions, I like to use an S3 bucket (or an equivalent) as a repository. I have a pipeline stage that “publishes” the definitions by creating a folder in the bucket with a version number in the name, and copying the files into it. This code is executed by a CI/CD server agent:

aws s3 sync ./our-project/ s3://our-project-repository/1.0.123/

Promoting a version of a stack definition can be done in different ways. With the S3 bucket, I’ll sometimes have a folder named for the environment, and copy the files for the relevant version into that folder. Again, this code runs from a CI/CD agent:

aws s3 sync --delete \
 s3://our-project-repository/1.0.123/ \

The pipeline stage for any environment simply runs Terraform (or CloudFormation), grabbing the definitions from the relevant folder.
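
The publish and promote steps can be sketched with a local directory standing in for the bucket. This is not a replacement for the S3 commands above, just an illustration of the rule that a published version is never changed. The function names and folder layout are assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def publish(source_dir, repository, version):
    """Copy the definition into a new versioned folder; never overwrite."""
    target = Path(repository) / version
    if target.exists():
        raise FileExistsError(f"{version} is already published")
    shutil.copytree(source_dir, target)
    return target


def promote(repository, version, environment):
    """Make the environment folder an exact copy of the given version."""
    source = Path(repository) / version
    target = Path(repository) / environment
    if target.exists():
        shutil.rmtree(target)  # drop stale files, like a sync with deletion
    shutil.copytree(source, target)
    return target


# Usage with throwaway directories.
work = Path(tempfile.mkdtemp())
project = work / "our-project"
project.mkdir()
(project / "main.tf").write_text("# stack definition\n")
repo = work / "repository"
repo.mkdir()

publish(project, repo, "1.0.123")
promote(repo, "1.0.123", "staging")
print(sorted(p.name for p in (repo / "staging").iterdir()))
```

Refusing to overwrite an existing version is what makes it safe to say exactly which code any environment was built from.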

Running the tool

The commands are pretty much all run by the CI or CD server on an agent. This does a few good things. First, it ensures that the command is fully automated. There’s no relying on “just one” manual step or tweak, which expands with “just one more, for now”. You don’t end up with different people doing things in their own way. You don’t end up with undocumented steps. And you don’t have to worry, in an emergency situation, that someone will forget that crucial step and bork the environment. If there’s a step that can’t easily be automated, then find a way.

Another good thing is that you don’t worry about different people applying changes to the same environment. Nobody applies changes to a common environment by running the tool from their own machine. No need to worry about locking and unlocking. No need to worry about locally edited changes. The only changes made to an environment you care about are the ones that have been fed into the pipeline.

This also helps with consistency. Nobody can make a change that requires a special plugin or utility they’ve installed on their own laptop. There is no tweak that someone made when applying a change to staging, that might be forgotten for production. The tool is always run from a CD agent, which is built and configured consistently, always using the same script, no matter which environment it runs from.

Developer workflow

People working on the infrastructure probably need to make and test changes to stack definitions rapidly, before committing into the pipeline. This is the one place you may want to run the tool locally. But when you do, you’re doing it with your own instance of the environment stack. Pull the latest copy of the definitions from the source repo, run the tool to create the environment, then start making changes. (Write some automated tests, of course, which should also be in the source repo and automatically run in relevant environments in the pipeline.)

When you’re happy, make sure you’ve pulled the latest updates from master, and that your tests pass. Then commit the changes and destroy your environment.

No need to worry about breaking a common environment like staging. If someone else is working on changes that might conflict with yours, you should see them when you pull from master. If the changes merge OK, but cause issues, they should be caught in the first test environment in the pipeline. The build goes red, everyone stops and looks to see what needs to be done to fix it.

When is this appropriate?

I’ve been using pipelines to manage infrastructure for years. It seemed like an obvious way to do it, since ThoughtWorks teams use pipelines for releasing software as a matter of course. But recently I’ve been running across experienced people who maintain a separate definition file for each environment.

In some cases I think people simply haven’t been exposed to the idea of using a pipeline for infrastructure. Hence this article. But I’m also aware that these two different approaches (and maybe others I’m not aware of) are appropriate for different contexts.

The pipeline approach has more moving parts than simply having a separate stack definition for each environment. For smaller teams, with a simple release process and not many environments, a pipeline may be overkill. But when you have more people working on a system, and especially when there are multiple teams and roles, a pipeline creates a consistent, reliable process for rolling out infrastructure changes. It may also make sense even for smaller teams who are already using a pipeline for their software to add their infrastructure files into it, rather than having a separate process for environments and applications.

There are some caveats. It takes time to get comfortable with making infrastructure changes using a pipeline. You may not get as much value without at least some degree of automated testing, which is another discipline to build. Running infrastructure management tools automatically from a CD server agent requires a secure approach to managing secrets, to prevent your automation system from becoming an easy attack vector.

Beyond pipelines

The pipeline approach I’ve described answers the question of how to manage multiple instances of a particular environment, in a “path to production”. There are other challenges with managing a complex infrastructure codebase. How do you share code, infrastructure, and/or services between teams? What’s the best way to structure and release modules of shared infrastructure code?

I’ve also spent some time thinking about patterns I’ve seen for splitting a single environment into multiple stacks, which I described in a talk. If this article is about applying principles and practices from Continuous Delivery to infrastructure, that talk was about applying principles from Microservices to infrastructure.

I hope this article gives people food for thought, and would like to hear what other approaches people have tried and found useful.


Aside from inspiration from Charity Majors and Yevgeniy Brikman, I received feedback from my ThoughtWorks colleagues Nassos Antoniou, Andrew Langhorn, Pat Downey, Rafael Gomes, Kevin Yeung, and Joe Ray, and also from Antonio Terreno.

This article was originally posted on Medium and on ThoughtWorks Insights.

Infrastructure as Code has been published

Infrastructure as Code is officially published! The “Pre-order” buttons on Amazon and the O’Reilly shop have flipped to “Add to Cart” for both print and e-book formats. I’m told that “boxes and boxes” of the book have arrived in the ThoughtWorks office, although I’m not in today so I don’t have my greasy mitts on a copy yet.

Links to various sites to order the book are over on the right of this page.

Screenshot of the order page on Amazon

This book has been a long haul. It’s hard to express how important the support of so many people has been. ThoughtWorks has been incredibly supportive, on so many levels. I doubt I would have done it without the examples and inspiration of my amazing colleagues across the globe. Martin Fowler has been a tremendous mentor. Rong Tang did the artwork, showing infinite patience as she turned inconsistent, muddled, and incoherent scribbles into great looking imagery.

I owe a great debt to the people of the DevOpsDays community, who are collectively responsible for the ideas of DevOps and Infrastructure as Code. I took a stab at collating what I’ve learned from many people’s ideas and experiences into something that hopefully will help other people as much as this community has helped me.