One of the earliest benefits that drew people like me to infrastructure as code was the promise of eliminating snowflake servers.
In the old times, we built servers by logging into them and running commands. We might build, update, fix, optimize, or otherwise change servers in different environments in different ways at different times. This led to configuration drift, inconsistencies across environments.
Thanks to snowflakes and configuration drift, we spent huge amounts of effort to get an application build that worked fine in the development environment to deploy and run in production.
Flash forward 10+ years, infrastructure as code has become commonplace, helping us to manage all kinds of stuff in addition to, and often instead of, servers. You’d think snowflake infrastructure would be a thing of the past.
But it’s actually quite common to see people following practices that lead to differences between instances of infrastructure - snowflakes as code.
Antipattern: Snowflakes as code
Snowflakes as code is an antipattern where separate instances of infrastructure code are maintained for multiple instances of infrastructure that are intended to be essentially the same.
A common example is when multiple environments are provisioned as separate instances of infrastructure, each with its own separate copy of the code. These code instances are snowflakes when differences between the infrastructure instances are maintained by differences in the code.
When someone makes a change to the code for one instance, they copy or merge the change to other instances. The process for doing this is usually manual, and involves effort and care to ensure that deliberate differences between instances are maintained, while avoiding unintended differences.
This antipattern also occurs when infrastructure is replicated for different deployments of similar applications - for different customers, for example - or to deploy multiple application instances in different regions.
Different instance of infrastructure, even ones intended to be consistent, will always need some variations between them. Resources like clusters and storage may be sized differently for a test environment than for production, for example. If nothing else, resources may need different names, such as database-test, database-staging, and database-prod.
Maintaining a separate copy of infrastructure code for each instance is an obvious way to handle these variations.
The issue with maintaining different versions of infrastructure code for instances that are intended to be similar is that it encourages inconsistency - configuration drift. Once you accept editing code when copying or merging it between instances as a way to handle configuration, it becomes easy for larger differences to persist. For example:
- I make a fix to the production infrastructure, but don’t have time to copy it back to upstream environments. The fix then clashes with changes you make in upstream environments.
- I’m working on a fairly complex change in the staging environment that drags on for days, or longer. Meanwhile, you need to make a small, quick fix and take it into production. Testing in staging becomes unreliable because it doesn’t currently reflect production.
- We need to define security policies differently in production than for non-production environments. We implement this with different code in each environment, and hope nobody accidentally copies the wrong file to the wrong place.
Another consequence is the likelihood of making a mistake when copying or merging changes from one instance to the next. Don’t forget to copy/replace every instance of staging to prod! Don’t forget to change the maximum node count for the database cluster from 2 to 6! Ooops!
The two main ways people implement snowflakes as code are folders and branches.
Teams who use branches to maintain infrastructure code for each of their environments (as described below under Implementation) often do this because they are using GitOps. GitOps uses tools that apply code from git branches to the infrastructure, so encourages maintaining a separate branch for each environment.
It’s possible to use branches this way without them becoming snowflakes, as described below in Alternatives. But when your process for promoting code involves merging and tweaking code to maintain instance-specific differences, then you’ve got snowflakes as code.
Other teams use a folder structure to maintain separate projects for each environment. They copy and edit code between projects to make changes across environments. Again, it’s the need to edit files when copying them to a new environment that signals this antipattern.
An alternative to snowflakes as code is to reuse a single instance of infrastructure code for multiple instances of the infrastructure.
You can maintain multiple versions of the code so that you can apply changes to different instances at different times, for example so you can have a pipeline to deliver changes to environments in a path to production.
But code for an existing version should never be edited. This is Continuous Delivery 101 - only make changes in the origin (for example, trunk), then copy the code, unmodified, from one environment to the next.
Using an automated process to promote infrastructure code from one instances to the next reduces the opportunity for manual errors. It also removes the opportunity to “tweak” code to maintain differences across environments, forcing better discipline.
If the need for a change is discovered in a downstream environment, the change is first made to the origin, then progressed to the downstream environment without modifications. This ensures that every code change has been put through all of the tests and approvals needed.
As mentioned earlier, there usually is a need for some variations between instances, such as resource sizing and names. These variations should be extracted into per-instance configuration values, and passed to the code when it is applied to the given instance. Chapter 7 of my book covers different patterns for doing this, including configuration files and configuration registries.
Many teams follow the common development pipeline pattern of having a build stage that bundles the infrastructure code into a versioned artifact, storing it in a repository, and using that to ensure consistency of code from one environment to the next. A simple implementation of this pattern can be implemented using tarballs and centralized storage like an S3 bucket.
Tools like Terraform support multiple instances of infrastructure with different versions of the same code using workspaces.
Github introduced the pull request practice, and features to support it, to make it easier for people who run open-source projects to accept contributions from outside their group of trusted committers.
Committers are trusted to make changes to the codebase routinely. But a change from a random outsider needs to be assessed to make sure it works, doesn’t take the project in an unwanted direction, and meets the standards for style and quality. The outsider packages their proposed change as a pull request, which a committer can easily review and manage as a unit before merging it into the codebase.
Figure 1: Pull request process
Although designed to make it easier to accept contributions from untrusted people outside a team, many teams now use pull requests for people inside their own team. This practice has become so common that many people consider it a default, “best” practice. Some people assume there is no other way to make sure code is reviewed because they’ve never seen anything else.
However, pull requests sacrifice performance, including both delivery time and quality. This is a sacrifice worth making to manage the risk of accepting changes from unknown people. An outsider may not understand the vision and direction of your project. They may not have the same habits and norms for testing, code quality, and style. However, your own team members should share these norms.
Using pull requests for code changes by your own team members is like having your family members go through an airport security checkpoint to enter your home. It’s a costly solution to a different problem.
Using Continuous Integration rather than pull requests
A software delivery process should optimize for flow and quality. Keep the lead time for changes low, and give fast feedback when a change introduces a problem. This is the idea that underpins Continuous Integration (CI). CI is the practice of continuously merging and testing everyone’s code as they work on it.
Figure 2: Continuous Integration process
“As they work on it” is essential. As a team member, you don’t wait until you have finished a feature or story to integrate your code to the mainline. Instead, you frequently - at least once a day - put your code into a healthy state that passes tests and integrate it into the mainline with everyone else’s current work. (Also see Martin Fowler’s article on branching patterns and Paul Hammant’s trunk-based development site.)
A CI build job automatically tests the project’s mainline every time you push a change. This means you find out immediately if what you’re doing clashes with something another person is working on before either of you has invested too much time. It sucks to think you’ve finished a story or feature, only to discover you’ve got to go back and untangle and redo several days of effort.
Figure 3: Tests run on integrated code on every push
The trouble with pull requests
A pull request introduces a delay to integration. When you complete work that you consider ready to integrate with the rest of the team, you create a pull request and wait for someone to review it. Only after someone else reviews the change do they integrate it with the mainline.
If team members are quick to review and integrate pull requests, this is only slightly slower than CI. Maybe they respond and review your change within 30 minutes every time you push. Your code change is integrated with the mainline and automated tests run against it. So you may discover a clash with someone else’s work after 30-40 minutes or so.
Figure 4: Delays in feedback with pull requests versus CI
In practice, not many teams reliably turn pull requests around in under 30 minutes. While waiting for someone to review your change, you may switch to another task or start working on a new change. When you find out there was a problem, you need to switch gears back to the original change, disrupting your flow of work.
An effective CI build, on the other hand, should finish testing your integrated code within a few minutes after you push it - up to 10 minutes in our scenario. You discover that clash almost immediately, so you can investigate and fix it while it’s fresh in your mind.
You don’t need to interrupt someone else’s work to ask them to review it before you get the feedback from testing fully integrated code. As I’ll explain shortly, you may still have someone review your changes. But you can take advantage of a faster cycle time to commit, integrate, and test your code to make multiple changes before asking them to review.
Even if everyone in the team turns pull requests around quickly, the typical practice is to wait until completing work on a feature or story before integrating a pull request with the mainline. Most teams take longer than a day, on average, to develop a story. So a typical pull request process doesn’t meet the minimum requirement of Continuous Integration to integrate everyone’s work at least daily.
Working in a rhythm of coding, pulling, testing, pushing, and getting feedback from integrated tests several times a day is electrifying. And it isn’t possible with pull requests that introduce a human delay into the rhythm.
Better ways to review code changes
When the topic of CI versus pull requests comes up, someone inevitably defends pull requests as necessary to get feedback from other team members on changes.
It is essential to have a second pair of eyes (if not more) looking at code changes. Humans catch problems that tests don’t, especially problems related to maintainability and sound design. Having people review each others’ code also helps the team converge on norms for coding style, programming idioms, and quality expectations. And in some cases, such as regulated environments, having each change reviewed by a second person is required.
However, the recent popularity of pull requests seems to have resulted in some people assuming there are no other ways to review code changes. Here are a few practices that you can use instead, without interrupting the Continuous Integration feedback cycle. Keep in mind that it’s entirely possible to combine more than one of these as appropriate.
Figure 5: Pairing for immediate, continuous code review
Pair programming: No form of code review is more effective than pairing. Feedback is immediate, so there is a far higher chance you will use it to make improvements. If someone tells you as you write some code that there’s a better way, you can stop, learn, and write it in that better way, right then. If someone tells you a day later, you might take it on board for future reference. But it needs to be a serious problem to get you to stop your current work to go back and redo something you’ve already finished.
Periodic reviews: If a review is not explicitly required for compliance, it may not need to be a gate for each code change. You might have regular, scheduled reviews, for example weekly, where people check through code changes since the last review. This can be especially potent as a group exercise since it creates conversations that help people learn and shape the team’s norms for coding.
Pipeline approvals: If your team uses a Continuous Delivery pipeline to deliver changes to production, you can include a stage that requires someone to authorize the change to progress. This is conceptually similar to a pull request in that it is a gate in the delivery process, but you place the gate after code integration and automated tests. Doing this means that a human only spends time reviewing code that has already been proven technically correct.
Figure 6: Review changes after they are integrated and tested
Pull requests differ from Continuous Integration in having a human review a code change after writing it but before integrating it with the mainline. This creates a delay in getting feedback from automated tests against fully integrated code.
With Continuous Integration, code is either reviewed as it is written (pairing), or after it is integrated and tested. Optimizing the loop for integrating and testing changes means you can run this loop more frequently. A more frequent coding and integration loop encourages developers to make smaller and more frequent commits, which improves quality and flow.
The second edition of Infrastructure as Code is out! Mostly. E-Books are available now (Amazon.com | Amazon.co.uk | Amazon.in | O’Reilly), while the dead-tree version is trundling across rails, roads, and sea lanes towards your local bookshop. I’m told to expect it out in January 2021.
This is super exciting for me, and I hope people find the new edition useful. I talk a bit about the book on the book page. I rewrote pretty much the entire book - 4 years is a long time in this field.
Most infrastructure projects I’ve been involved with have a script, or usually a set of scripts that act like a build tool for software projects. These are often implemented using Makefiles, shell scripts, batch scripts, Rakefiles, or languages like Python and Ruby.
These project orchestration scripts do many jobs, depending on the project. Some of the jobs include:
Assemble and package project code for use. This might include pulling libraries and other dependencies. It could even involve downloading the infrastructure tools and packaging everything as a container image, creating an executable project.
Run static tests and possibly other offline tests (for example, using tools like Localstack) on the code outside the context of an instance of the infrastructure.
Assemble configuration values for a given instance of the infrastructure. These values might come from configuration files, parameter registries, existing infrastructure, or a combination of these.
Execute the infrastructure tool for an instance. This includes running the plan command for tools that support it and creating, changing, and destroying infrastructure.
Orchestrate commands across multiple infrastructure components and projects. For example, if different parts of an environment are built from different Terraform projects, the script might run commands for each project in the correct order, based on the dependencies between them.
Run tests against an instance of the infrastructure.
Many infrastructure project orchestration scripts handle a combination of these jobs. This tends to create messy, complicated code. Any code, including orchestration scripts, should follow good software design principles, including SOLID, DRY, and Separation of Concerns. Orchestration scripts should separate the different jobs and concerns into different parts, rather than having a master script that knows all. The Unix philosophy applies here.
Another issue with many infrastructure project scripts is that they are snowflakes, custom-built for each project. The script code often embeds knowledge of the projects it orchestrates, such as dependencies between projects and the names of configuration parameters each project needs. And team members spend considerable time and energy designing, implementing, and fixing their unique system of scripts.
I don’t believe there is value in building and maintaining unique scripts for an infrastructure project. Most of the differences in infrastructure build projects I’ve seen don’t come from meeting the project’s specific needs, but rather from the specific knowledge and preferences of the people who built the project.
So I’m interested in standardized tools to orchestrate infrastructure projects. I’d like to see opinionated tools that prescribe how to structure directories, manage configuration values, and integrate multiple projects. The challenge is finding a tool with the right opinions, “right” meaning I agree with them!
I’ll save elucidating the opinions I would agree with for another post. For now, here’s a list of tools that I’m aware of. At this point, I haven’t looked at these close enough to compare them with my own opinions about infrastructure project design.
Orchestration tools for Terraform
- Astro, a tool for managing multiple Terraform executions as a single command. Seems to focus on wiring Terraform modules together.
- Rake Terraform, libraries for running Terraform from Rake tasks. A part of the Infrablocks project
- Tau, Terraform Avinor Utility, another tool that orchestrates Terraform modules and configuration.
- Terragrunt, a thin wrapper that provides extra tools for keeping your configurations DRY, working with multiple Terraform modules, and managing remote state.
- Terraspace is an opinionated, convention over configuration tool that provides a project layout and handles configuration and integration of multiple projects.
- Terraform Scaffold orchestrates Terraform modules and configuration across multiple environments and components on AWS.
- Terranova - library to help you write golang code that implements terraform commands without the binary. So you can combine project orchestration and your infrastructure definitions, which sounds like an invitation to write code that spectacularly fails to separate concerns. But the possibilities are intriguing.
Orchestration tools for CloudFormation
There must be more of these than I know of. I’ve listed a couple that aren’t current but could be interesting.
- Rain, a development workflow tool for working with AWS CloudFormation. Currently in preview, not production-ready.
- Autostacker24, a Ruby utility to manage AWS CloudFormation stacks. I may or may not have been present for this tool’s conception, including suggesting the name. I’m not sure how active development is.
- cfnassist, a cloud formation helper tool. Not very active development.
I’ve delivered the second edition of the book to O’Reilly’s production department, and the wheels are turning to have it available by the end of the year. See the Book page for details on pre-ordering.
Why I wrote the first edition
The benefits of infrastructure as code don’t come from the tools themselves. They come from how you use them. The trick is to leverage the technology to embed quality, reliability, and compliance into the process of making changes.
I wrote the first edition of this book because I didn’t see a cohesive collection of guidance on how to manage infrastructure as code. There was plenty of advice scattered across blog posts, conference talks, and documentation for products and projects. But you needed to sift through everything and piece a strategy together for yourself, and most people didn’t have the time.
The experience of writing the first edition was amazing. It gave me the opportunity to travel and talk with people around the world about their own experiences. These conversations gave me new insights and exposed me to new challenges. I learned that the value of writing a book, speaking at conferences, and consulting with clients is that it fosters conversations. As an industry, we are still gathering, sharing, and evolving our ideas for managing infrastructure as code.
Why a second edition
Things have moved along since the first edition came out in June 2016. That edition was subtitled “managing servers in the cloud,” which reflects that most infrastructure automation until that point focused on configuring servers. Since then, containers and clusters have become a much bigger deal, and the infrastructure action has moved to managing collections of infrastructure resources provisioned from cloud platforms, what I (and many but not all other people) call stacks.
So the new edition talks a lot more about building stacks, the remit of tools like CloudFormation, Terraform, and Pulumi.
I’ve changed quite a bit based on what I’ve learned about the evolving challenges and needs of teams building infrastructure. As I’ve already touched on, I see making it safe and easy to change infrastructure as the key benefit of infrastructure as code. I believe people underestimate the importance of this, thinking that infrastructure is something you build and forget.
But too many teams I meet struggle to meet the needs of their organizations, not able to expand and scale quickly enough, support the pace of software delivery, or provide the reliability and security expected. And when we dig into the details of their challenges, it’s that they are overwhelmed by the need to update, fix, and improve their systems. So I’ve doubled down on this as the core theme of the second edition.
The new edition introduces three core practices for using Infrastructure as Code to make changes safely and easily. Define everything as code is obvious from the name, and creates repeatability and consistency. Continuously integrating, testing, and delivering each change enhances safety. It also makes it possible to move faster and with confidence. Small, independent pieces are easier and safer to change than larger pieces.
These three practices are mutually reinforcing. Code is easy to track, version, deliver across stages of a change management process. It’s easier to continuously test smaller pieces. Continuously testing each piece on its own forces you to keep a loosely coupled design.
These practices and the details of how to do them are familiar from the world of software development. I drew on agile software engineering and delivery practices for the first edition of the book. For the new edition I’ve also drawn on rules and practices for effective design.
In the past few years I’ve seen teams struggle with larger and more complicated infrastructure systems, and seen the benefits of applying lessons learned in software design patterns and principles. So I’ve included several chapters on how to do this.
I’ve also seen that organizing and working with infrastructure code is difficult for many teams, so I’ve addressed various pain points I’ve seen. How to keep codebases well organized, provide development and test instances for infrastructure, and manage collaboration of multiple people, including those responsible for governance.
I don’t believe we’ve matured as an industry in how we manage infrastructure. I’m planning to write a bit more on this blog and elsewhere on what I see as ways we can do better. I’m also hoping to assemble examples of infrastructure code that illustrate how to do this.