Infrastructure as Code

Measuring Infrastructure Delivery Effectiveness

2024-11-11T06:20:20+00:00

Measures are useful for making decisions about infrastructure delivery workflows and practices, as well as about design and team organization. The four key metrics identified by the DORA group are a good base set of measures. The performance of software delivery processes on these metrics, although also affected by other factors, is an indicator of how well environments, platforms, and infrastructure for delivery and production hosting enable delivery effectiveness.

As a refresher, the four key metrics are:

Delivery lead time: The elapsed time it takes to implement, test, and deliver changes to the production system

Deployment frequency: How often changes are deployed to production systems

Change fail percentage: What percentage of changes either cause an impaired service or need immediate correction, such as a rollback or emergency fix

Mean Time to Restore (MTTR): How long it takes to restore service when there is an unplanned outage or impairment
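As a rough illustration, the four metrics can be computed from a log of change records. This is a minimal sketch, not a standard implementation: the `Change` record shape and the `four_key_metrics` function are hypothetical, and time to restore is approximated as the time from a failed deployment to restoration.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Change:
    """One change delivered to production (hypothetical record shape)."""
    started: datetime                    # work on the change began
    deployed: datetime                   # change reached production
    failed: bool = False                 # impaired service or needed immediate correction
    restored: Optional[datetime] = None  # when service was restored, if the change failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def four_key_metrics(changes: list[Change], period_days: int) -> dict[str, float]:
    """Summarize the four key metrics over a reporting period."""
    if not changes or period_days <= 0:
        raise ValueError("need at least one change and a positive period")
    failures = [c for c in changes if c.failed]
    # Assumption: restore time is measured from the failed deployment.
    restore_hours = [
        hours_between(c.deployed, c.restored)
        for c in failures
        if c.restored is not None
    ]
    return {
        "lead_time_hours": mean(hours_between(c.started, c.deployed) for c in changes),
        "deploys_per_day": len(changes) / period_days,
        "change_fail_percentage": 100 * len(failures) / len(changes),
        "mttr_hours": mean(restore_hours) if restore_hours else 0.0,
    }
```

The same function works whether the records describe application changes or infrastructure changes, which is the point of measuring both with the same metrics.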

The four key metrics can also be measured for changes to infrastructure and platform services, giving an idea of the effectiveness of infrastructure delivery systems and processes. Other metrics that may be useful for infrastructure delivery include:

Effort: How much expert time is needed to complete a change? Self-service systems and other automation can reduce this number.

Toil: How much of your infrastructure and platform team members’ time is spent on work that could potentially be removed, typically repetitive, manual, tactical work? See “Eliminating Toil” (https://sre.google/sre-book/eliminating-toil/) from the book Site Reliability Engineering.

Version Spread: How many different versions of a given piece of software, infrastructure, or system are currently deployed across the estate? Keeping systems patched and upgraded reduces the time needed to maintain them and reduces the number of potential vulnerabilities.

Utilization: How often are environments and other infrastructure actually in use? Replacing static, long-running environments with dynamically provisioned “Environments as a Service” can reduce waste.
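Two of these measures are simple enough to sketch in code. The function names and input shapes below are assumptions made for illustration: version spread is counted from a mapping of system name to deployed version, and utilization is the fraction of provisioned hours in which an environment was actually used.

```python
from collections import Counter


def version_spread(deployed: dict[str, str]) -> dict[str, int]:
    """Count how many systems run each version (input: system name -> version)."""
    return dict(Counter(deployed.values()))


def utilization(hours_in_use: float, hours_provisioned: float) -> float:
    """Fraction of provisioned time that an environment was actually in use."""
    if hours_provisioned <= 0:
        raise ValueError("hours_provisioned must be positive")
    return hours_in_use / hours_provisioned
```

A long-running test environment used 42 hours in a 168-hour week would score 0.25, which is the kind of number that makes a case for provisioning environments on demand.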

Some interesting infrastructure tools

2024-10-28T04:20:20+00:00

Tools that seem interesting

I’ve been advocating better ways of working for delivering infrastructure as code. There are a number of interesting tools around that seem to be addressing different aspects of the problem with different approaches. I have little or no experience with most of these tools, so I’m not necessarily recommending them as things you really should be using. But I do recommend giving them a look to see if they can help you.

Infrastructure composition

Most teams I’ve worked with have a set of hand-crafted infrastructure code orchestration scripts. These scripts usually manage the deployment of infrastructure stacks, including sequencing, configuration, and integration of multiple stacks. A number of tools are cropping up to replace the need to build your own set of scripts. Some of the ones I’ve heard about include:

Atmos from Cloud Posse
Gruntwork stacks, a proposal from the creators of Terragrunt
Infrablocks from a group of my former Thoughtworks colleagues at Atomic Innovation
Score from the creators of Crossplane
Terraform stacks
Terraspace

Infrastructure deployment tools

Many of us run infrastructure deployment tools, or our orchestration scripts, from a pipeline service like GitHub Actions. However, another area where vendors and products are popping up is services and tools to manage this process. Some are combined with infrastructure composition languages and tools.

Many infrastructure as code tool vendors offer hosted deployment services, including Pulumi Cloud and Terraform Cloud, usually as part of a suite of services.

Some IaaS cloud vendors also have services to run infrastructure code. The AWS Service Catalog supports both CloudFormation and Terraform. Google Cloud Infrastructure Manager and Oracle Cloud Resource Manager both support Terraform.

Some third party vendors provide solutions that work with multiple infrastructure as code tools, such as env0, Garden, and Spacelift.

There are also the so-called TACoS (Terraform Automation and Collaboration Software), which are specifically for use with Terraform and OpenTofu. TACoS products and services include:

Atlantis
Control Monkey
Digger
Gruntwork DevOps Foundations
Harness
Scalr
Terrakube
Terramate
Terrateam

Infrastructure as Data

GitOps uses a control loop to continuously synchronize a deployed application with its configuration, usually using Kubernetes. Infrastructure as Data applies the same pattern to infrastructure code. Infrastructure definitions are stored in a central service, again usually a Kubernetes cluster, which uses a control loop to continuously synchronize the definitions with the running infrastructure.

Examples of tools and services for Infrastructure as Data include:

ACK
Crossplane
GCP Config Connector
Azure Service Operator

Although most infrastructure as data systems are based on Kubernetes, IaSQL is an intriguing alternative that uses the PostgreSQL database instead. Infrastructure code is written in SQL, and a connector synchronizes the definitions in the database with infrastructure hosted on AWS. (As of this writing there is not much activity on the IaSQL project. Whether it’s active or not, it’s an interesting example of different ways of approaching infrastructure as code.)

Infrastructure from Code

Another category of tool embeds infrastructure code into application code, moving the boundary between applications and infrastructure. Examples of these include Darklang, Nitric, and Winglang.

This pattern seems particularly popular with serverless developers, leveraging the ability of infrastructure tools like AWS CDK to write infrastructure code in the same language used to write the application code. This serverless Hello World application tutorial is a typical example of infrastructure from code.

See also Gregor Hohpe’s discussions of approaches to application and infrastructure code in his article, IxC: Infrastructure as Code, from Code, with Code.

Infrastructure as Model

System Initiative is introducing a way of managing infrastructure by maintaining a data model that is dynamically synchronized with the running infrastructure. It’s a bit like Terraform state files, but far more dynamic and programmable.

New Early Release chapters for Infrastructure as Code 3rd edition

2024-09-07T10:20:20+01:00

O’Reilly has published several new chapters in the Early Release of the third edition of the Infrastructure as Code book. That’s fifteen out of twenty-one chapters now available! We’ll be adding more chapters before the end of the year, and then the final book should be available around March of 2025.

I’m now going back through the draft to tighten it up. I’ve been writing it over the last eighteen months, and I’ve found that I’ve evolved a bit in my terminology and some of the concepts. I can also see that I’ve duplicated myself in places, such as explaining the same concept in different parts of the book. So I need to make sure it all hangs together more tightly than in this draft.

I’m also revisiting the examples I use throughout the book. I haven’t been consistent in using examples, and more consistency would make some chapters easier to understand. I’m replacing the example I’ve used so far, “ClotheSpin”, with “FoodSpin”, a company that provides digital menus for restaurants. It’s a more straightforward concept and architecture, and I’m happier with the examples I’m making from it.

Here’s the current outline of the book. There are one or two places where chapters may shift or be re-titled, but the final release will be very close to what you see here.

If you want to keep up to date with news about the book, and content that I post, consider subscribing to my new infrastructure as code newsletter!

Book outline

Chapters listed in bold have a draft version available in Early Release, chapters listed in italics are still in the pipeline.

Part I: Foundations

What Is Infrastructure as Code?
Principles of Cloud Infrastructure
Platforms and Toolchains
Defining Infrastructure as Code

Part II: Design

Design Principles For Infrastructure as Code
Infrastructure Components
Design Patterns for Infrastructure Deployment Stacks
Configuring Stack Deployment Instances
Integrating Deployment Stacks
Infrastructure Code Libraries
Building Servers as Code
Designing Environments
Applications and Infrastructure

Part III: Delivery

Infrastructure Delivery Lifecycle
Testing Infrastructure Code
Infrastructure Delivery Pipelines
Building Infrastructure Code
Infrastructure Validation Stages
Deploying Infrastructure
Changing Existing Infrastructure
Teams and Workflows

Tracer Bullet Pipeline

2024-07-18T01:56:00+01:00

(Originally posted 26 November, 2012)

On my current project we’re developing an essentially green field application, albeit one that integrates a fair bit of data managed in existing systems, in conjunction with the implementation of a new hosting infrastructure which will be used for other applications once it is established. We want to have a solid Continuous Delivery Pipeline to support the team developing the application, as well as to support the development and maintenance of the infrastructure platform.

In order to get the team moving quickly, we’ve kicked this all off using what we’ve called a “tracer bullet” (or “trail marker”, for a less violent image). The idea is to get the simplest implementation of a pipeline in place, prioritizing a fully working skeleton that stretches across the full path to production over fully featured, final-design functionality for each stage of the pipeline.

Our goal is to get a “Hello World” application using our initial technology stack into a source code repository, and be able to push changes to it through the core stages of a pipeline into a placeholder production environment. This sets the stage for the design and implementation of the pipeline, infrastructure, and application itself to evolve in conjunction.

Use cases

This tracer bullet approach is clearly useful in our situation, where the application and infrastructure are both new. But it’s also very useful when starting a new application with an existing IT organization and infrastructure, since it forces everyone to come together at the start of the project to work out the process and tooling for the path to production, rather than leaving it until the end.

The tracer bullet is more difficult when creating a pipeline from scratch for an existing application and infrastructure. In these situations, both application and infrastructure may need considerable work in order to automate deployment, configuration, and testing. Even here, though, it’s probably best to take each change made and apply it to the full length of the path to production, rather than wait until the end-all be-all system has been completely implemented.

Goals

When planning and implementing the tracer bullet, we tried to keep three goals in mind as the priority for the exercise.

Get the team productive. We want the team to be routinely getting properly tested functionality into the application and in front of stakeholders for review as quickly as possible.
Prove the path to production. We want to understand the requirements, constraints, and challenges for getting our application live as early as possible. This means involving everyone who will take part in going live, and using the same infrastructure, processes, and people that will be used for going live, so that issues are surfaced and addressed.
Put the skeleton in place. We want to have the bare bones of the application, infrastructure, and the delivery pipeline in place, so that we can evolve their design and implementation based on what we learn in actually using them.

Things can and should be made simple to start out with. Throughout the software development project, changes are continuously pushed into production, multiple times every week, proving the process and identifying what needs to be added and improved. By the time the software is feature complete, there is little or no work needed to go live, other than DNS changes and publicizing the new software.

“Do’s” and “Do Not Do’s”

Do start as simply as you can

Don’t implement things that aren’t needed to get the simple, end to end pipeline in place. If you find yourself bogged down implementing some part of the tracer bullet pipeline, stop and ask yourself whether there’s something simpler you can do, coming back to that harder part once things are running. On my current project we may need a clever unattended provisioning system to frequently rebuild environments according to the PhoenixServer pattern. However, there are a number of issues around managing private keys, IP addresses, and DNS entries which make this a potential yak shave, so for our tracer bullet we’re just using the Chef knife-rackspace plugin.

Don’t take expensive shortcuts

The flip side of starting simply is not to take shortcuts which will cost you later. Each time you make a tradeoff in order to get the tracer bullet pipeline in place quickly, make sure it’s a positive tradeoff. Keep track of those tasks you’re leaving for later.

Examples of false tradeoffs are leaving out testing, basic security (e.g. leaving default vendor passwords in place), and repeatability of configuration and deployment. Often times these are things which actually make your work quicker and more assured - without automated testing, every change you make may introduce problems that will cost you days to track down later on.

It’s also often the case that things which feel like they may be a lot of work are actually quite simple for a new project. For my current project, we could have manually created our pipeline environments, but decided to make sure every server can be torn down and rebuilt from scratch using Chef cookbooks. Since our environments are very simple - stock Ubuntu and a JDK install and we’re good to go - this was actually more trivial than it would have been later on once we’ve got a more complicated platform in place.

Don’t worry too much about tool selection

Many organizations are in the habit of turning the selection of tools and technologies into complicated projects in their own right. This comes from a belief that once a tool is chosen, switching to something else will be very expensive. This is pretty clearly a self-fulfilling prophecy. Choose a reasonable set of tools to start with, ones that don’t create major barriers to getting the pipeline in place, and be ready to switch them out as you learn about how they work in the context of your project.

Do expect your design to change

Put your tracer bullet in place fully expecting that the choices you make for its architecture, technology, design, and workflow will all change. This doesn’t just apply to the pipeline, but to the infrastructure and application as well. Whatever decisions you make up front will need to be evaluated once you’ve got working software that you can test and use. Taking the attitude that these early choices will change later lowers the stakes of making those decisions, which in turn makes changing them less fraught. It’s a virtuous circle that encourages learning and adaptation.

Don’t relax the go-live constraints

It’s tempting to make it easy to get pre-live releases into the production environment, waiting until launch is close to impose the tighter restrictions required for “real” use. This is a bad idea. The sooner the real-world constraints are in place, the quicker the issues those constraints cause will become visible. Once these issues are visible, you can implement the systems, processes, and tooling to deal with those issues, ensuring that you can routinely and easily release software that is secure, compliant, and stable.

Do involve everyone from the start

Another thing often left until the end is bringing in the people who will be involved in releasing and supporting the software. This is a mistake. In siloed organizations where software design and development are done by groups separate from support, the support people have deep insight into the requirements for making the operation and use of the software reliable and cost effective.

Involving them from the start and throughout the development process is the most effective way to build supportability into the software. When release time comes, handover becomes trivial because the support team have been supporting the application through its development.

Bringing release and support teams in just before release means their requirements are introduced when the project is nearly finished, which forces a choice between delaying the release in order to fix the issues, or else releasing software which is difficult and/or expensive to support.

Doing what’s right for the project and team

The question of what to include in the tracer bullet and what to build in once the project is up and running depends on the needs of the project and the knowledge of the team. On my current project, we found it easy to get a repeatable server build in place with Chef configuration. But we did this with a number of shortcuts.

We’re using the out of the box server templates from our cloud vendor (Rackspace), even though we’ll probably want to roll our own eventually.
We started out using chef-solo (with knife-solo), even though we planned to use chef-server. This is largely due to knowledge - I’ve done a few smaller projects with knife-solo, and have some scripts and things ready to use, but haven’t used chef-server. Now that we’re migrating to chef-server I’m thinking it would have been wiser to start with the Opscode hosted chef-server. Moving from the hosted server to our own would have been easier than moving from solo to server.

Starting out with a tracer bullet approach to our pipeline has paid off. A week after starting development we have been able to demonstrate working code to our stakeholders. This in turn has made it easier to consider user testing, and perhaps even a beta release, far sooner than had originally been considered feasible.

Early Release of Infrastructure as Code 3rd edition

2024-03-12T09:01:00+00:00

I’ve been furiously typing away on the new edition of the book and now have a rough (very!) draft of the first eight chapters. You can get access to the Early Release of Infrastructure as Code 3ed on the O’Reilly Learning Platform (previously known as Safari).

The first eight chapters, of a planned 18 or so, are:

What Is Infrastructure as Code?
Principles of Cloud Infrastructure
Platforms and Toolchains
Defining Infrastructure as Code
Design Principles For Infrastructure as Code
Infrastructure Components
Design Patterns for Infrastructure Deployment Stacks
Configuring Stack Deployment Instances

I’ve updated quite a lot over the first two editions. In the earlier chapters I discuss organizational goals, and how to make sure your infrastructure strategy and architecture support them.

The chapters on design and components bring in a lot of what I’ve learned over the past four or five years about how to structure infrastructure code for delivery, sharing, and reuse.

While revising the chapter “Defining Infrastructure as Code”, where I talk about the nature of infrastructure coding languages, I came up with a model for thinking about the different lifecycle contexts of infrastructure code. It has proven useful throughout the rest of the book.

These contexts are editing code, deploying code (provisioning infrastructure), and using infrastructure resources, as shown in this diagram:

When we talk about infrastructure that we define as code, we often intermix these contexts, leading us to confuse ourselves. We also sometimes forget fundamental differences between application code and infrastructure code.

Application code executes in the runtime context, after having been deployed, while infrastructure code only executes when we deploy it. So, for example, when we write automated tests for procedural code written with Pulumi or CDK, we need to keep in mind exactly what the code is doing. The logic of our code results in a model of infrastructure to be provisioned, but doesn’t tell us how the infrastructure will behave. So we may need separate collections of tests for each context, one set being unit tests, the other testing the infrastructure that is provisioned afterwards.

Another area where this lifecycle context concept is useful is thinking about components. It’s very common to see teams try to deal with very large infrastructure projects by breaking them into code components like Terraform modules. The diagram below uses this to differentiate between a code library and a deployable infrastructure stack.

An infrastructure code library, like a Terraform module, Pulumi component resource, or CDK Level 3 construct, is useful to organize and share code. But it is only applied as part of an infrastructure stack like a Terraform project, Pulumi stack, CDK stack, or Crossplane composition.

This is why a major emphasis in my book, going back to the first edition, is designing infrastructure using separately deployable stacks as the main architectural unit.

I’m having fun working on this, and am looking forward to getting it published around the end of the year!

Infrastructure as Data

2023-08-12T09:09:00+01:00

Infrastructure as Data integrates declarative infrastructure management into a Kubernetes cluster, so you can write infrastructure code and use it with the Kubernetes ecosystem of tools and services.

ACK (AWS Controllers for Kubernetes) is a framework you can use to implement Infrastructure as Data. It exposes AWS resources as Custom Resources (CRs) in a Kubernetes cluster. This makes them available to standard services and tools in the cluster, such as the kubectl command-line tool, to provision and manage resources on the IaaS platform.

Crossplane is another Infrastructure as Data system. In addition to the ability to provision individual IaaS platform resources, Crossplane adds the capability to define and provision Compositions, which are collections of resources managed as a unit. In other words, an infrastructure stack.

Although some people describe Infrastructure as Data as an alternative to Infrastructure as Code, I’d characterize it as simply another implementation. A Kubernetes cluster with infrastructure resource CRDs leverages the Kubernetes ecosystem for infrastructure management and creates options to integrate infrastructure management with application management workflows.

One example of leveraging Kubernetes is using operators to implement control loops. Once you define infrastructure resources in your cluster and provision them on your IaaS platform, a controller ensures the provisioned resources remain synchronized with the definition.
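The control loop idea can be sketched without Kubernetes at all. The code below is an illustration under stated assumptions, not how any particular operator is implemented: the desired definitions and the actual resources are plain dictionaries, `apply_actions` stands in for calls to the IaaS platform API, and the loop is bounded, whereas a real controller runs continuously.

```python
def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare definitions with running resources and list the actions needed."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_actions(actions, desired, actual) -> None:
    """Stand-in for IaaS API calls: mutate the in-memory 'actual' state."""
    for verb, name in actions:
        if verb == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])


def control_loop(desired, actual, max_iterations: int = 3) -> None:
    """Reconcile repeatedly until actual matches desired (bounded for the sketch)."""
    for _ in range(max_iterations):
        actions = reconcile(desired, actual)
        if not actions:
            break
        apply_actions(actions, desired, actual)
```

Because the loop compares state rather than replaying a script, a resource that drifts or is deleted out of band gets corrected on the next pass.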

A particularly interesting opportunity is aligning the configuration and provisioning of infrastructure resources very closely with the applications that use them. The descriptors and tools that you use to configure and deploy an application, like a Helm chart, can reference Custom Resources (CR) for infrastructure. This way, infrastructure is provisioned, and de-provisioned, on demand along with the applications that use it.

This application-driven infrastructure provisioning model is a favorite theme of mine. Infrastructure as Data supports this by creating a separation of concerns between defining and configuring the infrastructure needed for an application and its implementation and execution. You can create a standard implementation of, for example, a secure database instance, and expose it in the cluster as a CR. Someone configuring an application deployment can specify that the application needs one of these instances and set its parameters.

Permissions needed to provision the database instance are given to the operator that is triggered by the application deployment process but not given to the application deployer. This creates a much stronger separation of permissions than would be needed for the application deployment script to implement the database provisioning. And it removes the dependency that would be needed if a separate team needed to provision the database instance.

This is an example of an empowering approach to platforms. The application team has the control to configure the database instance they need, rather than relying on someone in a separate platform or infrastructure team who doesn’t know the application’s needs as well. A central team may ensure that the database CR is implemented well and in line with governance and compliance requirements, without needing to personally implement and configure every instance used in their organization.

See I do declare! Infrastructure automation with Configuration as Data, by Kelsey Hightower and Mark Balch.

Thanks to Mohamed Abbas, Thien-An Mac, and Reinaldo de Souza for an informative conversation on the internal Thoughtworks infrastructure community chat group.

Structuring code repositories

2023-05-18T11:10:10+01:00

Given that you have multiple code projects, should you put them all in a single repository in your source control system, or spread them among more than one? If you use more than one repository, should every project have its own repository, or should you group some projects together into shared repositories? If you arrange multiple projects into repositories, how should you decide which ones to group and which ones to separate?

There are some trade-off factors to consider:

Separating projects into different repositories makes it easier to maintain boundaries at the code level.
Having multiple teams working on code in a single repository can add overhead and create conflicts.
Spreading code across multiple repositories can complicate working on changes that cross them.
Code kept in the same repository is versioned and can be branched together, which simplifies some project integration and delivery strategies.
Different source code management systems (such as Git, Perforce, and Mercurial) have different performance and scalability characteristics and features to support complex scenarios.

Let’s look at the main options for organizing projects across repositories in light of these factors.

One Repository for Everything

Some teams, and even some larger organizations, maintain a single repository with all of their code. This requires source control system software that can scale to your usage level. Some software struggles to handle a codebase as it grows in size, history, number of users, and activity level. So splitting repositories becomes a matter of managing performance.

Facebook, Google, and Microsoft all use very large repositories. All three have either made custom changes to their version control software or built their own. See Scaling version control software for more. Also see “Scaled trunk-based development” by Paul Hammant for insight on the history of Google’s approach.

A single repository can be easier to use. People can check out all of the projects they need to work on, guaranteeing they have a consistent version of everything. Some version control software offers features, like sparse-checkout, which let a user work with a subset of the repository.

Monorepo: One Repository, One Build

A single repository works well to integrate dependencies across projects at build-time. So the monorepo strategy uses build-time integration for projects maintained in a single repository. A simplistic version of monorepo builds all of the projects in the repository:

Although the projects are built together, they may produce multiple artifacts, such as application packages, infrastructure stacks, and server images.

One repository, multiple builds

Most organizations that keep all of their projects in a single repository don’t necessarily run a single build across them all. They often have a few different builds to build different subsets of their system:

Often, these builds will share some projects. For instance, two different builds may use the same shared library:

One pitfall of managing multiple projects this way is that it can blur the boundaries between projects. People may write code for one project that refers directly to files in another project in the repository. Doing this leads to tighter coupling and less visibility of dependencies. Over time, projects become tangled and hard to maintain, because a change to a file in one project can have unexpected conflicts with other projects.
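To make the shared-library case concrete, here is a minimal sketch of working out which builds a change affects. The build names, project names, and the convention that a project is a top-level directory in the repository are all assumptions made for the example.

```python
# Hypothetical mapping of builds to the projects each one includes.
BUILDS = {
    "web-build": {"web-app", "shared-lib"},
    "api-build": {"api-service", "shared-lib"},
    "infra-build": {"network-stack"},
}


def project_of(path: str) -> str:
    """Assume each project is a top-level directory in the repository."""
    return path.split("/", 1)[0]


def affected_builds(changed_files: list[str], builds: dict[str, set[str]] = BUILDS) -> set[str]:
    """Return the builds that include any project touched by the change."""
    changed_projects = {project_of(path) for path in changed_files}
    return {name for name, projects in builds.items() if projects & changed_projects}
```

Declaring each build's projects explicitly like this is also one way to keep dependencies visible, rather than leaving them implied by file references between projects.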

A Separate Repository for Each Project (Microrepo)

Having a separate repository for each project is the other extreme:

This strategy ensures a clean separation between projects, especially when you have a pipeline that builds and tests each project separately before integrating them. If someone checks out two projects and makes a change to files across projects, the pipeline will fail, exposing the problem.

Technically, you could use build-time integration across projects managed in separate repositories, by first checking out all of the builds:

But it’s more practical to build across multiple projects in a single repository because then their code is versioned together. Pushing changes for a single build to multiple repositories complicates the delivery process. The delivery stage would need some way to know which versions of all of the involved repositories to check out to create a consistent build.
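One way a delivery stage could know which versions to check out is a manifest that pins a revision for each repository. The sketch below assumes such a manifest exists; `RepoPin`, the repository URLs, and the revisions are hypothetical, and the function only renders the git commands rather than running them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoPin:
    url: str       # hypothetical repository URL
    revision: str  # commit hash or tag that belongs to this build


def checkout_commands(manifest: dict[str, RepoPin]) -> list[str]:
    """Render the git commands needed to assemble a consistent set of sources."""
    commands = []
    for name, pin in sorted(manifest.items()):
        commands.append(f"git clone {pin.url} {name}")
        commands.append(f"git -C {name} checkout {pin.revision}")
    return commands
```

The manifest itself then has to be versioned and updated with every change, which is exactly the overhead a single repository avoids.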

Single-project repositories work best when supporting delivery-time and apply-time integration. A change to any one repository triggers the delivery process for its project, bringing it together with other projects later in the flow.

Multiple Repositories with Multiple Projects

While some organizations push toward one extreme or the other — single repository for everything, or a separate repository for each project — most maintain multiple repositories with more than one project:

Often, the grouping of projects into repositories happens organically, rather than being driven by a strategy like monorepo or microrepo. However, there are a few factors that influence how smoothly things work.

One factor, as seen in the discussions of the other repository strategies, is the alignment of a project grouping with its build and delivery strategy. Keep projects in a single repository when they are closely related, especially when you integrate the projects at build time. Consider separating projects into separate repositories when their delivery paths aren’t tightly integrated.

Another factor is team ownership. Although multiple people and teams can work on different projects in the same repository, it can be distracting. Changelogs intermingle commit history from different teams with unrelated workstreams. Some organizations restrict access to code. Access control for source control systems is often managed by the repository, which is another driver for deciding which projects go where.

As mentioned for single repositories, projects within a repository more easily become tangled together with file dependencies. So teams might divide projects between repositories based on where they need stronger boundaries from an architectural and design perspective.

Unpacking Dan North’s CUPID properties for joyful coding

2022-02-23T22:20:20+00:00

Dan North has recently published his long-awaited list of CUPID properties for making software a joy to work with. Dan teased CUPID almost a year earlier in a post that declared that every single element of SOLID is wrong. CUPID is what Dan is proposing as the next level of thinking about the design of code.

CUPID is a novel approach to thinking about software design, forcing Dan to cover a fair bit of meta content before getting into CUPID itself. I found it a lot to take in because of having to stop and chew over these foundational concepts and asides. I’m writing this to help me to do this, so I can then consider how to use his ideas to develop my own thoughts on infrastructure code design. I’ll write a follow-up post to this one to go into those thoughts.

Let’s make code joyful to work with

The first novel thing Dan does with CUPID is give it the goal of making code joyful. He quotes Martin Fowler, “Good programmers write code that humans can understand,” and takes it to the next level - write code that humans enjoy reading and working with. Dan selected the CUPID properties, which we’ll eventually get to, for their value in looking at how joyful a codebase is to work with.

Using properties of a design rather than design principles

The next novel thing in Dan’s approach to CUPID is to discard the idea of defining principles for design, and instead consider properties of a codebase’s design. So we need to grok properties over principles. As Dan sees it, properties are:

qualities or characteristics of code rather than rules to follow. Properties define a goal or centre to move towards. Your code is only closer to or further from the centre, and there is always a clear direction of travel. You can use properties as a lens or filter to assess your code and you can decide which ones to address next.

What makes a property useful

If we’re going to list properties that make software joyful, we need to decide what makes a good property. So Dan next looks at the properties of properties. The properties Dan aims for with the CUPID properties are:

Practical: easy to articulate, easy to assess, easy to adopt
Human: read from the perspective of people (developers), not code
Layered: offer guidance for beginners and nuance for more experienced folks

Dan discusses these in a bit more detail, so go ahead and read them. And now we can get into CUPID itself.

The CUPID properties

Dan defines five properties and, in one of the few ways he emulates SOLID, has named them so that their initials spell the acronym for the set. He expands a bit on each one (he’s promised to write full posts on each one later on), which I’ll summarize here.

Composable: Plays well with others. Small surface area. Intention revealing. Minimal dependencies. (This plays heavily in my thinking about infrastructure code design.)
Unix philosophy: Does one thing well. A simple, consistent model. Single-purpose vs. single responsibility.
Predictable: Does what you expect. Behaves as expected. Deterministic. Observable. (Ooh, how can we design observability into our infrastructure code? Also, I should make it a habit to consider writing characterization tests for my infra code.)
Idiomatic: Feels natural. (Avoid extraneous cognitive load). Language idioms. Local idioms. (I’m thinking it’s hard to write design properties without falling into prescriptive phrasing like “Follow language idioms”.)
Domain-based: The solution domain models the problem domain in language and structure. Domain-based language. Domain-based structure. Domain-based boundaries. (Current norms for infrastructure code are quite far from this, another thing I want to think more deeply about.)

The Snowflakes as Code antipattern

2021-11-19T13:30:00+00:00

One of the earliest benefits that drew people like me to infrastructure as code was the promise of eliminating snowflake servers.

In the old times, we built servers by logging into them and running commands. We might build, update, fix, optimize, or otherwise change servers in different environments in different ways at different times. This led to configuration drift: inconsistencies across environments.

Thanks to snowflakes and configuration drift, we spent huge amounts of effort to get an application build that worked fine in the development environment to deploy and run in production.

Flash forward 10+ years, and infrastructure as code has become commonplace, helping us to manage all kinds of stuff in addition to, and often instead of, servers. You’d think snowflake infrastructure would be a thing of the past.

But it’s actually quite common to see people following practices that lead to differences between instances of infrastructure - snowflakes as code.

Antipattern: Snowflakes as code

Snowflakes as code is an antipattern where separate instances of infrastructure code are maintained for multiple instances of infrastructure that are intended to be essentially the same.

A common example is when multiple environments are provisioned as separate instances of infrastructure, each with its own separate copy of the code. These code instances are snowflakes when differences between the infrastructure instances are maintained by differences in the code.

When someone makes a change to the code for one instance, they copy or merge the change to other instances. The process for doing this is usually manual, and involves effort and care to ensure that deliberate differences between instances are maintained, while avoiding unintended differences.

This antipattern also occurs when infrastructure is replicated for different deployments of similar applications - for different customers, for example - or to deploy multiple application instances in different regions.

Motivation

Different instances of infrastructure, even ones intended to be consistent, will always need some variations between them. Resources like clusters and storage may be sized differently for a test environment than for production, for example. If nothing else, resources may need different names, such as database-test, database-staging, and database-prod.

Maintaining a separate copy of infrastructure code for each instance is an obvious way to handle these variations.

Consequences

The issue with maintaining different versions of infrastructure code for instances that are intended to be similar is that it encourages inconsistency - configuration drift. Once you accept editing code when copying or merging it between instances as a way to handle configuration, it becomes easy for larger differences to persist. For example:

I make a fix to the production infrastructure, but don’t have time to copy it back to upstream environments. The fix then clashes with changes you make in upstream environments.
I’m working on a fairly complex change in the staging environment that drags on for days, or longer. Meanwhile, you need to make a small, quick fix and take it into production. Testing in staging becomes unreliable because it doesn’t currently reflect production.
We need to define security policies differently in production than for non-production environments. We implement this with different code in each environment, and hope nobody accidentally copies the wrong file to the wrong place.

Another consequence is the likelihood of making a mistake when copying or merging changes from one instance to the next. Don’t forget to copy/replace every instance of staging to prod! Don’t forget to change the maximum node count for the database cluster from 2 to 6! Ooops!

Implementation

The two main ways people implement snowflakes as code are folders and branches.

Teams who use branches to maintain infrastructure code for each of their environments often do this because they are using GitOps. GitOps uses tools that apply code from git branches to the infrastructure, which encourages maintaining a separate branch for each environment.

It’s possible to use branches this way without them becoming snowflakes, as described below in Alternatives. But when your process for promoting code involves merging and tweaking code to maintain instance-specific differences, then you’ve got snowflakes as code.

Other teams use a folder structure to maintain separate projects for each environment. They copy and edit code between projects to make changes across environments. Again, it’s the need to edit files when copying them to a new environment that signals this antipattern.
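One rough way to spot this signal in a folder-per-environment layout is to compare the copies file by file. The sketch below is only an illustration, not a tool from any particular product: the function names and the folder layout are hypothetical, and it assumes each environment's code lives in its own directory.

```python
import hashlib
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to a SHA-256 digest of its content."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def snowflake_files(env_dirs: dict[str, Path]) -> list[str]:
    """List files that are missing from, or differ between, the environment folders.

    env_dirs maps a (hypothetical) environment name to the folder holding its copy
    of the infrastructure code.
    """
    digests = {env: file_digests(root) for env, root in env_dirs.items()}
    all_files = set().union(*(d.keys() for d in digests.values()))
    return sorted(
        name
        for name in all_files
        # A missing file shows up as None, so it also counts as a difference.
        if len({d.get(name) for d in digests.values()}) > 1
    )
```

Calling snowflake_files({"staging": Path("envs/staging"), "prod": Path("envs/prod")}) would return the files that someone has had to edit, or forgotten to copy, between those two folders. An empty list suggests the copies are identical and could be replaced by a single project.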

Alternatives

An alternative to snowflakes as code is to reuse a single instance of infrastructure code for multiple instances of the infrastructure.

You can maintain multiple versions of the code so that you can apply changes to different instances at different times, for example so you can have a pipeline to deliver changes to environments in a path to production.

But code for an existing version should never be edited. This is Continuous Delivery 101 - only make changes in the origin (for example, trunk), then copy the code, unmodified, from one environment to the next.

Using an automated process to promote infrastructure code from one instance to the next reduces the opportunity for manual errors. It also removes the opportunity to “tweak” code to maintain differences across environments, forcing better discipline.

If the need for a change is discovered in a downstream environment, the change is first made to the origin, then progressed to the downstream environment without modifications. This ensures that every code change has been put through all of the tests and approvals needed.

As mentioned earlier, there usually is a need for some variations between instances, such as resource sizing and names. These variations should be extracted into per-instance configuration values, and passed to the code when it is applied to the given instance. Chapter 7 of my book covers different patterns for doing this, including configuration files and configuration registries.
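As a minimal sketch of that idea, the Python below keeps one definition of a database cluster and passes in the values that vary. It is not the syntax of any real infrastructure tool: InstanceConfig, database_cluster, and the engine value are hypothetical, while the names and node counts echo the examples used earlier in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceConfig:
    """Per-instance values, kept outside the shared infrastructure code."""
    environment: str
    max_nodes: int


def database_cluster(config: InstanceConfig) -> dict:
    """The single, shared definition. Only the values passed in vary per instance."""
    return {
        "name": f"database-{config.environment}",
        "max_nodes": config.max_nodes,
        "engine": "postgres",  # hypothetical fixed setting, identical everywhere
    }


# Hypothetical configuration values. These could equally be loaded from
# per-environment configuration files or a configuration registry.
CONFIGS = {
    "test": InstanceConfig(environment="test", max_nodes=2),
    "staging": InstanceConfig(environment="staging", max_nodes=2),
    "prod": InstanceConfig(environment="prod", max_nodes=6),
}

definitions = {env: database_cluster(cfg) for env, cfg in CONFIGS.items()}
```

Here the difference between staging and production is a pair of values rather than an edited copy of the code, so promoting a change never involves remembering to replace staging with prod or 2 with 6.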

Many teams follow the common development pipeline pattern of having a build stage that bundles the infrastructure code into a versioned artifact, storing it in a repository, and using that to ensure consistency of code from one environment to the next. A simple version of this pattern can be implemented using tarballs and centralized storage like an S3 bucket.
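The build-and-promote flow might look something like the sketch below. To keep it self-contained it uses local directories as stand-ins for the artifact repository and for each environment's storage (an assumption; a real pipeline would more likely upload to something like an S3 bucket), and the function names and the infra-<version> naming scheme are made up for illustration.

```python
import hashlib
import shutil
import tarfile
from pathlib import Path


def build_artifact(source_dir: Path, version: str, out_dir: Path) -> Path:
    """Build stage: bundle the infrastructure code into a versioned tarball."""
    out_dir.mkdir(parents=True, exist_ok=True)
    artifact = out_dir / f"infra-{version}.tar.gz"
    with tarfile.open(artifact, "w:gz") as tar:
        tar.add(source_dir, arcname=f"infra-{version}")
    return artifact


def sha256(path: Path) -> str:
    """Digest recorded at build time and checked again at every promotion."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def promote(artifact: Path, expected_digest: str, env_store: Path) -> Path:
    """Copy the artifact, unmodified, to the next environment's storage."""
    if sha256(artifact) != expected_digest:
        raise ValueError(f"{artifact.name} has changed since it was built")
    env_store.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(artifact, env_store))
```

The checksum is the part that matters here: each environment receives exactly the bytes that passed the previous stage, so there is no step at which the code can be tweaked on its way to production.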

Tools like Terraform support multiple instances of infrastructure with different versions of the same code using workspaces.

Why your team doesn’t need to use pull requests

2021-01-02T11:57:00+00:00

GitHub introduced the pull request practice, and features to support it, to make it easier for people who run open-source projects to accept contributions from outside their group of trusted committers.

Committers are trusted to make changes to the codebase routinely. But a change from a random outsider needs to be assessed to make sure it works, doesn’t take the project in an unwanted direction, and meets the standards for style and quality. The outsider packages their proposed change as a pull request, which a committer can easily review and manage as a unit before merging it into the codebase.

Figure 1: Pull request process

Although designed to make it easier to accept contributions from untrusted people outside a team, many teams now use pull requests for people inside their own team. This practice has become so common that many people consider it a default, “best” practice. Some people assume there is no other way to make sure code is reviewed because they’ve never seen anything else.

However, pull requests sacrifice performance, including both delivery time and quality. This is a sacrifice worth making to manage the risk of accepting changes from unknown people. An outsider may not understand the vision and direction of your project. They may not have the same habits and norms for testing, code quality, and style. However, your own team members should share these norms.

Using pull requests for code changes by your own team members is like having your family members go through an airport security checkpoint to enter your home. It’s a costly solution to a different problem.

Using Continuous Integration rather than pull requests

A software delivery process should optimize for flow and quality. Keep the lead time for changes low, and give fast feedback when a change introduces a problem. This is the idea that underpins Continuous Integration (CI). CI is the practice of continuously merging and testing everyone’s code as they work on it.

Figure 2: Continuous Integration process

“As they work on it” is essential. As a team member, you don’t wait until you have finished a feature or story to integrate your code to the mainline. Instead, you frequently - at least once a day - put your code into a healthy state that passes tests and integrate it into the mainline with everyone else’s current work. (Also see Martin Fowler’s article on branching patterns and Paul Hammant’s trunk-based development site.)

A CI build job automatically tests the project’s mainline every time you push a change. This means you find out immediately if what you’re doing clashes with something another person is working on before either of you has invested too much time. It sucks to think you’ve finished a story or feature, only to discover you’ve got to go back and untangle and redo several days of effort.

Figure 3: Tests run on integrated code on every push

The trouble with pull requests

A pull request introduces a delay to integration. When you complete work that you consider ready to integrate with the rest of the team, you create a pull request and wait for someone to review it. Only after someone else reviews the change do they integrate it with the mainline.

If team members are quick to review and integrate pull requests, this is only slightly slower than CI. Maybe they respond and review your change within 30 minutes every time you push. Your code change is integrated with the mainline and automated tests run against it. So you may discover a clash with someone else’s work after 30-40 minutes or so.

Figure 4: Delays in feedback with pull requests versus CI

In practice, not many teams reliably turn pull requests around in under 30 minutes. While waiting for someone to review your change, you may switch to another task or start working on a new change. When you find out there was a problem, you need to switch gears back to the original change, disrupting your flow of work.

An effective CI build, on the other hand, should finish testing your integrated code within a few minutes after you push it - up to 10 minutes in our scenario. You discover that clash almost immediately, so you can investigate and fix it while it’s fresh in your mind.
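The arithmetic behind this comparison is simple enough to write down. The snippet below uses the numbers from the scenario above (a 30 minute wait for review, and up to 10 minutes for the build and tests). The function name is just for illustration, and the four-hour review wait is a hypothetical figure for a slower team.

```python
def minutes_to_feedback(test_minutes: int, review_wait_minutes: int = 0) -> int:
    """Time from pushing a change to seeing test results on integrated code."""
    return review_wait_minutes + test_minutes


# Continuous Integration: tests run against the mainline as soon as you push.
ci_feedback = minutes_to_feedback(test_minutes=10)  # 10

# Pull request with a prompt reviewer, as in the scenario above.
pr_feedback = minutes_to_feedback(test_minutes=10, review_wait_minutes=30)  # 40

# Hypothetical slower team: the review waits four hours.
slow_pr_feedback = minutes_to_feedback(test_minutes=10, review_wait_minutes=240)  # 250
```

Even with a reviewer who always responds within half an hour, feedback on integrated code takes four times as long as with CI, and the gap grows with every extra minute a change waits for review.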

You don’t need to interrupt someone else’s work to ask them to review it before you get the feedback from testing fully integrated code. As I’ll explain shortly, you may still have someone review your changes. But you can take advantage of a faster cycle time to commit, integrate, and test your code to make multiple changes before asking them to review.

Even if everyone in the team turns pull requests around quickly, the typical practice is to wait until completing work on a feature or story before integrating a pull request with the mainline. Most teams take longer than a day, on average, to develop a story. So a typical pull request process doesn’t meet the minimum requirement of Continuous Integration to integrate everyone’s work at least daily.

Working in a rhythm of coding, pulling, testing, pushing, and getting feedback from integrated tests several times a day is electrifying. And it isn’t possible with pull requests that introduce a human delay into the rhythm.

Better ways to review code changes

When the topic of CI versus pull requests comes up, someone inevitably defends pull requests as necessary to get feedback from other team members on changes.

It is essential to have a second pair of eyes (if not more) looking at code changes. Humans catch problems that tests don’t, especially problems related to maintainability and sound design. Having people review each other’s code also helps the team converge on norms for coding style, programming idioms, and quality expectations. And in some cases, such as regulated environments, having each change reviewed by a second person is required.

However, the recent popularity of pull requests seems to have resulted in some people assuming there are no other ways to review code changes. Here are a few practices that you can use instead, without interrupting the Continuous Integration feedback cycle. Keep in mind that it’s entirely possible to combine more than one of these as appropriate.

Figure 5: Pairing for immediate, continuous code review

Pair programming: No form of code review is more effective than pairing. Feedback is immediate, so there is a far higher chance you will use it to make improvements. If someone tells you as you write some code that there’s a better way, you can stop, learn, and write it in that better way, right then. If someone tells you a day later, you might take it on board for future reference. But it needs to be a serious problem to get you to stop your current work to go back and redo something you’ve already finished.

Periodic reviews: If a review is not explicitly required for compliance, it may not need to be a gate for each code change. You might have regular, scheduled reviews, for example weekly, where people check through code changes since the last review. This can be especially potent as a group exercise since it creates conversations that help people learn and shape the team’s norms for coding.

Pipeline approvals: If your team uses a Continuous Delivery pipeline to deliver changes to production, you can include a stage that requires someone to authorize the change to progress. This is conceptually similar to a pull request in that it is a gate in the delivery process, but you place the gate after code integration and automated tests. Doing this means that a human only spends time reviewing code that has already been proven technically correct.

Figure 6: Review changes after they are integrated and tested
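To make the ordering of the gate concrete, here is a small sketch of a pipeline with an approval stage. It is a toy model rather than the configuration of any real pipeline tool: Change, run_pipeline, and the two callables standing in for the test stage and the human approver are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Change:
    commit: str
    approved_by: Optional[str] = None
    log: list = field(default_factory=list)


def run_pipeline(
    change: Change,
    run_tests: Callable[[Change], bool],
    approver: Callable[[Change], Optional[str]],
) -> bool:
    """Integrate and test first; only then ask a human to authorize the change."""
    change.log.append("integrated")  # the code is already on the mainline
    if not run_tests(change):
        change.log.append("tests failed")
        return False  # no reviewer time is spent on a failing change
    change.log.append("tests passed")
    change.approved_by = approver(change)  # returns a name, or None if not approved
    if change.approved_by is None:
        change.log.append("awaiting approval")
        return False
    change.log.append(f"approved by {change.approved_by}")
    change.log.append("deployed")
    return True
```

Because the approver is only called after the tests pass, the author has already had the automated feedback by the time anyone is asked to look at the change, and the reviewer never sees code that fails the build.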

Conclusion

Pull requests differ from Continuous Integration in having a human review a code change after writing it but before integrating it with the mainline. This creates a delay in getting feedback from automated tests against fully integrated code.

With Continuous Integration, code is either reviewed as it is written (pairing), or after it is integrated and tested. Optimizing the loop for integrating and testing changes means you can run this loop more frequently. A more frequent coding and integration loop encourages developers to make smaller and more frequent commits, which improves quality and flow.