Early Release of Infrastructure as Code 3rd edition

I’ve been furiously typing away on the new edition of the book and now have a rough (very!) draft of the first eight chapters. You can get access to the Early Release of Infrastructure as Code 3ed on the O’Reilly Learning Platform (previously known as Safari).

The first eight chapters, of a planned 18 or so, are:

  1. What Is Infrastructure as Code?
  2. Principles of Cloud Infrastructure
  3. Platforms and Toolchains
  4. Defining Infrastructure as Code
  5. Design Principles For Infrastructure as Code
  6. Infrastructure Components
  7. Design Patterns for Infrastructure Deployment Stacks
  8. Configuring Stack Deployment Instances

I’ve updated quite a lot from the first two editions. In the earlier chapters I discuss organizational goals, and how to make sure your infrastructure strategy and architecture support them.

The chapters on design and components bring in a lot of what I’ve learned over the past four or five years about how to structure infrastructure code for delivery, sharing, and reuse.

While revising the chapter “Defining Infrastructure as Code”, which is where I talk about the nature of infrastructure coding languages, I came up with a model for thinking about the different lifecycle contexts of infrastructure code that has proven useful throughout the rest of the book.

These contexts are editing code, deploying code (provisioning infrastructure), and using infrastructure resources, as shown in this diagram:

Infrastructure lifecycle contexts

When we talk about infrastructure that we define as code, we often intermix these contexts, leading us to confuse ourselves. We also sometimes forget fundamental differences between application code and infrastructure code.

Application code executes in the runtime context, after having been deployed, while infrastructure code only executes when we deploy it. So, for example, when we write automated tests for procedural code written with Pulumi or CDK, we need to keep in mind exactly what the code is doing. The logic of our code results in a model of infrastructure to be provisioned, but doesn’t tell us how the infrastructure will behave. So we may need separate collections of tests for each context: one set of unit tests that check the model the code produces, and another that tests the infrastructure after it has been provisioned.
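
As a small illustration of the first kind of test, here is a minimal sketch assuming a Pulumi project written in TypeScript. The mock setup stands in for the IaaS platform, so any assertions can only check the infrastructure model the code declares, not how provisioned resources behave.

```typescript
import * as pulumi from "@pulumi/pulumi";

// Unit tests run in the editing/deploying context: Pulumi's mocks stand in
// for the IaaS platform, so the program's logic runs without provisioning
// anything real.
pulumi.runtime.setMocks({
    newResource: (args: pulumi.runtime.MockResourceArgs) => ({
        id: `${args.name}-test-id`,
        state: args.inputs,
    }),
    call: (args: pulumi.runtime.MockCallArgs) => args.inputs,
});

// With the mocks in place, a test imports the stack's code and asserts on the
// properties it declares, for example that a bucket is declared with
// versioning enabled. Whether the provisioned bucket actually behaves that way
// needs a second suite of tests, run against the real infrastructure after
// deployment.
```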

Another area where this lifecycle context concept is useful is thinking about components. It’s very common to see teams try to deal with very large infrastructure projects by breaking them into code components like Terraform modules. The diagram below uses these contexts to differentiate between a code library and a deployable infrastructure stack.

Infrastructure components in context

An infrastructure code library, like a Terraform module, Pulumi component resource, or CDK Level 3 construct, is useful to organize and share code. But it is only applied as part of an infrastructure stack like a Terraform project, Pulumi stack, CDK stack, or Crossplane composition.
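
As a sketch of the difference, here is a hypothetical Pulumi component resource in TypeScript. The class is a shareable code library; nothing is provisioned until a stack’s program instantiates it and the stack is deployed. The names (StaticSite, example:web) are made up for illustration.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// A code library: a component resource bundling related resources behind one
// interface. On its own, this class provisions nothing.
export class StaticSite extends pulumi.ComponentResource {
    public readonly bucketName: pulumi.Output<string>;

    constructor(name: string, opts?: pulumi.ComponentResourceOptions) {
        super("example:web:StaticSite", name, {}, opts);
        const content = new aws.s3.Bucket(`${name}-content`, {}, { parent: this });
        this.bucketName = content.bucket;
        this.registerOutputs({ bucketName: this.bucketName });
    }
}

// The deployable unit is a stack whose program applies the library:
//   const site = new StaticSite("marketing-site");
```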

This is why a major emphasis in my book, going back to the first edition, is designing infrastructure using separately deployable stacks as the main architectural unit.

I’m having fun working on this, and am looking forward to getting it published around the end of the year!


Infrastructure as Data

Infrastructure as Data integrates declarative infrastructure management into a Kubernetes cluster, so you can write infrastructure code and use it with the Kubernetes ecosystem of tools and services.

ACK (AWS Controllers for Kubernetes) is a framework you can use to implement Infrastructure as Data. It exposes AWS resources as Custom Resources (CRs) in a Kubernetes cluster. This makes them available to standard services and tools in the cluster, such as the kubectl command-line tool, to provision and manage resources on the IaaS platform.

Crossplane is another Infrastructure as Data system. In addition to the ability to provision individual IaaS platform resources, Crossplane adds the capability to define and provision Compositions, which are collections of resources managed as a unit. In other words, an infrastructure stack.

Although some people describe Infrastructure as Data as an alternative to Infrastructure as Code, I’d characterize it as simply another implementation. A Kubernetes cluster with infrastructure resource CRDs leverages the Kubernetes ecosystem for infrastructure management and creates options to integrate infrastructure management with application management workflows.

One example of leveraging Kubernetes is using operators to implement control loops. Once you define infrastructure resources in your cluster and provision them on your IaaS platform, a controller ensures the provisioned resources remain synchronized with the definition.
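
A minimal sketch of that control loop, with a hypothetical DatabaseClient interface standing in for calls to the IaaS platform’s API; real controllers (for example Crossplane providers) implement the same pattern.

```typescript
// Hypothetical types standing in for a CRD's spec and the IaaS platform's API.
interface DatabaseSpec { size: string; engineVersion: string; }

interface DatabaseClient {
    read(name: string): Promise<DatabaseSpec | undefined>;
    create(name: string, spec: DatabaseSpec): Promise<void>;
    update(name: string, spec: DatabaseSpec): Promise<void>;
}

// Called when the custom resource changes, and again periodically, so any
// drift in the provisioned resource is corrected back to the definition.
async function reconcile(client: DatabaseClient, name: string, desired: DatabaseSpec) {
    const actual = await client.read(name);
    if (actual === undefined) {
        await client.create(name, desired);
    } else if (JSON.stringify(actual) !== JSON.stringify(desired)) {
        await client.update(name, desired);
    }
}
```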

A particularly interesting opportunity is aligning the configuration and provisioning of infrastructure resources very closely with the applications that use them. The descriptors and tools that you use to configure and deploy an application, like a Helm chart, can reference Custom Resources (CRs) for infrastructure. This way, infrastructure is provisioned, and de-provisioned, on demand along with the applications that use it.

This application-driven infrastructure provisioning model is a favorite theme of mine. Infrastructure as Data supports it by creating a separation of concerns between defining and configuring the infrastructure an application needs, and implementing and running that infrastructure. You can create a standard implementation of, for example, a secure database instance, and expose it in the cluster as a CR. Someone configuring an application deployment can specify that the application needs one of these instances and set its parameters.
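
For illustration, the claim an application deployment makes might look something like this. I’ve sketched it with Pulumi’s Kubernetes provider in TypeScript, though a Helm chart or plain manifest works the same way, and the PostgresInstance kind and its fields are hypothetical stand-ins for whatever CRD the platform team exposes.

```typescript
import * as k8s from "@pulumi/kubernetes";

// The application deployment declares that it needs a database by creating a
// custom resource. The operator watching this kind of resource provisions
// (and later de-provisions) the actual instance on the IaaS platform.
const ordersDb = new k8s.apiextensions.CustomResource("orders-db", {
    apiVersion: "platform.example.org/v1alpha1", // hypothetical CRD group/version
    kind: "PostgresInstance",                    // hypothetical kind
    metadata: { namespace: "orders" },
    spec: {
        size: "small",
        highAvailability: false,
    },
});
```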

Permissions needed to provision the database instance are given to the operator that is triggered by the application deployment process, not to the application deployer. This creates a much stronger separation of permissions than if the application deployment script implemented the database provisioning itself. And it removes the dependency that would exist if a separate team had to provision the database instance.

This is an example of an empowering approach to platforms. The application team has the control to configure the database instance they need, rather than relying on someone in a separate platform or infrastructure team who doesn’t know the application’s needs as well. A central team may ensure that the database CR is implemented well and in line with governance and compliance requirements, without needing to personally implement and configure every instance used in their organization.

See I do declare! Infrastructure automation with Configuration as Data, by Kelsey Hightower and Mark Balch.

Thanks to Mohamed Abbas, Thien-An Mac, and Reinaldo de Souza for an informative conversation on the internal Thoughtworks infrastructure community chat group.


Structuring code repositories

Given that you have multiple code projects, should you put them all in a single repository in your source control system, or spread them among more than one? If you use more than one repository, should every project have its own repository, or should you group some projects together into shared repositories? If you arrange multiple projects into repositories, how should you decide which ones to group and which ones to separate?

There are some trade-off factors to consider:

  • Separating projects into different repositories makes it easier to maintain boundaries at the code level.
  • Having multiple teams working on code in a single repository can add overhead and create conflicts.
  • Spreading code across multiple repositories can complicate working on changes that cross them.
  • Code kept in the same repository is versioned and can be branched together, which simplifies some project integration and delivery strategies.
  • Different source code management systems (such as Git, Perforce, and Mercurial) have different performance and scalability characteristics and features to support complex scenarios.

Let’s look at the main options for organizing projects across repositories in light of these factors.

One Repository for Everything

Some teams, and even some larger organizations, maintain a single repository with all of their code. This requires source control system software that can scale to your usage level. Some software struggles to handle a codebase as it grows in size, history, number of users, and activity level. So splitting repositories becomes a matter of managing performance.

Facebook, Google, and Microsoft all use very large repositories. All three have either made custom changes to their version control software or built their own. See Scaling version control software for more. Also see “Scaled trunk-based development” by Paul Hammant for insight on the history of Google’s approach.

A single repository can be easier to use. People can check out all of the projects they need to work on, guaranteeing they have a consistent version of everything. Some version control software offers features, like sparse-checkout, which let a user work with a subset of the repository.

Monorepo: One Repository, One Build

A single repository works well to integrate dependencies across projects at build-time. So the monorepo strategy uses build-time integration for projects maintained in a single repository. A simplistic version of monorepo builds all of the projects in the repository:

Building all projects in a repository

Although the projects are built together, they may produce multiple artifacts, such as application packages, infrastructure stacks, and server images.

One repository, multiple builds

Most organizations that keep all of their projects in a single repository don’t necessarily run a single build across them all. They often have a few different builds to build different subsets of their system:

Building different combinations of projects from one repository

Often, these builds will share some projects. For instance, two different builds may use the same shared library:

Sharing a component across builds in a single repository

One pitfall of managing multiple projects this way is that it can blur the boundaries between projects. People may write code for one project that refers directly to files in another project in the repository. Doing this leads to tighter coupling and less visibility of dependencies. Over time, projects become tangled and hard to maintain, because a change to a file in one project can have unexpected conflicts with other projects.

A Separate Repository for Each Project (Microrepo)

Having a separate repository for each project is the other extreme:

Each project in a separate repository

This strategy ensures a clean separation between projects, especially when you have a pipeline that builds and tests each project separately before integrating them. If someone checks out two projects and makes a change to files across projects, the pipeline will fail, exposing the problem.

Technically, you could use build-time integration across projects managed in separate repositories, by first checking out all of the relevant repositories:

A single build across multiple repositories

But it’s more practical to build across multiple projects in a single repository because then their code is versioned together. Pushing changes for a single build to multiple repositories complicates the delivery process. The delivery stage would need some way to know which versions of all of the involved repositories to check out to create a consistent build.

Single-project repositories work best when supporting delivery-time and apply-time integration. A change to any one repository triggers the delivery process for its project, bringing it together with other projects later in the flow.

Multiple Repositories with Multiple Projects

While some organizations push toward one extreme or the other — single repository for everything, or a separate repository for each project — most maintain multiple repositories with more than one project:

Multiple repositories with multiple projects

Often, the grouping of projects into repositories happens organically, rather than being driven by a strategy like monorepo or microrepo. However, there are a few factors that influence how smoothly things work.

One factor, as seen in the discussions of the other repository strategies, is the alignment of a project grouping with its build and delivery strategy. Keep projects in a single repository when they are closely related, especially when you integrate the projects at build time. Consider separating projects into separate repositories when their delivery paths aren’t tightly integrated.

Another factor is team ownership. Although multiple people and teams can work on different projects in the same repository, it can be distracting. Changelogs intermingle commit history from different teams with unrelated workstreams. Some organizations restrict access to code. Access control for source control systems is often managed by the repository, which is another driver for deciding which projects go where.

As mentioned for single repositories, projects within a repository more easily become tangled together with file dependencies. So teams might divide projects between repositories based on where they need stronger boundaries from an architectural and design perspective.


Unpacking Dan North's CUPID properties for joyful coding

Dan North has recently published his long-awaited list of CUPID properties for making software a joy to work with. Dan teased CUPID almost a year earlier in a post that declared that every single element of SOLID is wrong. CUPID is what Dan is proposing as the next level of thinking about the design of code.

The sculpture Cupid by Bertel Thorvaldsen

CUPID is a novel approach to thinking about software design, which forces Dan to cover a fair bit of meta content before getting into CUPID itself. I found it a lot to take in, because I had to stop and chew over these foundational concepts and asides. I’m writing this summary to help me digest them, so I can then consider how to use his ideas to develop my own thoughts on infrastructure code design. I’ll write a follow-up post to go into those thoughts.

Let’s make code joyful to work with

The first novel thing Dan does with CUPID is give it the goal of making code joyful. He quotes Martin Fowler, “Good programmers write code that humans can understand,” and takes it to the next level - write code that humans enjoy reading and working with. Dan selected the CUPID properties, which we’ll eventually get to, for their value in looking at how joyful a codebase is to work with.

Using properties of a design rather than design principles

The next novel thing in Dan’s approach to CUPID is to discard the idea of defining principles for design, and instead consider properties of a codebase’s design. So we need to grok properties over principles. As Dan sees it, properties are:

qualities or characteristics of code rather than rules to follow. Properties define a goal or centre to move towards. Your code is only closer to or further from the centre, and there is always a clear direction of travel. You can use properties as a lens or filter to assess your code and you can decide which ones to address next.

What makes a property useful

If we’re going to list properties that make software joyful, we need to decide what makes a good property. So Dan next looks at the properties of properties. The properties Dan aims for with the CUPID properties are:

  • Practical: easy to articulate, easy to assess, easy to adopt.
  • Human: read from the perspective of people (developers), not code
  • Layered: offer guidance for beginners and nuance for more experienced folks

Dan discusses these in a bit more detail, so go ahead and read them. And now we can get into CUPID itself.

The CUPID properties

Dan defines five properties which, in one of the few ways he emulates SOLID, are named so their initials make up the acronym for the set. He expands a bit on each one (he’s promised to write full posts on each later on), which I’ll summarize here.

  • Composable: Plays well with others. Small surface area. Intention revealing. Minimal dependencies. (This plays heavily in my thinking about infrastructure code design.)
  • Unix philosophy: Does one thing well. A simple, consistent model. Single-purpose vs. single responsibility.
  • Predictable: Does what you expect. Behaves as expected. Deterministic. Observable. (Ooh, how can we design observability into our infrastructure code? Also, I should make it a habit to consider writing characterization tests for my infra code.)
  • Idiomatic: Feels natural. (Avoid extraneous cognitive load). Language idioms. Local idioms. (I’m thinking it’s hard to write design properties without falling into prescriptive phrasing like “Follow language idioms”.)
  • Domain-based: The solution domain models the problem domain in language and structure. Domain based language. Domain based structure. Domain based boundaries. (Current norms for infrastructure code are quite far from this, another thing I want to think more deeply about.)

The Snowflakes as Code antipattern

One of the earliest benefits that drew people like me to infrastructure as code was the promise of eliminating snowflake servers.

In the old times, we built servers by logging into them and running commands. We might build, update, fix, optimize, or otherwise change servers in different environments in different ways at different times. This led to configuration drift: inconsistencies across environments.

Thanks to snowflakes and configuration drift, we spent huge amounts of effort to get an application build that worked fine in the development environment to deploy and run in production.

Flash forward 10+ years, and infrastructure as code has become commonplace, helping us to manage all kinds of stuff in addition to, and often instead of, servers. You’d think snowflake infrastructure would be a thing of the past.

But it’s actually quite common to see people following practices that lead to differences between instances of infrastructure - snowflakes as code.

Antipattern: Snowflakes as code

Snowflakes as code is an antipattern where separate instances of infrastructure code are maintained for multiple instances of infrastructure that are intended to be essentially the same.

Multiple environment infrastructure instances, each with its own set of code

A common example is when multiple environments are provisioned as separate instances of infrastructure, each with its own separate copy of the code. These code instances are snowflakes when differences between the infrastructure instances are maintained by differences in the code.

When someone makes a change to the code for one instance, they copy or merge the change to other instances. The process for doing this is usually manual, and involves effort and care to ensure that deliberate differences between instances are maintained, while avoiding unintended differences.

This antipattern also occurs when infrastructure is replicated for different deployments of similar applications - for different customers, for example - or to deploy multiple application instances in different regions.

Motivation

Different instances of infrastructure, even ones intended to be consistent, will always need some variations between them. Resources like clusters and storage may be sized differently for a test environment than for production, for example. If nothing else, resources may need different names, such as database-test, database-staging, and database-prod.

Maintaining a separate copy of infrastructure code for each instance is an obvious way to handle these variations.

Consequences

The issue with maintaining different versions of infrastructure code for instances that are intended to be similar is that it encourages inconsistency - configuration drift. Once you accept editing code when copying or merging it between instances as a way to handle configuration, it becomes easy for larger differences to persist. For example:

  • I make a fix to the production infrastructure, but don’t have time to copy it back to upstream environments. The fix then clashes with changes you make in upstream environments.
  • I’m working on a fairly complex change in the staging environment that drags on for days, or longer. Meanwhile, you need to make a small, quick fix and take it into production. Testing in staging becomes unreliable because it doesn’t currently reflect production.
  • We need to define security policies differently in production than for non-production environments. We implement this with different code in each environment, and hope nobody accidentally copies the wrong file to the wrong place.

Another consequence is the likelihood of making a mistake when copying or merging changes from one instance to the next. Don’t forget to copy/replace every instance of staging to prod! Don’t forget to change the maximum node count for the database cluster from 2 to 6! Ooops!

Implementation

The two main ways people implement snowflakes as code are folders and branches.

Environment folders and environment branches

Teams who use branches to maintain infrastructure code for each of their environments often do this because they are using GitOps. GitOps tools apply code from git branches to the infrastructure, which encourages maintaining a separate branch for each environment.

It’s possible to use branches this way without them becoming snowflakes, as described below in Alternatives. But when your process for promoting code involves merging and tweaking code to maintain instance-specific differences, then you’ve got snowflakes as code.

Other teams use a folder structure to maintain separate projects for each environment. They copy and edit code between projects to make changes across environments. Again, it’s the need to edit files when copying them to a new environment that signals this antipattern.

Alternatives

An alternative to snowflakes as code is to reuse a single instance of infrastructure code for multiple instances of the infrastructure.

You can maintain multiple versions of the code so that changes can be applied to different instances at different times, for example using a pipeline to deliver changes to each environment on the path to production.

But code for an existing version should never be edited. This is Continuous Delivery 101 - only make changes in the origin (for example, trunk), then copy the code, unmodified, from one environment to the next.

Using an automated process to promote infrastructure code from one instance to the next reduces the opportunity for manual errors. It also removes the opportunity to “tweak” code to maintain differences across environments, forcing better discipline.

If the need for a change is discovered in a downstream environment, the change is first made to the origin, then progressed to the downstream environment without modifications. This ensures that every code change has been put through all of the tests and approvals needed.

As mentioned earlier, there usually is a need for some variations between instances, such as resource sizing and names. These variations should be extracted into per-instance configuration values, and passed to the code when it is applied to the given instance. Chapter 7 of my book covers different patterns for doing this, including configuration files and configuration registries.

Separating infrastructure code and per-instance configuration
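
As a small sketch of this separation, using Pulumi stack configuration in TypeScript (Terraform variable files or a configuration registry would serve the same purpose); the configuration keys and values here are made up for illustration.

```typescript
import * as pulumi from "@pulumi/pulumi";

// Per-instance values live in stack configuration files (for example
// Pulumi.staging.yaml and Pulumi.prod.yaml), never in edited copies of the code.
const config = new pulumi.Config();
const environment = config.require("environment");        // "test", "staging", "prod"
const maxNodeCount = config.requireNumber("maxNodeCount"); // e.g. 2 for test, 6 for prod

// The same code is applied to every instance; only the configuration differs.
export const databaseName = `database-${environment}`;
export const clusterMaxNodes = maxNodeCount;
```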

Many teams follow the common development pipeline pattern of having a build stage that bundles the infrastructure code into a versioned artifact, storing it in a repository, and using that to ensure consistency of code from one environment to the next. A simple version of this pattern can be implemented using tarballs and centralized storage such as an S3 bucket.

Tools like Terraform support multiple instances of infrastructure with different versions of the same code using workspaces.