Many engineering leaders I talk to are frustrated with their infrastructure automation. Adopting the cloud and Infrastructure as Code was supposed to speed up software delivery by providing infrastructure on tap. Instead, development teams seem to be constantly blocked waiting for environments to be built, updated, modified, extended, and fixed. There never seem to be enough environments, and yet cloud costs spiral. Messy and fragile environments are a bottleneck for developing and releasing changes. Infrastructure and platform teams are overstretched while technical debt piles up.

When my colleagues and I carry out software delivery effectiveness assessments, we often find time and effort wasted providing teams with the infrastructure they need. Value stream mapping shows teams losing time waiting for infrastructure work to be implemented, going back and forth to clarify needs and iterate as needs change during development, and troubleshooting and improving systems once they’ve gone live.

This is the first of several posts I’m working on to explore the question of how to improve software delivery effectiveness by empowering software delivery teams to provide infrastructure for themselves.

Infrastructure as Code is the dominant paradigm for automating infrastructure management today. But it creates plenty of pain points, especially for software teams that would like to manage their own infrastructure. A number of potential solutions that have emerged over the past few years, and new ones emerging now, focus on removing these pain points by improving the user experience of working with dynamic infrastructure.

Can software delivery be made more effective by giving software developers a friendlier, more effective interface for defining and provisioning infrastructure for themselves? Three potential approaches are offering more developer-friendly languages for writing infrastructure code, giving developers GenAI chatbots or agents to provision infrastructure for them, and replacing infrastructure code with dynamic and extensible models or graphs.

Developer-friendly languages

Traditional Infrastructure as Code tools like Terraform, Ansible, and CloudFormation provide special-purpose declarative DSLs (Domain-Specific Languages) as an interface for defining the infrastructure you want. The tool generates a model of the desired state, compares it with reality, and then executes the changes needed to make reality match the intention.
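The generate, compare, and execute cycle can be sketched in a few lines of Python. This is a toy illustration rather than how any particular tool is implemented; the resource names, the dictionary-based state format, and the `plan` function are all assumptions made for the example.

```python
# Toy sketch of a desired-state comparison. The resource names and the
# dict-based state format are hypothetical, not any real tool's model.

def plan(desired: dict, current: dict) -> list:
    """Return the (action, name) steps needed to make current match desired."""
    steps = []
    for name, config in desired.items():
        if name not in current:
            steps.append(("create", name))
        elif current[name] != config:
            steps.append(("update", name))
    for name in current:
        if name not in desired:
            steps.append(("delete", name))
    return steps


desired = {
    "app-subnet": {"cidr": "10.0.1.0/24"},
    "app-bucket": {"versioning": True},
}
current = {
    "app-bucket": {"versioning": False},
    "old-queue": {"fifo": False},
}

for action, name in plan(desired, current):
    print(action, name)
# create app-subnet
# update app-bucket
# delete old-queue
```

A real tool would then call the cloud provider's API for each step. The point here is only that the user describes the end state and the tool works out the steps.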

Many software developers I know find building infrastructure with these tools painful. They believe that alternative tools like Pulumi and AWS CDK that let them use general-purpose, imperative languages will remove the barriers for developers to manage infrastructure without relying on a separate team of infrastructure specialists.

General-purpose programming languages are more powerful than declarative languages for some areas of infrastructure automation, especially building component libraries. But they don’t really make low-level infrastructure coding more accessible to software developers. If you don’t have a solid understanding of networking, it isn’t any easier to wire together subnets, routing tables, firewall rules, gateways, and load balancers into safe and useful application connectivity in Python than it is with Terraform’s HCL. And TypeScript doesn’t save you from needing to understand the tradeoffs among the many S3 bucket configuration options for whichever of the dozens of different purposes you might be using a bucket for.

Some infrastructure experts are more comfortable using a general-purpose language. And imperative languages are more appropriate than declarative languages for building component libraries. But these languages don’t reduce the effort or expertise needed to manage infrastructure for software delivery teams, so they don’t resolve the problem of making an organization’s software delivery process more effective.

Infrastructure from (application) Code

Some tools and platforms take the idea of developers writing infrastructure code a step further by making it possible to embed infrastructure code with application code. Platforms like Winglang and Darklang introduce new languages for coding applications and infrastructure together. Other solutions, like Ampt, Nitric, and Shuttle, introduce SDKs or annotation support for defining the required infrastructure in existing languages like Go and Rust.

The idea behind this approach is that a developer can declare the infrastructure required at the point where it is used. The code that reads and writes entries in a database specifies how to provision and configure the database. Code that writes messages to a queue creates the queue. An application declares the details for handling inbound network requests along with the business logic for handling those requests.
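To make the idea concrete, here is a minimal sketch of declaring infrastructure at the point of use. It is not based on the API of Winglang, Nitric, Ampt, Shuttle, or any other real product; the `table` and `queue` helpers and the `REQUIRED_RESOURCES` registry are hypothetical names invented for the illustration.

```python
# Hypothetical sketch: application code declares the resources it uses,
# and a deploy-time step reads the collected declarations.
# None of these names come from a real SDK.

REQUIRED_RESOURCES = []


def queue(name: str) -> dict:
    """Declare a queue at the point of use and return a handle to it."""
    resource = {"type": "queue", "name": name}
    REQUIRED_RESOURCES.append(resource)
    return resource


def table(name: str, key: str) -> dict:
    """Declare a database table at the point of use and return a handle."""
    resource = {"type": "table", "name": name, "key": key}
    REQUIRED_RESOURCES.append(resource)
    return resource


# Application module: business logic and infrastructure needs live together.
orders = table("orders", key="order_id")
notifications = queue("order-notifications")


def place_order(order_id: str) -> str:
    # Real code would write to the table and publish to the queue here.
    return f"stored {order_id} in {orders['name']}, notified via {notifications['name']}"


# A deployment tool could walk REQUIRED_RESOURCES to provision what the app needs.
for resource in REQUIRED_RESOURCES:
    print(resource["type"], resource["name"])
# table orders
# queue order-notifications
```

Because the handles returned by `table` and `queue` are the same objects the business logic uses, the application cannot refer to a resource it has not declared.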

Infrastructure from Code does more than empower developers. It also ensures smooth alignment between building infrastructure and using it. Deployments often fail because a change on one side, infrastructure or application, isn’t matched on the other. Many of these failures can be avoided if the compiler or IDE immediately catches mismatches, like writes to an S3 bucket that hasn’t been created.

A major limitation of Infrastructure from Code is that it doesn’t address the concerns of infrastructure that isn’t directly used by a single application. Shared compute and networking resources, and platform services like monitoring and identity management, need to be defined and managed separately. Environment management is also either hand-waved or tightly bound into application code. Enabling developers to manage application infrastructure is powerful, but most non-trivial systems have a broader scope.

Infrastructure as Model

There is an emerging crop of post-code infrastructure automation systems being developed by various startups like System Initiative and ConfigHub. These aim to remove the fiddliness of managing code in repositories. They also close the gap between live infrastructure and the representation that engineers use to define the changes they would like to make.

Infrastructure as Code uses code, whether declarative or imperative, to generate a model of the desired state, compares it with a model of the current state, and then makes changes to the existing infrastructure to converge it with the desired state. These four states (code, desired state, model of current state, and actual state) can get out of sync in various ways that are at best confusing, and at worst lead to a broken state that is painful to correct.
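A toy example shows how these representations can diverge. The `find_drift` function, the field names, and the values are hypothetical, and real tools track far richer state than a flat dictionary.

```python
# Toy illustration of how the four representations can diverge.
# The field names and values are hypothetical.

def find_drift(code: dict, desired: dict, recorded: dict, actual: dict) -> list:
    """List which adjacent pairs of representations disagree."""
    pairs = [
        ("code vs desired state", code, desired),
        ("desired state vs model of current state", desired, recorded),
        ("model of current state vs actual state", recorded, actual),
    ]
    return [label for label, left, right in pairs if left != right]


code = {"instance_type": "large", "replicas": 3}      # values written in the repository
desired = {"instance_type": "large", "replicas": 3}   # model generated from the code
recorded = {"instance_type": "large", "replicas": 3}  # the tool's stored view of reality
actual = {"instance_type": "large", "replicas": 2}    # someone changed it by hand

print(find_drift(code, desired, recorded, actual))
# ['model of current state vs actual state']
```

In this case the tool believes nothing needs to change, because its model of the current state still matches the desired state, even though the live infrastructure has drifted.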

Infrastructure as Model tools center on the model of current and desired state. They provide interfaces for defining the desired state. Demos focus on drag-and-drop GUIs that show how much easier it is to compare desired and current state than it is with traditional infrastructure code. But the real power of these tools is the programmable extensibility of the graph of desired and live state, and of the events involved in building and converging them.
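As a rough illustration of what programmable extensibility could mean, the sketch below holds desired and live state in a small graph and lets extensions subscribe to change events. It is not modeled on System Initiative, ConfigHub, or any other product; the `StateGraph` class and its methods are assumptions made for the example.

```python
# Hypothetical sketch of an extensible state graph; not modeled on the
# API of any real product.

class StateGraph:
    def __init__(self):
        self.nodes = {}     # name -> {"desired": ..., "live": ...}
        self.edges = {}     # name -> list of names it depends on
        self.handlers = []  # callbacks invoked on every desired-state change

    def on_change(self, handler):
        """Register an extension that reacts to changes in the graph."""
        self.handlers.append(handler)

    def set_desired(self, name, config, depends_on=()):
        """Set a node's desired state and notify every registered handler."""
        node = self.nodes.setdefault(name, {"desired": None, "live": None})
        node["desired"] = config
        self.edges[name] = list(depends_on)
        for handler in self.handlers:
            handler(name, node)

    def unconverged(self):
        """Names of nodes whose live state differs from their desired state."""
        return [n for n, s in self.nodes.items() if s["desired"] != s["live"]]


graph = StateGraph()
changes = []
graph.on_change(lambda name, node: changes.append(name))

graph.set_desired("network", {"cidr": "10.0.0.0/16"})
graph.nodes["network"]["live"] = {"cidr": "10.0.0.0/16"}  # pretend it converged
graph.set_desired("web-lb", {"port": 443}, depends_on=["network"])

print(graph.unconverged())  # ['web-lb']
print(changes)              # ['network', 'web-lb']
```

The handler here only records names, but the same hook could run policy checks, cost estimates, or notifications whenever the model changes.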

I confess that I don’t know exactly how Infrastructure as Model will end up being used in practice, but the possibilities are exciting. The implementations I’ve seen so far, as with using general purpose languages for Infrastructure as Code, tend to focus on improving the experience for infrastructure experts. I believe this is because the tools are still in early stages of development and are building the foundational functionality of assembling low-level cloud infrastructure into useful solutions. As Infrastructure as Model implementations evolve, they will hopefully address the higher-level concerns needed to make software delivery processes more effective.

GenAI as infrastructure assistant

Most of the approaches I’ve discussed so far give users a more convenient interface for working with low-level infrastructure resources, but they still require expertise and effort to select, configure, and deploy those resources in the most useful way. LLMs have the potential to support developers by providing this expertise.

Developers can use coding assistants like GitHub Copilot, Amazon CodeWhisperer, Cursor, or Pulumi AI to accelerate their use of old-school Infrastructure as Code. Infrastructure as Model platforms are already incorporating LLM support, and various tools like Firefly Copilot can use GenAI as a natural-language interface for building and managing infrastructure.

Imagine a developer describing their application’s networking requirements so an AI agent can provision and configure the necessary infrastructure resources. They can tell GenAI what they need to use an S3 bucket for, and it selects configuration options that seem appropriate.

If done poorly, tools like this will be AI-assisted ClickOps, leaving organizations with an unmaintainable mess of snowflake infrastructure. There are several challenges to getting reliable, useful results from AI-assisted infrastructure management.

First, the AI needs to understand what the user needs, which means the user needs to be able to explain what they need accurately, clearly, and comprehensively. GenAI will guess to fill in any gaps or ambiguity. As anyone who has worked much with GenAI knows, it’s important to understand the domain well, and to know how to craft and iterate on prompts, to avoid making a mess.

A GenAI-based infrastructure management solution that provides software teams with well-designed prompts and guardrails could help them manage their own infrastructure. The work to provide those tools then looks very similar to providing infrastructure automation with a code-based toolchain, including being done by people with deep expertise in infrastructure. Call it “infrastructure as prompts”.
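One form a guardrail might take is a check that runs on whatever configuration the AI proposes before anything is provisioned. The sketch below is a hypothetical example: the rule set, the configuration keys, and the `check_proposal` function are invented for illustration and do not reflect a real cloud API schema or any vendor’s product.

```python
# Hypothetical guardrail check. The rule set and config keys are
# invented for illustration and are not a real cloud API schema.

GUARDRAILS = {
    "public_access": lambda value: value is False,
    "encryption": lambda value: value in {"aes256", "kms"},
    "versioning": lambda value: isinstance(value, bool),
}


def check_proposal(proposal: dict) -> list:
    """Return a list of guardrail violations for an AI-proposed config."""
    violations = []
    for key, is_allowed in GUARDRAILS.items():
        if key not in proposal:
            violations.append(f"{key}: missing")
        elif not is_allowed(proposal[key]):
            violations.append(f"{key}: value {proposal[key]!r} not allowed")
    return violations


# Pretend this came back from an LLM asked for "a bucket for user uploads".
proposal = {"public_access": True, "encryption": "aes256"}

print(check_proposal(proposal))
# ['public_access: value True not allowed', 'versioning: missing']
```

Someone still has to decide what belongs in `GUARDRAILS`, which is exactly the kind of work that needs infrastructure expertise.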

There are interesting possibilities for using LLMs for managing infrastructure. However, using them to improve the effectiveness of software delivery, as with the other approaches I’ve discussed in this article, requires going beyond optimizing the low-level work of defining infrastructure, whether using code or another interface.

Valuable progress, but not addressing the bigger issues

The solutions here, including developer-friendly infrastructure coding languages, interfaces that work directly with infrastructure state models, and LLM-based AI assistance, all target the low-level tasks of defining and deploying infrastructure resources. They can make those tasks easier for infrastructure experts or for developers.

But none of them are convincing as a solution that will lead to engineering leaders declaring that managing environments and platforms on the cloud is no longer a bottleneck for software delivery.

Yes, Infrastructure as Code is an awkward mechanism for assembling low-level resources into meaningfully useful services and environments for delivering and running software. But replacing code with a more convenient interface for infrastructure experts to use doesn’t address the largest friction points. The biggest gap is between low-level infrastructure details and the solutions needed by software teams.

I’m working on follow-up articles to explore that gap and how to address it. Some initial thoughts:

  • Consider the full value chain between infrastructure management and the value the organization delivers to its users. Who are the users and other stakeholders? Developers are key users, but there are other direct consumers of infrastructure, including testers, release managers, platform service engineers, and support staff, not to mention people who deploy and configure COTS software. Then there are stakeholders concerned with governance, compliance, security, and cost management.
  • What are the journeys involved in using infrastructure? Infrastructure platform and tool stories focus on building new infrastructure. But what about creating new environments? Providing environments for ad-hoc needs? Updating core systems, adding and replacing infrastructure services, troubleshooting, resizing, and optimizing?
  • How can we create an architecture that separates the concerns consumers have in defining and using infrastructure from how that infrastructure is defined at a low level, how it’s configured, deployed, and how changes are made to it?
  • And to wrap it up, how can we empower infrastructure consumers so that they have the control they need over the infrastructure they use to carry out their work easily, reliably, and with no friction? And how can infrastructure providers use their expertise to make sure that the infrastructure consumers use is built correctly, governed, secure, compliant, performant, cost-effective, and otherwise well-managed?

Image by Brett Jordan
