Last week I gave a talk at the operability.io conference. This was a great conference, small (about 170 people), single track. I met a lot of people I know, and a number of people who weren’t the usual DevOpsDays suspects. It had a strong focus on operations, with some excellent talks about organizing and running ops teams, as well as technical topics like logging and security. It probably leaned more towards people-oriented topics than technical-oriented ones.
My own talk, “Automating for Agility, was high level. I wanted to explore the importance of understanding and communicating the outcomes you expect to get from infrastructure automation. In my mind, there are two existential reasons for an organization to consider IT automation. One is to enable fast and continuous change. The other is to empower users of the infrastructure to achieve their goals.
I don’t believe most IT organizations today have either of these goals in mind. There are plenty who pay lip service to self-service for their users, but few who really deliver. In most cases, centralized platform and tool teams make decisions based on what is convenient for themselves, not for their users. They choose tools which help them, as a centralized team, have control over the solution.
O’Reilly has pushed the latest update to the “early access” (i.e. rough draft) of my book-in-progress, Infrastructure as Code. You can buy the book now, getting access to the early access book, and you’ll get the full electronic version of the final release.
This release has three new chapters, and updates to some of the earlier chapters.
Chapter 8 discusses patterns and practices for making changes to servers. This chapter is closely tied to Chapter 3, server management tools, which I had released previously. The process of writing the new chapter led me to reshape the earlier one, in order to get the right balance of which topics belong in which chapter.
Basically, the chapters in Part I, including chapter 3, are intended to lay out how the different types of tools work. The chapters in Part II, including chapter 8, get more into patterns and practices for using the tools as part of an “infrastructure as code” approach.
Aside from improving the structure of the content, this revision to these two chapters clarified for me the idea of four “models” for server change management: ad-hoc changes, continuous synchronization, immutable servers, and containerization.
The other two new chapters in this release kick off Part III of the book, which gets more into the meat of software development practices that are relevant to infrastructure. Chapter 9 describes software engineering practices, drawing heavily on XP (eXtreme Programming) concepts like CI (Continuous Integration). It also discussed practical topics like effective use of VCS (Version Control Systems), including branching strategies, and maintaining code quality and avoiding technical debt.
Chapter 10 is about quality management. This includes obvious topics like TDD (Test Driven Development), but also goes into change management processes, such as CABs (Change Advisory Boards), and structuring work as stories. This is intended to take a higher level view of quality than simply writing automated tests. Automated tests are only one element of a larger strategy for managing changes to infrastructure in a way that helps teams make frequent, rapid changes, while avoiding errors and downtime.
One of the things I’m learning from the process of writing a full-length book is how interrelated the different parts of the book are. Writing new chapters forces me to re-evaluate what I’ve written previously, and re-think how it all fits together.
A key part of the process of improving the book as I write it is feedback from readers. I’m interested in hearing what people who are new to the ideas in this book think. Does the stuff the book says make sense? Is it relevant? Helpful? Confusing? I’m also interested in input from people who’ve already lived infrastructure as code. Do the principles and practices I’ve laid out resonate with you? Do you have different experiences? Are there topics I’ve missed out?
My former colleague Martin Thompson borrowed the term Mechanical Sympathy from Formula One driver Jackie Stewart and brought it to IT. A successful driver like Stewart has an innate understanding of how his car works, so he can get the most out of it and avoid failures. For an IT professional, the deeper and stronger an understanding we have of how the system works down the stack and into the hardware, the more proficient we’ll be at getting the most from it.
The history of computing has been a continuing process of adding new layers of abstraction. Operating systems, programming languages, and now virtualization have each helped us to be more productive by simplifying the way we interact with computer systems. We don’t need to worry about which CPU register to store a particular value in, we don’t need to think about how to allocate heap to different objects to avoid overlapping them, and we don’t care which hardware server a particular virtual machine is running on.
Except when we do.
Hardware still lurks beneath our abstractions, and understanding what happens behind the facade of APIs and virtual CPU units is useful. It can help us to build systems that gracefully handle hardware failures, avoid hidden performance bottlenecks, and exploit potential sympathies - tweaks that make the software align with the underlying systems to work more reliably and effectively than naively written software would.
For example, the Netflix team knew that a percentage of AWS instances, when provisioned, will perform much worse than the average instance, whether because of hardware issues or simply because they happen to be sharing hardware with someone else’s poorly behaving systems. So they wrote their provisioning scripts to immediately test the performance of each new instance. If it doesn’t meet their standards, the script destroys the instance and tries again with a new instance.
Understanding the typical memory and CPU capacity of the hardware servers used by your platform can help to size your VMs to get the most out of them. Some teams choose their AWS instance size to maximize their chances of getting a hardware server entirely to themselves, even when they don’t need the full capacity.
Understanding storage options and networking is useful to ensure that disk reads and writes don’t become a bottleneck. It’s not a simple matter of choosing the fastest type of storage option available - selecting high performance local SSD drives may have implications for portability, cost, and even availability of resources.
This extends up and down the stack. Software and infrastructure should be architected, designed, and implemented with an understanding of the true architecture of the hardware, networking, storage, and the infrastructure management platform.
An infrastructure team should seek out and read through every whitepaper, article, conference talk and blog post they can find about the platform they’re using. Bring in experts from your vendor to review your systems, from high level architecture down to implementation details. Be sure to ask questions about how your designs and implementation will work with their physical infrastructure.
If you’re using virtualization infrastructure managed by your own organization then there’s no excuse not to collaborate. Make sure your software and infrastructure are designed holistically, and continuously measured and modified to get the best performance and reliability possible.
I’m excited to announce that an “Early Access” (Raw and Unedited) version of the Infrastructure as Code book is now available for purchase from the O’Reilly shop!
This includes the first three chapters of the book. People who buy the early access edition will get updates as I add more chapters and revise the existing ones. When the final version of the book is published, you’ll get the final ebook bundle.
I ran my current draft of the book through wordle and got this: