Update to early access Infrastructure as Code book
O’Reilly has pushed the latest update to the “early access” (i.e. rough draft) of my book-in-progress, Infrastructure as Code. You can buy the book now, getting access to the early access book, and you’ll get the full electronic version of the final release.
This release has three new chapters, and updates to some of the earlier chapters.
Chapter 8 discusses patterns and practices for making changes to servers. This chapter is closely tied to Chapter 3, server management tools, which I had released previously. The process of writing the new chapter led me to reshape the earlier one, in order to get the right balance of which topics belong in which chapter.
Basically, the chapters in Part I, including chapter 3, are intended to lay out how the different types of tools work. The chapters in Part II, including chapter 8, get more into patterns and practices for using the tools as part of an “infrastructure as code” approach.
Aside from improving the structure of the content, this revision to these two chapters clarified for me the idea of four “models” for server change management: ad-hoc changes, continuous synchronization, immutable servers, and containerization.
The other two new chapters in this release kick off Part III of the book, which gets more into the meat of software development practices that are relevant to infrastructure. Chapter 9 describes software engineering practices, drawing heavily on XP (eXtreme Programming) concepts like CI (Continuous Integration). It also discussed practical topics like effective use of VCS (Version Control Systems), including branching strategies, and maintaining code quality and avoiding technical debt.
Chapter 10 is about quality management. This includes obvious topics like TDD (Test Driven Development), but also goes into change management processes, such as CABs (Change Advisory Boards), and structuring work as stories. This is intended to take a higher level view of quality than simply writing automated tests. Automated tests are only one element of a larger strategy for managing changes to infrastructure in a way that helps teams make frequent, rapid changes, while avoiding errors and downtime.
One of the things I’m learning from the process of writing a full-length book is how interrelated the different parts of the book are. Writing new chapters forces me to re-evaluate what I’ve written previously, and re-think how it all fits together.
A key part of the process of improving the book as I write it is feedback from readers. I’m interested in hearing what people who are new to the ideas in this book think. Does the stuff the book says make sense? Is it relevant? Helpful? Confusing? I’m also interested in input from people who’ve already lived infrastructure as code. Do the principles and practices I’ve laid out resonate with you? Do you have different experiences? Are there topics I’ve missed out?
Mechanical Sympathy and the Cloud
My former colleague Martin Thompson borrowed the term Mechanical Sympathy from Formula One driver Jackie Stewart and brought it to IT. A successful driver like Stewart has an innate understanding of how his car works, so he can get the most out of it and avoid failures. For an IT professional, the deeper and stronger an understanding we have of how the system works down the stack and into the hardware, the more proficient we’ll be at getting the most from it.
The history of computing has been a continuing process of adding new layers of abstraction. Operating systems, programming languages, and now virtualization have each helped us to be more productive by simplifying the way we interact with computer systems. We don’t need to worry about which CPU register to store a particular value in, we don’t need to think about how to allocate heap to different objects to avoid overlapping them, and we don’t care which hardware server a particular virtual machine is running on.
Except when we do.
Hardware still lurks beneath our abstractions, and understanding what happens behind the facade of APIs and virtual CPU units is useful. It can help us to build systems that gracefully handle hardware failures, avoid hidden performance bottlenecks, and exploit potential sympathies - tweaks that make the software align with the underlying systems to work more reliably and effectively than naively written software would.
For example, the Netflix team knew that a percentage of AWS instances, when provisioned, will perform much worse than the average instance, whether because of hardware issues or simply because they happen to be sharing hardware with someone else’s poorly behaving systems. So they wrote their provisioning scripts to immediately test the performance of each new instance. If it doesn’t meet their standards, the script destroys the instance and tries again with a new instance.
Understanding the typical memory and CPU capacity of the hardware servers used by your platform can help to size your VMs to get the most out of them. Some teams choose their AWS instance size to maximize their chances of getting a hardware server entirely to themselves, even when they don’t need the full capacity.
Understanding storage options and networking is useful to ensure that disk reads and writes don’t become a bottleneck. It’s not a simple matter of choosing the fastest type of storage option available - selecting high performance local SSD drives may have implications for portability, cost, and even availability of resources.
This extends up and down the stack. Software and infrastructure should be architected, designed, and implemented with an understanding of the true architecture of the hardware, networking, storage, and the infrastructure management platform.
An infrastructure team should seek out and read through every whitepaper, article, conference talk and blog post they can find about the platform they’re using. Bring in experts from your vendor to review your systems, from high level architecture down to implementation details. Be sure to ask questions about how your designs and implementation will work with their physical infrastructure.
If you’re using virtualization infrastructure managed by your own organization then there’s no excuse not to collaborate. Make sure your software and infrastructure are designed holistically, and continuously measured and modified to get the best performance and reliability possible.
Early Release is available!
I’m excited to announce that an “Early Access” (Raw and Unedited) version of the Infrastructure as Code book is now available for purchase from the O’Reilly shop!
This includes the first three chapters of the book. People who buy the early access edition will get updates as I add more chapters and revise the existing ones. When the final version of the book is published, you’ll get the final ebook bundle.
Wordle of the current draft
I ran my current draft of the book through wordle and got this:
The Automation Fear Spiral
At an open spaces session on configuration automation at a DevOpsDays a year or two ago, I asked the group how many of them were using an automation tools like Puppet or Chef. The majority of hands went up. I asked how many were running these tools unattended, on an automatic schedule. Quite a few of the hands went down.
Many people have the same problem I had in my early days of using automation tools. I used automation selectively, for example to help build new servers, or to make a specific configuration change. I tweaked the configuration each time I ran it, to suit the particular task I was doing.
I was afraid to turn my back on my automation tools, because I lacked confidence in what they would do.
I lacked confidence in my automation because my servers were not consistent.
My servers were not consistent because I wasn’t running automation frequently and consistently.
This is the automation fear spiral. Infrastructure teams need to break this spiral to use automation successfully.
The most effective way to break the spiral is to face your fears. Pick a subset of your servers, choose one or two things on them to put under configuration management, and then schedule these to be applied unattended, at least once an hour.
Starting small and simple helps to limit the risk. The important thing is that you take the leap - let the automation go on its own, without you holding its hand. Then you can increase the scope - add a few more parts of the server to the automation, and extend it to more servers.
You should also build out other things that will give you more confidence, piece by piece. Add monitoring to detect when the thing you’ve automated goes wrong. Set up Continuous Integration to automatically test the configuration change every time you commit. Again, getting these pieces in place with a small, simple set of configuration, creates the platform. You can then extend the coverage with confidence.