Mechanical Sympathy and the Cloud

My former colleague Martin Thompson borrowed the term Mechanical Sympathy from Formula One driver Jackie Stewart and brought it to IT. A successful driver like Stewart has an innate understanding of how his car works, so he can get the most out of it and avoid failures. For an IT professional, the deeper and stronger an understanding we have of how the system works down the stack and into the hardware, the more proficient we’ll be at getting the most from it.

The history of computing has been a continuing process of adding new layers of abstraction. Operating systems, programming languages, and now virtualization have each helped us to be more productive by simplifying the way we interact with computer systems. We don’t need to worry about which CPU register to store a particular value in, we don’t need to think about how to allocate heap to different objects to avoid overlapping them, and we don’t care which hardware server a particular virtual machine is running on.

Except when we do.

Hardware still lurks beneath our abstractions, and understanding what happens behind the facade of APIs and virtual CPU units is useful. It can help us to build systems that gracefully handle hardware failures, avoid hidden performance bottlenecks, and exploit potential sympathies - tweaks that make the software align with the underlying systems to work more reliably and effectively than naively written software would.

Hardware like CPUs and hard drives appear local to the VM, but may not be what they seem.

For example, the Netflix team knew that a percentage of AWS instances, when provisioned, will perform much worse than the average instance, whether because of hardware issues or simply because they happen to be sharing hardware with someone else’s poorly behaving systems. So they wrote their provisioning scripts to immediately test the performance of each new instance. If it doesn’t meet their standards, the script destroys the instance and tries again with a new instance.

Understanding the typical memory and CPU capacity of the hardware servers used by your platform can help to size your VMs to get the most out of them. Some teams choose their AWS instance size to maximize their chances of getting a hardware server entirely to themselves, even when they don’t need the full capacity.

Understanding storage options and networking is useful to ensure that disk reads and writes don’t become a bottleneck. It’s not a simple matter of choosing the fastest type of storage option available - selecting high performance local SSD drives may have implications for portability, cost, and even availability of resources.

This extends up and down the stack. Software and infrastructure should be architected, designed, and implemented with an understanding of the true architecture of the hardware, networking, storage, and the infrastructure management platform.

An infrastructure team should seek out and read through every whitepaper, article, conference talk and blog post they can find about the platform they’re using. Bring in experts from your vendor to review your systems, from high level architecture down to implementation details. Be sure to ask questions about how your designs and implementation will work with their physical infrastructure.

If you’re using virtualization infrastructure managed by your own organization then there’s no excuse not to collaborate. Make sure your software and infrastructure are designed holistically, and continuously measured and modified to get the best performance and reliability possible.

Early Release is available!

I’m excited to announce that an “Early Access” (Raw and Unedited) version of the Infrastructure as Code book is now available for purchase from the O’Reilly shop!

This includes the first three chapters of the book. People who buy the early access edition will get updates as I add more chapters and revise the existing ones. When the final version of the book is published, you’ll get the final ebook bundle.

Wordle of the current draft

I ran my current draft of the book through wordle and got this:

Wordle of the book

The Automation Fear Spiral

At an open spaces session on configuration automation at a DevOpsDays a year or two ago, I asked the group how many of them were using an automation tools like Puppet or Chef. The majority of hands went up. I asked how many were running these tools unattended, on an automatic schedule. Quite a few of the hands went down.

Many people have the same problem I had in my early days of using automation tools. I used automation selectively, for example to help build new servers, or to make a specific configuration change. I tweaked the configuration each time I ran it, to suit the particular task I was doing.

I was afraid to turn my back on my automation tools, because I lacked confidence in what they would do.

I lacked confidence in my automation because my servers were not consistent.

My servers were not consistent because I wasn’t running automation frequently and consistently.

Automation fear spiral

This is the automation fear spiral. Infrastructure teams need to break this spiral to use automation successfully.

The most effective way to break the spiral is to face your fears. Pick a subset of your servers, choose one or two things on them to put under configuration management, and then schedule these to be applied unattended, at least once an hour.

Starting small and simple helps to limit the risk. The important thing is that you take the leap - let the automation go on its own, without you holding its hand. Then you can increase the scope - add a few more parts of the server to the automation, and extend it to more servers.

You should also build out other things that will give you more confidence, piece by piece. Add monitoring to detect when the thing you’ve automated goes wrong. Set up Continuous Integration to automatically test the configuration change every time you commit. Again, getting these pieces in place with a small, simple set of configuration, creates the platform. You can then extend the coverage with confidence.

What is Infrastructure as Code?

Systems administrators have probably been writing scripts and tools to make their jobs easier since day one. When I joined my first systems team, the mantra “let the machines do the boring work” was drilled into my head. The popularization of cloud (IaaS) and virtualized infrastructure over the past decade has given us the opportunity to automate the boring work in some very different, and cool ways.

We have moved from the Iron Age of infrastructure and into the Cloud Era. The resources of an IT infrastructure - compute, storage, and networking - have been transmogrified from inflexible physical hardware into dynamic software and data. In the Iron Age, the interface for creating a new server was a purchasing request form. In the Cloud Era, it’s a programmable API.

So. If our infrastructure is now software and data, manageable through an API, then this means we can bring tools and ways of working from software engineering and use them to manage our infrastructure. This is the essence of Infrastructure as Code.

In software engineering we write code, keep it in a Version Control System (VCS) and automatically test it with Continuous Integration (CI). We and deploy it to a series of environments using a Deployment Pipeline so it can be validated before putting it into production use. Now we can easily do the same with infrastructure.

Infrastructure as Code doesn’t only apply to cloud and virtualized infrastructures, it can also be used with “bare metal” infrastructure. I’ve worked with teams who’ve implemented dynamic hardware system management. We used Cobbler to automatically install servers using PXE-boot installation over a network. When we implemented this we had to start the installation at the physical server (boot the server and hold down F12, in our case). But many hardware vendors have Lights Out Management functionality that makes it possible to do this remotely, and even automatically.