Reducing 2AM headaches Part 3: Resiliency

The title of this series underscores our motivation for building a toolbox for system management: silencing the pager. In the first part of the series, we discussed the importance of standardization; in the second, automation. As we conclude the series, we turn our focus to resiliency.

Operations management aims to keep failures to a minimum while increasing efficiency. The systems we manage are complex chains of interconnected processes that sometimes fail. The average time a component runs between failures is known as the mean time between failures (MTBF). In a chained system, availability compounds: every component must be up for the system to be up, so each component's availability multiplies against the rest and drags down the availability of the overall system. Parallel systems, unlike chained systems, have higher availability, since the system counts as available so long as either of the two parallel members is up. Users want their MP3s to play, their photos to print, their updates to reach their friends and families. The “nines” express availability as an amount of allowable downtime over a time period: how long users will put up with no lights or dial tone. “Two nines” is 99%, or about 3.65 days of downtime a year; “five nines” is 99.999%, or about five and a quarter minutes a year.
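The availability math above is simple enough to sketch directly. This is a minimal illustration (the function names are mine, not from any library): chained availabilities multiply, parallel systems fail only when every member fails, and each "nine" maps to a yearly downtime budget.

```python
# Illustrative availability math for chained (serial) vs. parallel systems,
# and the downtime budget implied by each "nine".

MINUTES_PER_YEAR = 365.25 * 24 * 60

def serial_availability(components):
    """In a chain, every component must be up: availabilities multiply."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(components):
    """In parallel, the system is down only if every member is down."""
    downtime = 1.0
    for a in components:
        downtime *= (1.0 - a)
    return 1.0 - downtime

def downtime_minutes_per_year(availability):
    return (1.0 - availability) * MINUTES_PER_YEAR

# Three 99% components in a chain drop the system below 99% overall...
print(serial_availability([0.99, 0.99, 0.99]))    # ~0.9703
# ...while two 99% members in parallel beat either one alone.
print(parallel_availability([0.99, 0.99]))        # 0.9999
# "Two nines" vs. "five nines" as yearly downtime:
print(downtime_minutes_per_year(0.99))            # ~5260 minutes (~3.65 days)
print(downtime_minutes_per_year(0.99999))         # ~5.26 minutes
```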

Regardless of the system reliability math, the “belt and suspenders” mentality that pervades operations is based on the driving theme of uptime. Just about everything we do is to mitigate, prevent, manage or discover failures or failure-inducing conditions. Standard operating environments reduce the variables that could cause issues in practice or troubleshooting.  Automation reduces human interactions that can introduce drift.  Monitoring alerts us to changes from the expected system behaviors.  Written methods of procedure ensure a clear understanding of actions taken during a maintenance window.  Backups provide recovery from catastrophic failures.  High availability clustering provides parallel environments to increase our reliability.

Recognize that failures are inevitable. The only thing we can control is our response.

Changing our vantage point

I'd like to offer up a challenge: where we've been architecting around reliability, we should be building for recovery. Mean time to recover (MTTR) measures how long it takes to restore service, which, quite frankly, is more important than eliminating potential failure points. Service Level Agreements (SLAs) and, perhaps more importantly, the corresponding penalties are based on downtime metrics, not uptime. This makes recovery time our keystone for designs, not reliability. A simple system with fewer components that fails once a month and takes five minutes to restore service has the same uptime 'score' as a highly complex fail-over architecture that accumulates ten 30-second interruptions over the same period.
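The equivalence is easy to check with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). A quick sketch (the scenario numbers are illustrative, all values in minutes): one five-minute outage a month and ten 30-second interruptions spend the same downtime budget.

```python
# Steady-state availability as a function of MTBF and MTTR (in minutes).

MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def availability(mtbf, mttr):
    """Fraction of time in service: uptime / (uptime + repair time)."""
    return mtbf / (mtbf + mttr)

# Simple system: one 5-minute outage per month.
simple = availability(MONTH_MINUTES - 5, 5)
# Complex fail-over architecture: ten 30-second interruptions per month.
complex_ = availability((MONTH_MINUTES - 5) / 10, 0.5)

print(simple, complex_)  # both ~0.999884: an equal downtime budget
```

Same score, very different operational experience, which is exactly why the recovery model deserves the design attention.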

Taking the Conversation Off Road

Changing to a service resiliency model instead of a failure survivability model impacts the choices we make in architecture and our tooling.  Let's look outside the software arena and examine two prominent auto manufacturers who approach this idea in different ways: Jeep and Rolls Royce.

Designed for the US military as an all-terrain reconnaissance vehicle, the Jeep was built mainly from off-the-shelf automotive components. The design allowed for quick modification and repair: a modern Jeep can be disassembled and reassembled by an Army drill team in under four minutes. Showy, perhaps, but it gets to the core of the design. Given the use case, part failure is inevitable, so Jeeps need to be easy to recover.

On the other hand, the iconic British luxury car Rolls Royce is designed around long duty cycles in less harsh conditions. Specialized electronics, engine, and interior components are built to increase lifecycle reliability. While this can mean long and expensive repairs, Rolls Royce has earned a brand reputation for the highest-quality parts.

Software Engineering Tools

From the design table to the shop floor, it’s clear these two automotive icons differ in major ways. Tolerances on parts and assembly, material choices, and the methods and tools used during assembly all vary widely based on the guiding choice of recovery versus reliability. The analogy holds in software engineering as well: we care about different things if a single component failure in the chain doesn't bring our application to a grinding halt. These design choices also change as new technologies emerge. Virtualization, for instance, provides new options for recovery and resiliency, while cloud computing offers a similar but distinct set of options and challenges.

Many of the tools remain the same, but how we apply them to our design choices will change. Monitoring will alert us to problem components that need to be investigated and returned to the pool of available resources. An automated configuration management system can correct drift without human intervention. Load balancers can respond to increases in load elastically, drawing on pools of available resources.
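The recovery-first workflow those tools enable can be sketched as a simple control loop. This is a hypothetical illustration — the node model, the desired-state dictionary, and every function name here are mine, not a real tool's API: drifted or unhealthy nodes are remediated and returned to the pool rather than paged about.

```python
# Hypothetical sketch of a self-healing pool: correct configuration drift
# on each member and return remediated nodes to rotation, no human needed.

from dataclasses import dataclass, field

# Assumed desired state; a real system would pull this from its CM tooling.
DESIRED_CONFIG = {"ntp": "enabled", "agent_version": "2.4"}

@dataclass
class Node:
    name: str
    config: dict = field(default_factory=dict)
    healthy: bool = True

def reconcile(node):
    """Converge a node on the desired configuration; return the drift found."""
    drift = {k: v for k, v in DESIRED_CONFIG.items() if node.config.get(k) != v}
    node.config.update(drift)
    return drift

def heal_pool(pool):
    """Remediate every node and return the full pool to rotation."""
    for node in pool:
        drift = reconcile(node)
        if drift or not node.healthy:
            node.healthy = True  # assumption: remediation restores health
    return pool

pool = heal_pool([Node("web1", dict(DESIRED_CONFIG)),
                  Node("web2", {"ntp": "disabled"}, healthy=False)])
print([(n.name, n.healthy) for n in pool])  # both back in service
```

The design point isn't the code; it's that remediation and return-to-pool happen inside the loop, so a single member's failure never becomes a 2AM page.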

There are other factors, to be sure; inefficient processes for getting the right people involved probably waste the most time in an outage. But a recoverable system that lets techs work around and repair the failed component without impacting service availability will tack on that extra 9 much faster.

Sounds like the cloud, neh?