Do you need an umbrella or a lifeboat?
How do you prepare for things that just keep on going wrong? And how do you prepare for the day when everything goes wrong?
Last week I attempted to distinguish between reliability and resilience, claiming that reliability is the ability to keep services running despite routine failures, while resilience is the ability to restore essential services despite unexpected catastrophes. But that basic definition is not quite enough to disentangle these related but distinct topics. In this article, I’ll explore three more important differences between reliability and resilience.
Reliability protects normality; resilience strives for survival
We build reliability into systems because we know that things will go wrong: servers will overheat, discs will fail, software will crash. These are not things that might happen: they are things that will happen. The aim of reliability is to build systems which keep going despite these routine failures: because failures like these are normal, the service we provide to customers should stay normal too.
We build resilience into systems because we know that things might go wrong in big, bad ways. We might suffer a data centre failure and, when we try to fail over to the other data centre, find that the failover doesn’t work. Our cloud provider might suffer a control plane failure which takes out the platform - to the extent that we don’t know when it’s coming back. Our network perimeter may have been breached by an attacker who has spread ransomware across our infrastructure - and it has just been activated. In these scenarios, it is impossible to maintain normality: someone is going to notice that something has gone wrong. Indeed, we must tell people that something has gone wrong, and what services they can expect while we respond.
If we think of all types of failure as rare and exceptional, then we compromise reliability: we bolt on responses to failure, rather than build them in. If we think that we must preserve normality in every possible scenario, then we seek a level of resilience which is impossible to achieve, and which might never be needed.
Reliability gives you an umbrella: a way to go about your daily business despite the rain. Resilience gives you a lifeboat: a way to survive despite the storm.
Reliability should be routine; resilience requires imagination
The failures which we address through reliability are familiar and predictable. We may not know exactly which server, network component or storage array will fail, or exactly when it will fail, but we know that they will fail. That’s what they do. This means that we can develop highly codified responses to these failures, and build them into our infrastructure and applications.
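As an illustration of what a codified response might look like, here is a minimal sketch of one common pattern: retrying a call with exponential backoff when a transient failure occurs. The function name retry, its parameters, and the choice of ConnectionError as the "routine failure" are all assumptions made for this example, not a reference to any particular library.

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with exponential backoff.

    Hypothetical helper for illustration: the names and defaults are
    assumptions, not part of any specific framework.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Give up and re-raise once the final attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

The sleep function is passed in as a parameter so that the waiting behaviour can be replaced in tests. A production version would probably also add jitter and a cap on the delay.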
By contrast, the types of failure which we address through resilience are highly unpredictable. For example, although we know that, despite our best efforts, data centres sometimes lose power, this does not happen so often that we have enough data to predict its frequency. We have even less data for more extreme and exotic forms of failure, such as simultaneous data centre outages or disruption to national infrastructure. This does not mean that we can ignore such eventualities: rather, it means that we must be imaginative about what could go wrong, and how our resilience measures would help us respond in each situation.
Reliability needs testing; resilience requires rehearsal
Because reliability addresses failures that happen all the time, we should deal with it in the same way that we deal with anything that happens all the time: we should write some code. Companies aiming to test reliability often consider adopting the practice of chaos engineering: using tools to trigger failures within their environments to see how they cope. However, most traditional companies conclude that they are not ready for such a practice: they don’t yet have sufficient confidence in their reliability (or in their ability to explain why they deliberately broke something in production). Fortunately, they don’t have to start with such an advanced capability: it’s possible (especially in software-defined, API-managed environments such as public cloud) to test for common modes of failure in your deployment pipeline. And, just like any code, if we build a certain type of behaviour (the ability to operate in the event of a routine failure), but don’t test it, then our job is not done.
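To make this concrete, here is a minimal sketch of a pipeline-friendly failure test: it injects a fault into one of two backends and checks that a request is still served. FlakyBackend, call_with_failover and the test itself are hypothetical names invented for this example. A real test would target your own service and inject faults through your platform's APIs.

```python
class FlakyBackend:
    """Hypothetical stand-in for a dependency into which we can inject a fault."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            # Simulate a routine failure, such as an unreachable server.
            raise ConnectionError(f"{self.name} is unavailable")
        return f"{self.name} handled {request}"


def call_with_failover(backends, request):
    """Try each backend in order and return the first successful response."""
    last_error = None
    for backend in backends:
        try:
            return backend.handle(request)
        except ConnectionError as error:
            last_error = error
    raise RuntimeError("all backends failed") from last_error


def test_survives_single_backend_failure():
    # Inject the fault: the primary is down before the request arrives.
    backends = [FlakyBackend("primary", healthy=False), FlakyBackend("secondary")]
    response = call_with_failover(backends, "GET /status")
    assert response == "secondary handled GET /status"
```

A test runner such as pytest would collect the test function automatically, so a check like this can run on every deployment rather than waiting for a chaos experiment in production.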
By contrast, because resilience addresses major failures which rarely occur, and which will often be unique in nature, it’s not possible to code tests. Even if we could, these tests would only address the behaviour of systems, not the behaviour of people. And, in a genuine catastrophe, it is the behaviour of people that matters most.
The ransomware that has been lurking in our network for months has been activated. Do we pay the ransom? Our capacity has been cut by two thirds and we can’t run all services. Which services do we keep running? Whatever we do, there is going to be an impact on our customers. What do we tell them? And when? And how?
If you are figuring out the answers to these questions in the middle of the failure, then you will probably make mistakes. It’s not possible to anticipate every scenario, but regularly rehearsing scenarios builds the human capability to respond, and makes it easier to resist panic.
Together, these differences between reliability and resilience teach us one final, important lesson: although reliability depends on characteristics of underlying platforms and infrastructure, it can largely be addressed within the boundaries of a team. You can do your best to ensure that your service continues to respond to API calls, continues to process data, or continues to present an interface to customers. Resilience, by contrast, is an attribute of enterprises, not teams. You need to decide at the enterprise level what matters most, who gets to make decisions, what level of risk you will accept, what investments you will make, and who steps back and who steps forward in the event of a catastrophe.
Reliability is an engineering problem; resilience is a strategic need. Both were treated as afterthoughts in enterprise computing for many years, until we became so dependent on computer systems that they could not be ignored. The good news is that reliability and resilience turn out to be important, interesting and challenging problems - as long as we remember the difference between them.