David Knott 14/09/2023 David Knott 14/09/2023

Three predictions: probably wrong; possibly useful

Why make predictions about technology at all? I am writing this from a train which was predicted to be at its destination half an hour ago, but is still stuck between stations. If we can’t make accurate predictions about a well-known system with years worth of data, how could we possibly make predictions about the ever-changing field of technology?

Yet the process of attempting to make predictions tells us something, even if individual predictions are wrong. This particular train might not arrive as predicted, but I expect that there is a model somewhere which predicts the overall number of trains that will be delayed - and this delay may be consistent with that prediction. Somebody had to develop that model, and that process will have found something interesting about the factors that affect the reliability of trains.

Similarly, while it’s hard to make predictions about the future of technology, the process of attempting to figure out what’s coming next can be useful. It forces us to use our imaginations, to think about the consequences of choices we are making today, and to recognise the limitations of our own knowledge.

David Knott 24/08/2023 David Knott 24/08/2023

Which is more scary: a disaster or your disaster recovery plan?

Most organisations above a certain size have technology disaster recovery plans: plans for what they will do when something goes wrong, such as a fire, flood or power failure. These plans are often elaborate and expensive, involving redundant equipment and facilities in geographically separated locations. Organisations don’t run pairs of data centres because they enjoy running data centres: they do so against the day when one of the data centres isn’t there any more.

However, despite all of this preparation and expense, many organisations have two problems which mean that their disaster recovery plans may be no use in a real disaster. First, their disaster recovery plans aren’t really plans to recover from disasters. Second, their disaster recovery plans are just plans.

David Knott 11/08/2022 David Knott 11/08/2022

Not just someone else’s computer: to understand cloud, go back to the start

Do you believe that the cloud is just somebody else’s computer? If so, then I have to disagree with you.

I could say that this is because a computer is just a machine, whereas a cloud is a fully architected, software defined, API managed platform, that is at least as much software as it is hardware. However, I think that we can find a more interesting answer by going back to the origins of cloud.

Most on-premise computing architectures have grown organically over decades, and include principles, patterns, components and capabilities from other eras. The code running on that mainframe at the heart of your estate may have been written before it was normal for systems to run across large numbers of machines in parallel. The processes to procure, configure and manage servers may have been created before we realized that dev and ops belong together as a shared set of accountabilities. The security measures that protect your assets may have been implemented before it was the standard to connect most of your systems to a global public network.

David Knott 21/07/2022 David Knott 21/07/2022

Do you need an umbrella or a lifeboat?

How do you prepare for things that just keep on going wrong? And how do you prepare for the day when everything goes wrong?

Last week I attempted to distinguish between reliability and resilience, claiming that reliability is the ability to keep services running despite routine failures, while resilience is the ability to restore essential services despite unexpected catastrophes. But that basic definition is not quite enough to disentangle these related but distinct topics. In this article, I’ll explore three more important differences between reliability and resilience.

Reliability protects normality; resilience strives for survival

David Knott 14/07/2022 David Knott 14/07/2022

Do you know the difference between reliability and resilience?

If you want to know the difference between reliability and resilience, look to the Moon. Specifically, look at the two best known Moon missions, Apollo 11 and Apollo 13.

Although Apollo 11 was famously successful, this was almost not the case. In the final minutes of the descent, the guidance computer crashed repeatedly, throwing error after error at the two astronauts, who waited tensely on instructions from Mission Control, and wondered whether to abort the mission.

Later, it was found that the computer was receiving unexpectedly large amounts of data from one of the instruments, overloading its memory and processing capacity. Yet, despite the nerve wracking series of errors, the computer behaved exactly as designed. When overwhelmed, it displayed an error message and restarted itself, giving priority to the most important programmes.

David Knott 14/02/2019 David Knott 14/02/2019

Planning for a cloudy day

On 3rd June 1979, the American Institute of Architects held their annual national conference at the Kemper Arena in Kansas City, a building they had honoured with an award.

Less than 24 hours later, the roof collapsed. Fortunately, no-one was hurt.

The cause of the collapse was partly due to excessive rainfall during a major storm. The roof has been built with rain in mind, with a drainage system designed to release water gradually into the sewer system to avoid overwhelming it. What it had not been designed for, however, was the day when there was so much rain that the sewer system was already overwhelmed, causing water to back up onto the roof. Water is heavy, and the additional weight, coupled with high winds, caused a supporting bolt to give way, triggering a series of further failures and the collapse of the roof.

David Knott 22/11/2018 David Knott 22/11/2018

From design conflict to design harmony on cloud

The dome of St Paul’s is an iconic part of the London skyline: it’s instantly recognisable, and can be seen from far away.

What’s harder to see, though, is that the famous building is also a famous architectural compromise. Sir Christopher Wren’s original plan was for a floorplan shaped like a symmetrical cross, in line with the classical inspiration for his design. The clergy disagreed, though, and insisted on the more traditional layout for an English church, with an extended nave. As a result, we have a classical dome on top of a medieval floorplan.

Most design involves compromises like this: we have design goals (build a beautiful structure following classical ideals; demonstrate visual continuity with church tradition) which cannot be satisfied simultaneously, and we have to figure out which goals we will give priority to.

One of the things that attracts me to Cloud, though, is that design goals which have traditionally conflicted can be brought into harmony.