Technologists are always crying wolf (because of all the wolves)

Photo credit: Thomas Bonometti via Unsplash

The computer had failed. Unfortunately, it was the Apollo Guidance Computer (AGC), the machine that controlled the flight of a small, fragile spacecraft to the Moon and back. Fortunately, it wasn’t in space: it was on the ground, in a simulator.

Margaret Hamilton, the leader of the MIT team programming the AGC, often had to work weekends to meet the urgent schedule of the Apollo programme, and sometimes brought her daughter, Lauren, to work with her. Lauren liked to play in the simulator.

Somehow, while in simulated spaceflight, Lauren had caused the AGC to jettison all of its navigational data. When Hamilton investigated, she found that Lauren had told the computer to load program 01: the program that prepared the craft for launch. The computer did what it was told: it forgot all of the data about the simulated flight in progress, and reset as if it was sitting on the launchpad.

Hamilton realised that if the mission had been real, rather than a simulation, the Command Module would have been lost, drifting through space with no idea of where it was. She tried to persuade NASA to build safeguards and controls into the system, but they told her that they didn’t have time - and, besides, astronauts don’t make mistakes. All she could do was add a note to the manual: ‘Do not select program 01 during spaceflight.’

On the very next flight, an astronaut made a mistake. Jim Lovell was part of the crew of Apollo 8, the first mission to orbit the Moon. On the way back to Earth, after several days in cramped conditions with little sleep, he was entering star positions into the computer. He was supposed to enter the program number, 23, and then the star whose position he wanted to record. On one of these cycles, though, instead of selecting program 23, he entered the number of the star first. It was number 01.

The computer behaved just as it had on the ground. It forgot all of its navigational data, and reset itself as if ready for launch. It took a tense half hour of manual observation, communication with Mission Control and careful data entry to reconstruct the data and bring the craft back under control - an experience Lovell would relive later when he commanded Apollo 13.

NASA agreed to let Hamilton and her team build more error handling into the AGC. It helped save the Moon landing when the computer became overloaded in the last minutes of Apollo 11’s descent.

This might seem like a cautionary tale from the early days of computing. Back then, it may have seemed reasonable that trained experts would not make mistakes and that computers would not go wrong. Today, surely, we know better.

And yet . . .

I believe that Hamilton’s experience is replicated today, in thousands - perhaps millions - of routine decisions about computer systems. Some of these decisions are deliberate and overt, but many more are passive and silent.

The deliberate decisions typically appear in the design and build phases of development. The architect asks the sponsor what level of availability they would like to have, and the sponsor naturally replies that they would like 100% availability. Then the architect shows them the cost, and they change their mind. Do we really need that level of redundancy? Do we really need to back up the data to a different location? And, as the system approaches launch, and time is crunched, they start to ask different questions. Do we really need to spend that much effort on testing? If it is coded properly, won't it just work? The architect and the product manager try to explain everything that could go wrong, but it doesn't seem real - unlike the time, money and resources which are leaking away.

However, the most dangerous choices are those which are not taken out loud. They are the implicit choices not to maintain currency, apply upgrades and patches, or sustain a team that can continuously improve a product. They are the choices which manifest in risk registers which are slowly turning red, but which are not used to drive action. Why spend time, effort and resources on something which does not appear to be broken?

Our challenge is that the business sponsor's reasonable instincts often appear to be right - for a time. Systems run for remarkably long periods without failing. Attacks and breaches - and their consequences - may not immediately be apparent, or may never come to light at all. Disasters rarely strike - and when they do, they most frequently take the form of unspectacular power and network failures rather than floods and fires. It is easy to see why many business sponsors come to believe that the technologists are crying wolf.

But the wolves are real. Jim Lovell and the crew of Apollo 8 were unlucky, but their bad fortune was good for the Apollo programme. If they had not shown that there really was a wolf in the cockpit, then the problems on Apollo 11 might not have been anticipated - and the first Moon landing would have ended very differently.

As technologists, it is our job to point out the wolves that other people can't see: the errors and vulnerabilities in the code; the inevitability of hardware failure; and the consequences of disasters. To help our business sponsors see the wolves, we need to speak two languages.

First, we must speak the objective, quantitative language of risk management. Such language enables us to take rational decisions and make sensible compromises. It enables us to see that risk is a resource, just like time, money and people - and to figure out how to balance each of them.
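That quantitative language can be as simple as comparing the expected cost of a risk against the cost of mitigating it. Here is a minimal sketch using entirely hypothetical figures (the scenario, numbers and function names are illustrative, not from any real system):

```python
def annualised_loss_expectancy(incidents_per_year: float, cost_per_incident: float) -> float:
    """Expected yearly loss from a risk: frequency multiplied by impact."""
    return incidents_per_year * cost_per_incident


# Hypothetical figures for an unpatched system left to drift:
ale_without_patching = annualised_loss_expectancy(0.5, 200_000)   # expect 100,000/yr
ale_with_patching = annualised_loss_expectancy(0.05, 200_000)     # expect 10,000/yr
patching_cost_per_year = 40_000

# The mitigation is worth funding if the risk reduction exceeds its cost.
net_benefit = (ale_without_patching - ale_with_patching) - patching_cost_per_year
print(net_benefit)  # 50000.0: patching more than pays for itself
```

Framed this way, "apply the patches" stops being a vague technical plea and becomes a trade-off a sponsor can weigh against any other use of the budget.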

Second, though, we must speak the language of stories. Numbers are powerful, but systems failures have real impacts on real lives. Explaining these impacts helps sponsors understand the consequences of their choices. There are many stories to tell - the story of how we once went to the Moon, and what we learnt on the way, is just one of them.
