Cloud leadership: the Operator
Why does it always happen at three o’clock in the morning? In all the conversations I’ve had about resilience, recovery, incident management, problem management and, latterly, Site Reliability Engineering, we always seem to end up asking the question, ‘What do we do when everything goes wrong at three o’clock in the morning?’
In the days of always-available mobile customer experience, it might seem more relevant to worry about what goes wrong in the middle of the day, when all of your customers are awake. But I think that the iconic 3:00 am failure still serves two purposes.
The first purpose is procedural and technical: it makes us think about how we work when the (virtual) offices are closed and nobody is around. How do we detect failures? How do we respond to them? How do we contact the people who aren’t awake? How do we reduce our dependence on those people?
Second, it has an emotional purpose: it reminds us that, however global our company is, however distributed our operations are, there will be some lonely stretches of the night when the enterprise depends on a small number of people. Those people are not always thanked or recognised - but we should be grateful to them.
The person who worries most about the 3:00 am failure holds one of the seven key leadership roles for Cloud transformation: the Operator. The Operator has one of the most critical and difficult jobs in the enterprise before Cloud transformation, and one of the most critical and difficult jobs during transformation.
Many Cloud transformation programmes don’t define the role of the Operator well at the beginning. Sometimes they imagine that, between their Cloud provider and their DevOps teams, there is no more work for the Operator and their team to do. Sometimes they imagine the opposite: that the job of the Operator is to take all of those tools, practices and processes which took so much effort to implement on-premise, and replicate them on Cloud.
It doesn’t take much imagination to recognise that both of these are mistakes.
It quickly becomes apparent that, whatever services the Cloud provider offers, and however autonomous DevOps teams are, there is genuine operations and engineering work to be done at the platform level to present and manage Cloud for the enterprise. For example, enterprise-level policies and controls cannot sensibly be defined and implemented separately by every team; if every team attempts to do so, the teams will duplicate one another’s work.
It also quickly becomes apparent that many of the tools, processes and practices that made sense on-premise do not make sense on Cloud. For example, the work of discovery agents and probes has little place when everything is software-defined and discoverable through APIs.
It does take more imagination, though, to recognise the bigger role of the Operator. The Operator must do more than create a coherent technical operations environment - they must also create a coherent and compelling proposition for their people. I believe that the role of expert operators on Cloud can be exciting and fulfilling, especially when combined with continuous learning. However, I also recognise that operations teams often have the most difficult journey and experience the most disruption of all teams affected by Cloud transformation. It is an essential role of the Operator to help operations teams traverse this journey, to define the meaningful work which they will continue to do, and to invest in their skills and their careers. The Operator has the opportunity to create a Cloud Centre of Excellence: a powerful enabling function, which is conspicuous and apparent when it needs to provide help, but becomes invisible whenever possible, allowing teams to become self-sufficient.
And, after all that, the Operator still needs to figure out what happens when everything goes wrong at three o’clock in the morning.