- AI
- ambiguity
- APIs
- architecture
- augmented reality
- books
- bureaucracy
- career
- change
- Christmas
- cloud
- collaboration
- communication
- complexity
- computer history
- corporate life
- data
- decisions
- delivery
- DevOps
- end user tools
- ethics
- failure
- fear
- fundamentals
- gaming
- government
- Halloween
- history
- humans
- hype
- identity
- infrastructure
- innovation
- language
- leadership
- learning
- legacy
- management
- measurement
- mental health
- money
- networking
- New Year
- operations
- philosophy
- physics
- platforms
- prediction
- process
- procurement
- programming
- quantum
- reliability
- resilience
- risk
- robotics
- science
- science fiction
- security
- shadow IT
- space
- standards
- strategy
- talent
- teams
- technical debt
- technology advocacy
- testing
- thinking
- transformation
- TV
- virtues
- vision
- writing
Test environments are from Mars; production environments are from Venus
If a crewed mission makes it to Mars, it will have lots of problems to overcome. The atmosphere is thin, and mostly made of carbon dioxide. The average temperature is -60 degrees Celsius. And there is no magnetic field to shield inhabitants from cosmic radiation. It’s not surprising that, so far, the planet is occupied by robots rather than people.
What about our other planetary neighbour, Venus? It’s slightly easier to get to, has more atmosphere, and is quite a bit warmer. Unfortunately, it’s warmer by more than 500 degrees, with an average surface temperature of about 460 degrees Celsius. And that atmosphere is a bit too thick, imposing a crushing pressure 90 times that of Earth. It’s mostly carbon dioxide too, but punctuated by clouds of sulphuric acid. And it doesn’t have a magnetic field – or a robot population. The only probes which have entered the Venusian atmosphere have been swiftly crushed and cooked.
Infrastructure can be opaque: your cloud should be see-through
Imagine this situation. You have just been alerted to a critical security vulnerability in a piece of systems software embedded in thousands of physical and virtual servers across your on-premises technology estate. Your software provider has issued you with a patch, and you need to apply it as quickly as possible. Your business and technology stakeholders understand the gravity of the situation, and are willing to accept the disruption necessary for an emergency patching programme. It’s a race against time, between your ability to patch and the bad actors coming after your systems.
You have many problems in winning this race. Some of your systems don’t have automated testing in place, and you suspect that the patch will break at least a few of them. Many of your systems aren’t properly stateless, and restarting servers will disrupt their operations. Some of your really old systems don’t like being restarted at all, and will need careful attention.
Cloud operations should be visible and invisible at the same time
Do you need a Cloud Operations team at all?
I’ve heard many companies ask this question, and believe that it is a reasonable question to ask. After all, for many people, the attraction of Cloud is to avoid the frustration they associate with central teams. Development teams who were dependent on overworked, under-resourced and under-automated infrastructure teams can now take direct accountability for the infrastructure themselves. Project managers who used to have to put long lead times for procurement, commissioning and configuration of infrastructure on their critical path can now drop those tasks from the plan. If everything is software now, and the DevOps team takes accountability for operations, why do we need another operations team?
Cloud leadership: the Operator
Why does it always happen at three o’clock in the morning? In all the conversations I’ve had about resilience, recovery, incident management, problem management and, latterly, Site Reliability Engineering, we always seem to end up asking the question, ‘What do we do when everything goes wrong at three o’clock in the morning?’
In the days of always-available mobile customer experience, it might seem more relevant to worry about what goes wrong in the middle of the day, when all of your customers are awake. But I think that the iconic 3:00 am failure still serves two purposes.
The first purpose is procedural and technical: it makes us think about how we work when the (virtual) offices are closed and nobody is around. How do we detect failures? How do we respond to them? How do we contact the people who aren’t awake? How do we reduce our dependence on those people?