Cloud operations should be visible and invisible at the same time
Do you need a Cloud Operations team at all?
I’ve heard many companies ask this question, and I believe it is a reasonable one. After all, for many people, the attraction of Cloud is to avoid the frustration they associate with central teams. Development teams who were dependent on overworked, under-resourced and under-automated infrastructure teams can now take direct accountability for the infrastructure themselves. Project managers who used to have to put long lead times for procurement, commissioning and configuration of infrastructure on their critical path can now drop those tasks from the plan. If everything is software now, and the DevOps team takes accountability for operations, why do we need another operations team?
Yet when companies attempt to adopt Cloud at scale without creating a Cloud Operations team, they usually struggle. Cloud platforms are highly configurable, software-defined constructs managed through APIs. Making good choices about how to use these platforms requires expertise, experience and work and, without a Cloud Operations team, this work is done over and over again by each team. Furthermore, like all technology platforms, Cloud platforms suffer incidents, and the boundary of an incident does not always conform neatly to the boundary of the product or application owned by a single DevOps team.
I believe that Cloud Operations teams are needed, but also believe that companies need to think hard when building such teams, to avoid creating bottlenecks and queues, and recreating the wait times and frustration they are used to. Here are four thoughts on attributes for such a team.
(A quick note on terminology: the team that operates Cloud on behalf of the rest of the company is new enough that the industry has not settled on a standard name. Sometimes it is called something like ‘the Cloud Centre of Excellence’, sometimes it is just called ‘the Cloud team’. I prefer ‘Cloud Operations’ because I think it describes what the team actually does.)
Prominent but Invisible
When everything is running smoothly, the Cloud Operations team should be invisible. DevOps teams should interact directly with the APIs of their chosen Cloud platform, with no human mediation. There should be no tickets, no queues and no waiting.
However, it takes a while to achieve this state of smooth running. In the early days of Cloud adoption, the company will be learning. They will have to make choices about policies, about risks, about standards and tools. They will have to balance autonomy and empowerment. They will need to acquire expertise and skills. And the Cloud Operations team should be prominent in all this work, visibly owning it and leading it to completion.
Furthermore, most companies, especially large companies, don’t achieve this state of smooth running all at once, across the entire enterprise. They will typically have teams that are confident, competent and capable on Cloud, running alongside teams who have yet to make their first Cloud deployment. (As William Gibson supposedly said, ‘The future is already here. It’s just not evenly distributed yet.’) These less experienced teams need help, and help is one of the things that the Cloud Operations team can provide. They can help people find the right training, explain core Cloud concepts to them, and educate them in the constraints and policies that apply in their particular industry. For such teams, the Cloud Operations team should be a prominent coach - until the coaching is no longer needed.
Product and Customer Focused
In order to achieve this balance between prominence and invisibility, between being helpful by showing up and being helpful by fading into the background, the Cloud Operations team needs to operate as a product team with customers.
This may sound obvious: the shift from project work to product ownership is well established in the world of software development. However, it is typically less well established in operations teams within large organisations: under budget and resource pressures, such teams usually optimise for internal efficiency rather than customer experience. And optimising for internal efficiency usually results in queues, wait times, bottlenecks and constrained choices.
From the outset, the Cloud Operations team should be clear on who their customers are: the DevOps and Data Science teams who directly use their platform. They should focus on the experience, advocacy and productivity of these customers. They will experience budget and resource constraints, just like everyone else, but if they are providing valued services to these customers, they can also be confident of advocacy and support.
Empowered and Autonomous
Just like any other product team, the Cloud Operations team should be empowered to provide their product in the way that works best for their customers. They should contain all the skills necessary to operate autonomously, including specialist capabilities such as security, risk management and financial management and planning.
Again, this might sound obvious, but many companies’ default approach is to treat the Cloud team as Just Another Infrastructure Team, and knit them into the processes, policies and procedures which govern other elements of their infrastructure - while telling them that to get specialist help, they need to take a ticket and join the queue like everybody else.
The Cloud Operations team shouldn’t disregard existing processes, policies and procedures entirely: as I wrote last week, these can all be valuable sources of institutional wisdom. But if you don’t give a product team the freedom to act like a product team, then it should be no surprise when the customers are unhappy.
Data Driven
One of the features of Cloud whose impact is often underestimated is the degree to which it makes infrastructure transparent. For many years, we have become used to enterprise infrastructure being opaque: we have had to build and deploy probes, inventory management systems and cataloguing tools just to find out what we have sitting in our data centres. And even then, we are never entirely sure that the data is complete.
We should never forget that Cloud platforms are explicitly declared, software-defined platforms managed through APIs: if we want to know what we have, we only have to ask.
This means that Cloud Operations teams can have visibility across their entire Cloud implementation in a way that traditional infrastructure operations teams have rarely been able to achieve. And they have a responsibility to use this data to help everyone they serve: to optimise costs and resources, to help plan capacity, to manage performance, to gain intelligence about exposures and vulnerabilities - and whether they are being addressed effectively.
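To make this a little more concrete, here is a minimal sketch of the kind of question a Cloud Operations team might ask of its inventory data. It assumes the inventory has already been fetched from the platform's APIs and flattened into simple records; the field names (`team`, `monthly_cost`, `encrypted`), the sample values and the two helper functions are all hypothetical, invented purely for illustration, and do not correspond to any particular provider's API.

```python
from collections import defaultdict

# Hypothetical inventory records, standing in for whatever a Cloud
# platform's APIs would return. All fields and values are illustrative.
INVENTORY = [
    {"id": "vm-001", "team": "payments", "type": "vm",
     "monthly_cost": 120.0, "encrypted": True},
    {"id": "db-001", "team": "payments", "type": "database",
     "monthly_cost": 300.0, "encrypted": False},
    {"id": "vm-002", "team": "analytics", "type": "vm",
     "monthly_cost": 80.0, "encrypted": True},
    {"id": "bucket-001", "team": "analytics", "type": "storage",
     "monthly_cost": 15.5, "encrypted": False},
]


def cost_by_team(inventory):
    """Sum the monthly cost of all resources owned by each team."""
    totals = defaultdict(float)
    for resource in inventory:
        totals[resource["team"]] += resource["monthly_cost"]
    return dict(totals)


def unencrypted_resources(inventory):
    """Return the sorted ids of resources not flagged as encrypted."""
    return sorted(r["id"] for r in inventory if not r["encrypted"])


if __name__ == "__main__":
    print(cost_by_team(INVENTORY))
    print(unencrypted_resources(INVENTORY))
```

In practice the records would come from the platform itself rather than a hard-coded list, but the point stands: once the data is there for the asking, questions about cost and exposure become a few lines of code rather than a discovery project.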
I believe that the purpose, placement, organisation and leadership of the Cloud Operations team are some of the most important technology choices that a company can make when moving to Cloud. It’s easy to get these choices wrong - but we stand a better chance of getting them right if we seek a team which is prominent but invisible, customer and product focused, empowered and autonomous, and data driven.