Infrastructure can be opaque: your cloud should be see-through

3 Mar

Imagine this situation. You have just been alerted to a critical security vulnerability in a piece of systems software embedded in thousands of physical and virtual servers across your on-premise technology estate. Your software provider has issued you with a patch, and you need to apply it as quickly as possible. Your business and technology stakeholders understand the gravity of the situation, and are willing to accept the disruption necessary for an emergency patching programme. It’s a race against time, between your ability to patch and the bad actors coming after your systems.

You have many problems in winning this race. Some of your systems don’t have automated testing in place, and you suspect that the patch will break at least a few of them. Many of your systems aren’t properly stateless, and restarting servers will disrupt their operations. Some of your really old systems don’t like being restarted at all, and will need careful attention.

But the biggest problem - and potentially the most embarrassing one - is that you don’t know where everything is or what state it is in. Your business stakeholders imagine that your technology estate is fully inventoried and catalogued, and that, as you roll the patch out, somewhere in your command centre, a row of lights will turn steadily from red to green.

In practice, every time you want to make a global upgrade to your estate, it’s a process of discovery. You have a patchwork of probes and agents deployed across your network, and you do your best to knit the information they provide into a coherent picture. Perhaps you have a service management and inventory system into which all this data is loaded. Perhaps you rely on a bunch of spreadsheets. (In practice, it’s probably a bit of both.)

But you know that, as you proceed with the rollout, your imaginary progress bar is going to jump up and down as you discover discrepancies in your data. That set of servers was actually decommissioned six months ago, but is still showing in the inventory. This local machine room sits behind a local firewall, and was invisible until you opened the port that the agent needed to communicate. Those servers are so out of date that they’re running an old version of the agent, and you need to patch the agent before you can patch the system.

You didn’t create this situation, and you certainly didn’t design it. It is a product of organic growth of your technology estate over time, of a steady shift from a few dozen technical assets that changed rarely, to a few hundred that changed occasionally, to thousands or tens of thousands that change continuously. It is a result of software upgrade cycles drifting from yearly major version upgrades, to regular patches, to patch Tuesdays, to patch every day.

It’s tempting to say that everything is better on Cloud, but that’s not true. It’s still possible to achieve an unruly and sprawling estate on Cloud. But there is one big, fundamental difference in nature which makes this kind of problem much easier to handle on Cloud, and that is the status of Cloud platforms as platforms.

On-premise estates are very rarely platforms: they are collections of infrastructure which have been acquired and integrated over time, using the best efforts of teams who are often over-worked and under-resourced, and who receive limited investment in systems and service management. Understanding this infrastructure is a constant process of discovery, even when you have deployed multiple agents and other tools. At best it is like exploring a dark room with a combination of strobe lights and echolocation: you can get flashes of clarity, but the picture is indistinct and quickly out of date.

By contrast, Cloud platforms are truly platforms. They have been designed from the ground up as integrated architectures which are software-defined and explicitly declared, and which can be managed and interrogated through APIs. In an on-premise estate, the data collated from probes and agents and loaded into an inventory system is a representation of reality; on a Cloud platform, the data returned from APIs is reality. On Cloud, your APIs are your senses: it is like turning on the light in the dark room (and being able to see in frequencies beyond the visible spectrum).

There is a lot that the power of this data makes possible. To start with, it can make sure that you know what you have in your estate, which supports exercises such as emergency patching. It won’t do the patching for you (although some Cloud services can help with this problem, for example by providing new, tested OS images quickly and, for PaaS services, by abstracting the underlying OS altogether), and it won’t fix applications that lack test suites, that have strong affinity to server instances, or that don’t like being restarted. But at least you know what you have.
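As a minimal sketch of what "knowing what you have" might look like during an emergency patch, the Python below computes rollout progress from an inventory. Everything in it is an assumption made for illustration: `list_instances()` is a hypothetical stand-in for whatever inventory API your Cloud platform exposes, and the record fields and version numbers are invented rather than taken from any real provider.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instance:
    """Illustrative inventory record; field names are assumptions."""
    id: str
    image_version: Tuple[int, int, int]
    state: str  # e.g. "running" or "stopped"


def list_instances() -> List[Instance]:
    """Hypothetical stand-in for a call to a platform inventory API."""
    return [
        Instance("web-1", (1, 4, 2), "running"),
        Instance("web-2", (1, 4, 3), "running"),
        Instance("batch-1", (1, 3, 9), "stopped"),
    ]


def patch_progress(instances: List[Instance],
                   patched_version: Tuple[int, int, int]):
    """Return (fraction patched, instances still below the patched version)."""
    unpatched = [i for i in instances if i.image_version < patched_version]
    done = len(instances) - len(unpatched)
    # An empty inventory is treated as fully patched.
    fraction = done / len(instances) if instances else 1.0
    return fraction, unpatched


fraction, unpatched = patch_progress(list_instances(), (1, 4, 3))
print(f"{fraction:.0%} patched; still to do: {[i.id for i in unpatched]}")
```

The point of the sketch is that the progress figure is computed directly from what the platform reports, rather than from a separately maintained inventory that may have drifted.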

And that’s just the beginning: knowing what you have is useful, but is the most basic form of knowledge. In my last article, I wrote that the Cloud Operations team should be prominent and invisible at the same time. One of the ways to achieve this paradoxical goal is to analyse the data which is available from Cloud platform APIs and provide insights to the owners of assets. Examples of how they can help include:

Cost management: on-premise, connecting costs to resources can be hard, and recharge mechanisms are often regarded as opaque and unfair. On Cloud platforms, it is possible to tie resource usage directly to costs (at least the Cloud platform consumption cost), and give end users the power to change this cost. Cloud Operations teams can provide instrumentation which helps their colleagues understand and optimise their cost.

Currency management: on-premise, as described above, it can be difficult to know what you have, let alone what state it is in. On Cloud platforms, you can understand exactly where you are - and how your level of currency conforms to your policies on risk and vulnerability management. You can even go so far as automatically shutting down assets that present an unacceptable risk to your organisation.

Reliability management: on-premise, incident management is usually a response to failure, and problem management is an attempt to understand the cause of that failure - if you get the time to do proper problem management at all. Data is typically analysed retrospectively and reactively. The Cloud Operations team has an opportunity to build a picture of reliability and failure modes across all their Cloud assets, and to work with teams to improve the reliability of the entire estate.
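To make the cost and currency examples above a little more concrete, here is a rough Python sketch of the kind of instrumentation a Cloud Operations team might build. The asset records, field names (`owner`, `daily_cost`, `last_patched`), dates, and the 30-day threshold are all hypothetical; in practice they would come from your platform's APIs and your own risk policy.

```python
from collections import defaultdict
from datetime import date

# Hypothetical asset records, shaped as a platform API might return them.
ASSETS = [
    {"id": "vm-a", "owner": "payments", "daily_cost": 12.5,
     "last_patched": date(2024, 2, 20)},
    {"id": "vm-b", "owner": "payments", "daily_cost": 7.25,
     "last_patched": date(2024, 1, 5)},
    {"id": "vm-c", "owner": "search", "daily_cost": 3.0,
     "last_patched": date(2023, 11, 1)},
]


def cost_by_owner(assets):
    """Cost management: total daily consumption cost per owning team."""
    totals = defaultdict(float)
    for asset in assets:
        totals[asset["owner"]] += asset["daily_cost"]
    return dict(totals)


def out_of_policy(assets, today, max_age_days):
    """Currency management: ids of assets last patched too long ago."""
    return [
        asset["id"]
        for asset in assets
        if (today - asset["last_patched"]).days > max_age_days
    ]


print(cost_by_owner(ASSETS))
print(out_of_policy(ASSETS, today=date(2024, 3, 3), max_age_days=30))
```

Reporting the flagged assets to their owners is the gentle version; the same list could feed an automated shutdown if your policy goes that far.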

And these examples are only a start. As with any data-rich system, the first step is simply understanding what data you have and what it means. But we live in a world where, if you have data, you put it to work. There is much more that can be done with trustworthy, current data accessed through the APIs of a software-defined, explicitly declared platform. What are you going to do with it?

cloud, infrastructure, risk, security, operations

David Knott
