From the data lake to the front page of data
How many times have you tried to build a data lake? Or, to put it another way, how many times have you tried to solve the analytics problem? I have to confess that I have tried more than once - and had varying success, until I learnt to think about the problem differently.
The problem is familiar: our enterprise has data which we believe to have value; that data is sitting in systems of record which are hard to access; those systems sit on infrastructure which is optimised for transactions rather than analytics.
We also have solutions which are familiar, even if they have taken different forms over the years: we pull the data out of our systems of record; we organise it into a form which is easier to analyse; and we place it on infrastructure which is optimised for analytics.
But, just as the problem and the solutions are familiar, so are the challenges in building and running those solutions. We have challenges in getting data into our analytics platform. We find it hard to make sense of the data once it is there. We have to deal with the complexities of implementing security, applying permissions and controlling access. If your experience is anything like mine, you will have found that it can be very hard to manage capacity and performance. And, if we overcome all of these challenges, and manage to build a stable, performant platform which contains all of the data we need, then we can and should expect that our data consumers will ask us to add ever more capabilities to the platform.
As you might expect, I think that many of the answers can be found on Cloud. Moving to a hyperscale Cloud data warehouse such as Google’s BigQuery won’t solve every problem of data ingestion and sense making - but it will solve problems of capacity and scale, and, as a product that is continuously being improved and developed, it can help meet demands for new capability.
However, I think that there is more to the move of analytics to Cloud platforms than solving capacity and infrastructure problems: I think that this move can shift our notion of what a data lake is. The problems of addressing analytics requirements on-premises can be so great that, even if we address them successfully, they become all-absorbing. Even the term ‘data lake’ makes us focus on where we put the data (in the lake), not what we do with the data or how we consume it. It favours the receptacle over the content.
In my experience, once we start moving analytics to the Cloud, we also start shifting our perspective on the problem. I believe that we can start to see ourselves not as creating a ‘data lake’ but as creating a ‘front page for data’. What I mean by this is a vision of data in which anyone in the enterprise can go to one page, see the complete universe of datasets which exist, see which are available to them (and request access to those which aren’t) and frame and execute queries without having to worry about the underlying infrastructure. And this isn’t just a vision - it is entirely achievable by using existing Cloud analytics platforms and exists in some enterprises today.
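To make the idea a little more concrete, here is a minimal sketch in Python of the behaviour I have in mind. It is an illustration only: the names Dataset, FrontPage, listing, request_access and can_query are hypothetical, the sample datasets and user are invented, and the whole thing is an in-memory stand-in rather than any Cloud provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str
    readers: set = field(default_factory=set)  # users who already have access
    pending: set = field(default_factory=set)  # users who have asked for access


class FrontPage:
    """Hypothetical in-memory catalogue; a real one would sit on a Cloud platform."""

    def __init__(self, datasets):
        self._datasets = {d.name: d for d in datasets}

    def listing(self, user):
        """Return every dataset, flagged by whether this user can query it."""
        return [
            (d.name, "available" if user in d.readers else "request access")
            for d in sorted(self._datasets.values(), key=lambda d: d.name)
        ]

    def request_access(self, user, name):
        """Record a request; approving it is left to the data owner."""
        dataset = self._datasets[name]
        if user not in dataset.readers:
            dataset.pending.add(user)

    def can_query(self, user, name):
        """True only if the user has been granted access to the dataset."""
        return user in self._datasets[name].readers


# Invented sample data, for illustration only.
hr = Dataset("hr_headcount", "Monthly headcount snapshots")
sales = Dataset("sales_orders", "Orders from the system of record", readers={"asha"})
page = FrontPage([sales, hr])

for name, status in page.listing("asha"):
    print(f"{name}: {status}")

page.request_access("asha", "hr_headcount")
```

The point of the sketch is that the user sees the whole universe of datasets, including the ones they cannot yet query, and nothing in it mentions storage, clusters or capacity.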
Finally, I believe that this experience of shifting perspective - from the ‘data lake’ to ‘the front page of data’ - contains a lesson for those of us working in technology strategy, architecture and design. It reminds us that the terms we use can make us focus on the technology rather than the people who use it - and that if we stand in a different place we can see things differently.