Many non-profit organisations collect and hold large quantities of data from multiple sources about people, services, products, supporting information, events, items to purchase and more. This data takes many different formats, both structured and unstructured. Typically, it is held across multiple, disparate environments and is subject to limited governance.
Identifying and managing what data assets the organisation creates, collects, holds, and serves becomes a somewhat overwhelming activity, especially when staff time, access and opportunity are limited. Getting a clear overview of the organisation-wide data asset becomes an overhead and slips down the priority list, especially as there is commonly no single person or department tasked with ownership of the data asset.
In recent years cloud providers have started to offer solutions that make it easier to ingest, store, and centralise data across an organisation, and you will increasingly hear terms bandied about such as data lakes, data pools, data puddles and data warehouses.
Understanding what these data-related terms mean, and what the implications are for your organisation, can be very difficult, especially as developments move forward at a rapid pace.
This blog provides some basic definitions of these terms, and what they are not.
A data lake is a virtual, centralised repository that stores data from across an organisation, regardless of data format, structure, or type. A data lake sends and receives data from any database, data warehouse, or API.
This virtualised, centralised storage allows an organisation to store multiple data types (including streaming data and images/videos), and brings typical cloud benefits such as removing the need to purchase and manage on-site storage and backup, along with granular data security controls. This can all be achieved without the organisation needing to move away from its existing legacy systems.
Data lakes also enable tools such as data discovery and cognitive search, predictive analytics, Artificial Intelligence (AI) and Machine Learning (ML), because the algorithms can easily access all the data, uncovering new relationships, patterns and correlations, and aiding understanding of customer behaviours, journeys and engagement.
However, Extract, Transform, Load (ETL) operations are still needed to put the data into specific formats that AI or ML tools can digest, although this applies whether you are analysing a data lake, an SQL database or a data warehouse.
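As a rough illustration of what such an ETL step involves (the file contents and field names here are entirely hypothetical), a minimal version in Python might normalise messy donation records into a clean, consistent shape before any analysis runs:

```python
import csv
import io

# Extract: a hypothetical raw export with inconsistent formatting.
raw = io.StringIO(
    "name,donation,date\n"
    "Alice, £25.00 ,2023-01-05\n"
    "Bob,£10,2023-02-17\n"
)
rows = list(csv.DictReader(raw))

# Transform: normalise the donation field into a plain number.
def clean_amount(value):
    return float(value.strip().lstrip("£"))

cleaned = [
    {"name": r["name"].strip(),
     "donation": clean_amount(r["donation"]),
     "date": r["date"]}
    for r in rows
]

# Load: here we simply collect the cleaned rows; in practice this
# step would write them to the warehouse or store the model reads.
print(cleaned[0])  # {'name': 'Alice', 'donation': 25.0, 'date': '2023-01-05'}
```

Real ETL pipelines use dedicated tooling, but the extract, transform and load stages follow the same pattern at any scale.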
Data lakes also allow organisations to index the centrally held data and search it through a common interface, such as a Google-style search engine, providing a user-friendly search experience.
Data can be protected through the application of access policies that can be set down to individual data table or cell level, with the ability to also import and inherit the original security controls from the source data.
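To make the table- and cell-level idea concrete, a policy of this kind can be thought of as a mapping from roles to the cells they may see. The sketch below is purely illustrative (the roles and field names are made up, and this is not any particular cloud provider's API):

```python
# Hypothetical role-based column policy for a "donors" table.
POLICY = {
    "fundraising": {"name", "donation", "date"},
    "analytics":   {"donation", "date"},   # no personal names
}

def filter_row(role, row):
    """Return only the cells this role is permitted to see."""
    allowed = POLICY.get(role, set())
    return {k: v for k, v in row.items() if k in allowed}

row = {"name": "Alice", "donation": 25.0, "date": "2023-01-05"}
print(filter_row("analytics", row))  # {'donation': 25.0, 'date': '2023-01-05'}
```

Cloud platforms implement this kind of filtering declaratively, and the "inherit from source" capability mentioned above means those mappings can be imported rather than rebuilt by hand.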
A data warehouse is a system that aggregates corporate information and data derived from operational systems and external data sources in a central repository. It is designed to allow users to run queries and analyses on data derived from transactional sources and is used to generate analytics and insight.
Data warehouses differ from data lakes in that the warehouse requires data to be pre-categorised and tagged (the metadata) before storage and extraction, but the two often exist together.
Data warehouses can be real-time, historical or off-line, or even exist as distinct subsets of the wider data warehouse used for a specific purpose.
The benefit of data lakes is the ability to store large amounts of data, including lots of data that used to be lost. But the problem is that when running reporting and analysis against the data lake, performance can be very slow: there is simply too much data to process.
The queries that users need to run in order to perform their analytics might take several minutes, or hours. Or a query may run for an hour only to return no results, because the sheer scale of the data was too much for the tool to deal with.
An answer to this challenge is the use of data puddles.
Data puddles are usually built for a small, focused team or a specialised use case. Individual teams or departments want to evaluate collections of data they own themselves, and take matters into their own hands. The reasons vary: often IT can't or won't do it, or there is simply no IT solution available.
These puddles are frequently built in the cloud by business units using shadow IT. The implementation can be done very professionally, e.g. by a regional IT team, or via easy-to-use SaaS apps by non-IT departments. An example is a data puddle based on Google Data Studio with data sources such as smaller databases, Google Sheets and CSV files.
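For a sense of scale, a team's puddle is often little more than a script joining a couple of departmental exports into a quick report. A minimal sketch (the files and fields here are hypothetical):

```python
import csv
import io

# Two hypothetical departmental CSV exports.
events = io.StringIO("event_id,title\nE1,Spring Gala\nE2,Fun Run\n")
signups = io.StringIO("event_id,attendee\nE1,Alice\nE1,Bob\nE2,Carol\n")

# Look up event titles by id.
titles = {r["event_id"]: r["title"] for r in csv.DictReader(events)}

# Count sign-ups per event title: the whole "report".
counts = {}
for r in csv.DictReader(signups):
    title = titles[r["event_id"]]
    counts[title] = counts.get(title, 0) + 1

print(counts)  # {'Spring Gala': 2, 'Fun Run': 1}
```

This illustrates both the appeal (a useful answer in a dozen lines, no IT involvement) and the risk: nothing here is governed, documented or maintained.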
The disadvantages are, of course, duplicated effort and cost. Data silos also arise, meaning the organisation's data can never be evaluated as a whole. And even if such a data puddle is assimilated by the wider organisation and becomes a standard tool, if it is not professionally maintained by IT, errors and failures can occur, which are particularly painful if the puddle is widely used, perhaps even for important reports or products.
A data pool is a centralized repository of data where trading partners (e.g., retailers, distributors or suppliers) can obtain, maintain and exchange information about products in a standard format. Suppliers can, for instance, upload data to a data pool that cooperating retailers can then receive through their data pool.
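The "standard format" is the key point: every trading partner agrees on the same fields, so a record can be validated before it enters the pool. A toy validation sketch (the field names are illustrative, not drawn from any real product-data standard):

```python
# Fields every trading partner agrees to supply (illustrative only).
REQUIRED_FIELDS = {"gtin", "title", "brand", "unit_price"}

def validate_product(record):
    """Reject records missing any of the agreed fields."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record

book = {"gtin": "9780000000001", "title": "Example Book",
        "brand": "Example Press", "unit_price": 9.99}
validate_product(book)  # passes; an incomplete record would raise
```

Because every partner's systems expect the same fields, a supplier can upload once and every cooperating retailer can consume the record without bespoke translation.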
A typical example might be buying books from Amazon. You see the same book, with general information about it and the author, reviews and so on, but the purchase may be made from any one of a number of collaborating retailers.
The world of data storage, aggregation, query and reporting is a rapidly evolving one, with new and seductively easy-seeming toolsets being developed by major players and made available through familiar user interfaces such as browser search engines.
The key for non-profit organisations is to look at the data sources in play, their users and their requirements, and then weigh these against the cost of acquisition, governance and day-to-day running before embarking on purchasing or developing solutions.
A data lake or puddle is a doable, potentially very valuable addition to an organisation's technology stack, but there is a lot of learning and understanding required before embarking on this journey.