Analytics has become essential to driving intelligent business. However, with the addition of a multitude of unstructured data – think email, videos, photos, audio files, presentations, webpages – businesses have had to find a new way to pool, access and analyse data to extract the insights they need. Data lakes provide an answer.
As data lakes become part of industry jargon, clients are asking pointed questions:
- What exactly is a data lake?
- How does it differ from a traditional data warehouse?
- How is a data lake organised and managed?
- Do we need to upgrade our analytics platform to leverage a data lake?
- How can we get more from our investment in creating a data lake?
Let’s start with a definition: a data lake is essentially a massive collection of all the data a company collects about its customers, operations, transactions and more. If a company has many individual pools of data, you’ll need to hop around in each pool to find the fish you seek. Not so with a data lake. Information flows into a data lake from these pools so you can find the fish you want in one place, and spot patterns and trends among all the fish.
Does the data lake beat data warehousing?
A data warehouse is the traditional way of gathering and storing a large collection of data. It is highly organised and structured. In a data warehouse, data doesn’t get stored in its original form; it’s ‘transformed’ into an exact, predefined data structure before it’s loaded into the warehouse.
This highly structured approach means a warehouse can answer a very specific set of questions very quickly. Unfortunately, that same structure can make it practically impossible to answer other questions based on the data without significant additional development. A data lake, on the other hand, can be applied to a large number and wide variety of problems – precisely because it lacks that rigid structure and organisation. The absence of a predefined schema gives a data lake more versatility and flexibility.
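To make the contrast concrete, here is a minimal Python sketch (the table, file and field names are invented for illustration): a warehouse load only accepts records that fit a predefined table, while a lake keeps each record exactly as it arrived.

```python
import json
import sqlite3

# Schema-on-write (warehouse style): the structure must be defined
# before any data is loaded, and records must be forced into it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id TEXT, amount REAL, region TEXT)")
conn.execute("INSERT INTO sales VALUES (?, ?, ?)", ("A-100", 42.50, "EMEA"))

# Lake style: keep the raw record as-is; fields the warehouse table
# cannot hold (here, a clickstream) are no obstacle to loading.
raw_event = {
    "order_id": "A-100",
    "amount": 42.50,
    "region": "EMEA",
    "clickstream": ["home", "product", "checkout"],
}
with open("events.jsonl", "a") as lake:
    lake.write(json.dumps(raw_event) + "\n")
```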
However, rather than completely replace data warehouses, data lakes are often designed to complement them, taking on some of the data processing work of a data warehouse or hosting new analytics applications.
How do you organise and manage data in data lakes?
Data lakes use a flat, schema-less organisation structure. Data is left in its natural form, creating a collection of data in many different formats, with unique identifiers and metadata tags that let users more easily hunt down the data they need. The open-ended nature of a data lake also allows analysts to discover answers in new ways.
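As a rough illustration (the catalogue structure, paths and tags below are all hypothetical), this Python sketch shows how unique identifiers and metadata tags make a flat, schema-less store searchable:

```python
import uuid

# A minimal sketch of flat, schema-less storage: every object, whatever
# its format, gets a unique identifier plus metadata tags for retrieval.
catalog = {}

def add_to_lake(path, fmt, tags):
    object_id = str(uuid.uuid4())  # unique identifier for the object
    catalog[object_id] = {"path": path, "format": fmt, "tags": set(tags)}
    return object_id

add_to_lake("/lake/raw/emails/2024-06.mbox", "mbox", ["email", "support"])
add_to_lake("/lake/raw/web/clicks.jsonl", "json", ["clickstream", "web"])
add_to_lake("/lake/raw/calls/0451.wav", "audio", ["support", "voice"])

# Hunt down data by tag rather than by a fixed table structure.
support_data = [m for m in catalog.values() if "support" in m["tags"]]
```

The point is that retrieval depends on the tags, not on where the data sits or what format it happens to be in.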
Hadoop, an open source technology, is the predominant architecture for data lakes. It offers some distinct advantages:
- Its schema-less structure and schema-on-read capabilities mean a Hadoop data lake can hold a diverse mix of structured, unstructured and semi-structured information, providing the data flexibility needed (see the sketch after this list).
- The ability to use commodity hardware gives a Hadoop-based data lake a tremendous economic advantage over other approaches.
- Putting a data lake on Hadoop provides a central location from which all the data and associated metadata can be managed, lowering the cost of administration.
- Organisations do, however, need to explore data governance platforms that integrate with Hadoop to provide enterprise-grade management.
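To illustrate the schema-on-read point, here is a hedged PySpark sketch. It assumes a running Hadoop/Spark environment, and the HDFS path and the `page` field are invented for illustration; the schema is derived from the raw files when they are read, not enforced when they were written.

```python
from pyspark.sql import SparkSession

# Schema-on-read on a Hadoop data lake: raw JSON events were landed
# without any predefined structure; the schema emerges when we read.
spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Hypothetical HDFS location of raw clickstream files.
events = spark.read.json("hdfs:///lake/raw/web/clickstream/")

events.printSchema()                    # structure inferred at read time
events.groupBy("page").count().show()   # an ad hoc question, no ETL first
```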
Data lakes are for discovery too
Data warehouses evolved to answer the highly structured, everyday questions that analysts asked, typically around the transactional aspects of the business. Data lakes are meant to solve problems that are not as structured and require “discovery”.
For example, analysts may know what question they need answered, but not what combination of data and analysis will reveal the answer. A data lake allows iterative exploration and application of different, often more complex analytic functions to reveal useful insight.
Are traditional BI tools the best fit for the data lake?
The business intelligence tools that accompany a data warehouse are designed to work with the warehouse structure, allowing the analyst to “slice and dice” the data along the structure provided in the warehouse. Similarly, analytic platforms used to solve problems on data lakes need to embrace the data lake’s versatility and loose structure.
While the underlying Hadoop technology provides the versatility a data lake needs, many existing analytic platforms are not designed to take advantage of this versatility, leaving many companies struggling to get real value out of their data lakes. Analytic platforms built natively for Hadoop are needed.
These platforms are designed to use the varying data types, structures and formats found in a data lake, providing an analyst workbench that can answer a much greater array of questions, discover new hidden patterns in the data and offer highly granular yet actionable insights.
Want more from your data lake?
Native analytic platforms for Hadoop will help you get the most value out of your data lake, but you will need to select a platform that provides the versatility and flexibility your organisation needs, and allows users to be more effective.
Some elements to consider:
- Can the analytics platform combine complex data with sophisticated analysis? Identify the type of analytics the organisation wishes to conduct and the approaches and data sources that will be needed. Behavioural analytics, for example, identifies hidden patterns and clusters of attributes within data sources such as clickstream data, product usage logs and transactions. Fraud detection analytics, on the other hand, uncovers paths and identifies hidden patterns using transaction logs, geo-spatial data and event activity.
- Does the analytics platform offer data and modelling flexibility? A flexible modelling capability lets an analyst add data attributes quickly and rapidly iterate through many scenarios. The technical concept that enables this flexible modelling is schema-on-read, an evolution of the schema-on-write approach used by data warehouses, where the schema is fixed (see the sketch after this list).
- Is the analytics platform analyst friendly? Does it cover the entire analytic process, from data integration to data preparation, analysis and visualisation, and is it designed for self-service, allowing analysts to execute the entire process on their own?
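To show why schema-on-read enables that kind of iteration, here is a minimal Python sketch (the file name and attributes are invented, reusing the raw events file from the earlier sketch): each pass applies a different schema at read time, so adding an attribute needs no warehouse-style schema migration.

```python
import json

# Flexible modelling via schema-on-read: the same raw records can be
# projected through a different "schema" on every analysis pass.
def read_with_schema(path, fields):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            yield {k: record.get(k) for k in fields}

# First scenario: the analyst models just two attributes...
pass_one = list(read_with_schema("events.jsonl", ["order_id", "amount"]))

# ...then adds 'region' for the next scenario simply by reading again.
pass_two = list(read_with_schema("events.jsonl",
                                 ["order_id", "amount", "region"]))
```

In a schema-on-write warehouse, the second pass would first require altering the table and reloading the data; here the raw data never moves.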
What is certain is that data lakes are becoming an essential element in combining the breadth of data needed for advanced analytics. Now is the time to be exploring your options.
By Gary Allemann, MD of Master Data Management