How To Refine Your Data Lake Strategy For Analytics
The InfoSec Consulting Series #33
By Jay Pope
The meteoric growth in cloud use and the increasing number of Internet of Things devices mean that businesses are coping with larger volumes of data than ever before. Dealing with this data is a challenge: its volume and variety, together with the cost of storing and managing it, can prove overwhelming. One of the technologies to which enterprises are increasingly turning to cope with this problem is the data lake. But what is a data lake, and what advantages does it offer?
What Is A Data Lake?
Data lakes are often confused with data warehouses, but there are key differences that are important to understand. A data lake is a central repository where you can store data whether it is in a structured or unstructured form. It can operate at any scale and it’s possible to run analytic tools over the data. A data warehouse, on the other hand, deals only with structured data. This is aggregated into various categories in order to make it easier to access for analytics purposes. In most cases, a warehouse is intended to serve a specific purpose, whereas a data lake contains data that is less defined and can be put to a much wider range of uses.
That’s not to suggest that data lakes are without organisation. A complete lack of any kind of control can lead to a data lake becoming a data swamp. Although a data lake is not highly structured, it does use features such as metadata to allow information to be found. A correctly managed data lake also needs to have a clear governance strategy.
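To illustrate how metadata lets information be found in an otherwise loosely structured lake, here is a minimal sketch in Python. The `DatasetEntry` fields, the `Catalog` class and the rule that every dataset must name an owner are all assumptions made for this example; a production lake would normally rely on a dedicated catalog service rather than an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata describing one raw dataset held in the lake (hypothetical schema)."""
    path: str
    source: str
    owner: str
    tags: set = field(default_factory=set)


class Catalog:
    """A tiny in-memory metadata catalog, for illustration only."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Governance rule (an assumption): every dataset must name an owner.
        if not entry.owner:
            raise ValueError(f"dataset {entry.path} has no owner")
        self._entries[entry.path] = entry

    def find_by_tag(self, tag):
        # Return the paths of all datasets carrying the given tag, sorted.
        return sorted(p for p, e in self._entries.items() if tag in e.tags)


catalog = Catalog()
catalog.register(DatasetEntry("raw/iot/2024-01-01.json", "sensors", "ops-team", {"iot", "raw"}))
catalog.register(DatasetEntry("raw/crm/export.csv", "crm", "sales-team", {"customers", "raw"}))

print(catalog.find_by_tag("iot"))  # ['raw/iot/2024-01-01.json']
```

Rejecting entries without an owner is one simple way to stop unlabelled data accumulating, which is how a lake tends to drift towards a swamp.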
Data Lake Applications
The nature of the data in lakes makes them well suited to artificial intelligence applications. Data warehouses, in contrast, are better suited to traditional database techniques. One key advantage of a data lake is that its less defined nature makes it more suited to research tasks, where data scientists may not have an exact idea as to what they are seeking. Because the data is in a raw format, it can be the basis of a self-service solution where people can use data analytics tools to create their own custom reports. This makes a data lake a good source of data for use in dashboard and business reporting applications.
Setting Up A Data Lake
So, what is the roadmap to developing a data lake for your business? There are four main stages involved, which we’ll look at in more detail.
Repository
The first step is to create a repository for the data. This involves IT elements such as storage, either locally or, more commonly, in the cloud, together with relevant network connections. This allows the lake data to be stored in its raw format. You also need to identify what data is going to flow into the lake, where it originates and on what timescale it arrives. Security measures need to be considered at this stage, as does GDPR compliance, to ensure that the data you are holding is safe and legal.
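As a rough illustration of storing data in its raw format while recording where it came from and when it arrived, the sketch below writes incoming bytes unchanged into a folder named after the source and the arrival date. The `raw/<source>/<date>/` layout, the `land_raw` helper and the sample sensor reading are assumptions chosen for this example, and a temporary local directory stands in for cloud storage.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lake_root, source, arrived, filename, payload):
    """Write payload bytes unchanged under <root>/raw/<source>/<YYYY-MM-DD>/."""
    target_dir = Path(lake_root) / "raw" / source / arrived.isoformat()
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # stored as-is, with no parsing or cleaning
    return target


# A temporary directory stands in for the lake's storage (local or cloud).
lake_root = tempfile.mkdtemp()
reading = json.dumps({"device": "sensor-1", "temp_c": 21.5}).encode("utf-8")
path = land_raw(lake_root, "iot", date(2024, 1, 1), "reading.json", reading)

print(path.relative_to(lake_root).as_posix())  # raw/iot/2024-01-01/reading.json
```

Encoding the source and arrival date in the path answers the questions of origin and timescale without altering the data itself.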
Environment
Secondly, you need to create the environment in which your analytics or data science tasks can be carried out. This can be a ‘sandbox’ in which experiments can be run on the stored data and prototypes can be built to provide the information that the business needs. A range of proprietary and open-source tools may be used at this level.
Integration
The next stage is integrating the data lake with your business information systems. This might involve preparing data to load into a more structured data warehouse environment, where day-to-day business queries can be carried out. More speculative activities and experiments can still be carried out on the raw lake data.
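The preparation step described above might look something like the following sketch, which parses raw records, keeps those that fit a fixed schema and loads them into a structured table. An in-memory SQLite database stands in for the warehouse, and the `readings` table, the `prepare` function and the sample records are all invented for illustration.

```python
import json
import sqlite3

# Raw lake records, one JSON document per line (made-up sample data).
raw_lines = [
    '{"device": "sensor-1", "temp_c": 21.5}',
    '{"device": "sensor-2", "temp_c": "19.0"}',
    '{"device": "sensor-3"}',
]


def prepare(lines):
    """Parse raw records and keep only those that fit the warehouse schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if "device" not in record or "temp_c" not in record:
            continue  # incomplete records remain in the lake only
        rows.append((record["device"], float(record["temp_c"])))
    return rows


# An in-memory SQLite database stands in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE readings (device TEXT NOT NULL, temp_c REAL NOT NULL)")
warehouse.executemany("INSERT INTO readings VALUES (?, ?)", prepare(raw_lines))
warehouse.commit()

average = warehouse.execute("SELECT AVG(temp_c) FROM readings").fetchone()[0]
print(average)  # 20.25
```

The incomplete third record is skipped for the warehouse load but is still available in the raw lake data for more speculative work.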
Operational
The fourth and final stage is for the data lake to become a core component of the business’s data operations. At this stage it is likely that most, if not all, of the company’s data will be passing through the data lake. It will, therefore, have become a key component of the IT infrastructure of the business and will have replaced many of the more traditional data storage silos.
Once the business has completed all four stages, it should be able to use data-as-a-service within the organisation, in addition to being positioned to introduce machine learning and artificial intelligence applications that will create value from the stored data.
It’s important to note that building a data lake is not an end in itself. It should be a component of a wider strategy for dealing with information. It can be used to prepare data for ingestion into a data warehouse, to run one-off analyses or to act as a pool of ‘big data’ for AI applications. Implementation is often combined with an agile approach to ensure that the entire process is carried out quickly and effectively.