Agile Data Lake – Architecture Options With Open Source Tools

The InfoSec Consulting Series #34

By Jay Pope


With digital transformation steaming ahead for around half of all businesses, it’s a great time to be an architect. Big data is everywhere; businesses are finding it so easy to collect data that they just get on with it. Here’s the thing, though: according to Gartner, around 85% of these projects fail. Projects fail for many reasons, and big data presents its own unique set of challenges. These can be exacerbated by gung-ho development, where teams start to build code without proper architectural oversight. It is all too easy to create a “data swamp” where end users simply cannot find the information they want. Here we examine some of the challenges we face as architects and the typical requirements we must satisfy. We go on to propose a solution, an Agile Data Lake. It’s the opposite of a data swamp.


Architecture Challenges

As architects, a big data project presents us with a set of complex challenges. Each implementation will have its own technical and business variations, but they can be distilled into three generic questions:

  • What data is available?
  • Is it trustworthy?
  • How can the data from different sources be integrated?

These challenges are amplified by the proliferation of technologies and perceptions of the people involved.

The data originates from different source systems and technologies. There could be on-premises data in a database, cloud-based data in a transaction processing system, or a hybrid of the two. The data may already be clean and of high quality, or it may need to be cleansed. Finally, as new business systems are implemented, the number of data sources increases.

Each business area has a distinct perception of its own data. Staff will recognise the data using their own business terminology, specific to their roles and the way they use the data. Our challenge is to govern and unify the data while continuing to support this range of perceptions.


Architecture Requirements

The project’s architecture requirements will contain both functional and non-functional needs:

  • A Shared Business Vocabulary (SBV) is required to identify and define the data. A technology such as Apache Atlas could be used;
  • Data must be organised into a centralised logical structure;
  • Data must be segregated into zones: a transient zone for short-lived data, a raw zone for the maintenance of raw data, a trusted zone for validated data and possibly a refined zone for enriched data from external tools;
  • Data must be cleansed and integrated, using a staging area such as Hadoop (HDFS) or cloud storage;
  • Lifecycle processes must be defined for data ingestion, adaptation and consumption;
  • Automated techniques may be necessary for discovering and cataloguing disparate data and data relationships;
  • The traceability or lineage of the data must be preserved;
  • Governance processes must be defined for data classification and policies;
  • Data access and security must be defined, with access coordinated on all platforms and provision for future audits.
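
The zone and lineage requirements above can be sketched in a few lines of Python. This is a minimal illustration, not a reference design: the Dataset class, the promote function and the strict one-way promotion order are all assumptions made for the example, and a real data lake would hold this information in a catalogue rather than in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Zone names follow the requirements list; the strict promotion order is an assumption.
ZONES = ["transient", "raw", "trusted", "refined"]


@dataclass
class Dataset:
    """Hypothetical catalogue entry for one data set."""
    name: str
    zone: str = "transient"
    lineage: list = field(default_factory=list)  # audit trail of zone moves


def promote(dataset: Dataset, reason: str) -> Dataset:
    """Move a data set to the next zone and record the step in its lineage."""
    position = ZONES.index(dataset.zone)
    if position == len(ZONES) - 1:
        raise ValueError(f"{dataset.name} is already in the final zone")
    target = ZONES[position + 1]
    dataset.lineage.append({
        "from": dataset.zone,
        "to": target,
        "reason": reason,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    dataset.zone = target
    return dataset


orders = Dataset("orders_2024")
promote(orders, "landed from source system")
promote(orders, "passed validation checks")
print(orders.zone)  # trusted
print([(step["from"], step["to"]) for step in orders.lineage])
# [('transient', 'raw'), ('raw', 'trusted')]
```

Because every move is appended to the lineage list, a later audit can reconstruct how a data set reached the trusted zone and why.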


Architecture Solution – Agile Data Lake

Wikipedia defines a data lake as a “repository of data” which is held in its “natural/raw” format. It may include structured, semi-structured and unstructured data. Wikipedia also describes a data swamp as a “deteriorated and unmanaged” data lake, one that therefore provides little value to the business.

One issue that has dogged the management of big data projects is that of creating a fixed and inflexible data structure.

A solution to the issue is to adopt Agile thinking. An Agile Data Lake is one in which the data structure is not frozen. Rather, the structure is defined when the data is needed and the specific data requirements are known. An Agile Data Lake will therefore support changes to the data model as the business grows and changes, for example when new business systems are implemented. To be truly Agile, our Data Lake Management must:

  • Allow the data model to be extended;
  • Handle all data additions as inserts – there must be no presumption of update or delete;
  • Allow for scaling of both the data and its usage;
  • Allow for future technology such as AI and automated processing.
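
The first two points, an extensible data model and insert-only additions, can be illustrated with a small sketch. The AppendOnlyStore class and its method names are hypothetical, chosen for this example only. Payloads are stored as JSON text so that a new field can appear at any time without changing the store.

```python
import json


class AppendOnlyStore:
    """Hypothetical insert-only store: a change arrives as a new version."""

    def __init__(self):
        self._log = []  # records are only ever appended, never updated or deleted

    def insert(self, key, payload):
        """Append a new version of the record and return its version number."""
        version = sum(1 for record in self._log if record["key"] == key) + 1
        self._log.append(
            {"key": key, "version": version, "payload": json.dumps(payload)}
        )
        return version

    def latest(self, key):
        """Return the most recent payload for a key, or None if unknown."""
        for record in reversed(self._log):
            if record["key"] == key:
                return json.loads(record["payload"])
        return None

    def history(self, key):
        """Return every version of a record, oldest first."""
        return [json.loads(r["payload"]) for r in self._log if r["key"] == key]


store = AppendOnlyStore()
store.insert("cust-1", {"name": "Ann"})
# A new business system adds a field; the model is extended, not migrated.
store.insert("cust-1", {"name": "Ann", "segment": "retail"})
print(store.latest("cust-1"))        # {'name': 'Ann', 'segment': 'retail'}
print(len(store.history("cust-1")))  # 2
```

Since nothing is overwritten, earlier versions remain available for lineage and audit purposes.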

An Agile Data Lake will support the storage of structured, semi-structured and unstructured data. It will hold data in transient, raw, trusted and enriched forms but notably, the structure will not limit its use for reporting, analytics and so forth.

As architects, we are familiar with patterns. In a data lake, we need to adopt a pattern that is driven by the metadata and that implements requirements for governance and security of data.
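
As a rough illustration of a metadata-driven pattern, the sketch below makes an access decision purely from catalogue metadata. The data set names, classifications, roles and the can_read function are all invented for the example; in practice this metadata would live in a governance tool rather than in Python dictionaries.

```python
# Hypothetical catalogue: metadata describing each data set.
CATALOG = {
    "customer_email": {"zone": "trusted", "classification": "confidential"},
    "product_price": {"zone": "trusted", "classification": "public"},
}

# Hypothetical policy: the classifications each role is allowed to read.
POLICY = {
    "analyst": {"public", "internal"},
    "data_steward": {"public", "internal", "confidential"},
}


def can_read(role, dataset):
    """Decide access from metadata alone, never from the data itself."""
    entry = CATALOG.get(dataset)
    if entry is None:
        return False  # uncatalogued data is treated as inaccessible
    return entry["classification"] in POLICY.get(role, set())


print(can_read("analyst", "product_price"))        # True
print(can_read("analyst", "customer_email"))       # False
print(can_read("data_steward", "customer_email"))  # True
```

The design choice worth noting is the default: data that has no catalogue entry is refused, which discourages the ungoverned additions that turn a lake into a swamp.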

One element of the pattern is the data lifecycle, which consists of three distinct phases:

  • Ingestion into a staging area for processing;
  • Adaptation for use by the business or for further processing;
  • Consumption by analytics, data mining, data aggregation, visualisation and reporting.
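
The three phases can be sketched as three small functions chained together. This is a toy example under stated assumptions: the CSV sample, the function names and the rule of dropping rows with a non-numeric amount are all illustrative.

```python
import csv
import io
from collections import defaultdict

# Illustrative source data; the second row has an invalid amount.
RAW_CSV = "region,amount\nnorth,10\nsouth,abc\nnorth,5\n"


def ingest(text):
    """Ingestion: land the source rows in a staging list, unchanged."""
    return list(csv.DictReader(io.StringIO(text)))


def adapt(rows):
    """Adaptation: keep rows whose amount parses as a number, and type it."""
    clean = []
    for row in rows:
        try:
            clean.append({"region": row["region"], "amount": float(row["amount"])})
        except ValueError:
            continue  # a real pipeline would quarantine rejected rows
    return clean


def consume(rows):
    """Consumption: aggregate the adapted rows for reporting."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


print(consume(adapt(ingest(RAW_CSV))))  # {'north': 15.0}
```

Keeping the phases as separate steps means the staged rows stay untouched, so a different adaptation can be applied later when new requirements emerge.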


Open Source Tools & Technologies

We will already be familiar with the options for data storage, for example:

  • RDBMS tools such as MySQL, PostgreSQL and SQLite
  • Non-relational tools such as Cassandra and MongoDB

To handle the sheer variety and volume of the data, we will need to implement a distributed data processing framework. Apache Hadoop is a collection of open-source software with exactly this purpose. Its core consists of four modules:

  • The Hadoop Distributed File System (HDFS), which stores data across multiple machines in a cluster;
  • MapReduce, a programming model for processing large data sets in parallel across the cluster;
  • Hadoop Common, the shared Java libraries and utilities that support the other modules;
  • YARN, which schedules jobs and manages the cluster’s compute resources.
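
The MapReduce model itself is simple enough to show in plain Python. The sketch below runs in a single process and does not use Hadoop at all; it only illustrates the map, shuffle and reduce steps that a cluster would distribute across many machines. The function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    """Run the three steps in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))  # each call could run on a different node
    return reduce_phase(shuffle(pairs))


print(word_count(["data lake", "Data swamp"]))
# {'data': 2, 'lake': 1, 'swamp': 1}
```

Because each map call depends only on its own document, and each reduce depends only on one key, the work can be split across a cluster without the steps needing to share state.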

Hadoop can also be extended with a large set of components for specific processing requirements. Tools such as Apache Pig can be used for analysing large data sets.

Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. It provides a set of integrated libraries, including Spark SQL for structured data and MLlib for machine learning. It is often faster than Hadoop MapReduce because it can keep intermediate data in memory rather than writing it back to storage between steps. This also makes it well suited to iterative workloads such as machine learning.


Data Lake Or Data Swamp?

By applying Agile thinking and establishing good architectural practices, we have an opportunity to build big data projects we can be proud of. End-users will be able to find the information they need and the system will grow and scale along with the needs of the business. After all, no-one wants to be stuck in a swamp.

