What is the data lake solution?

Many Amazon Web Services (AWS) customers require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is an increasingly popular way to store and analyze data because it allows businesses to store all of their data, structured and unstructured, in a centralized repository. The AWS Cloud provides many of the building blocks required to help businesses implement a secure, flexible, and cost-effective data lake.

To support our customers as they build data lakes, AWS offers the data lake solution, which is an automated reference implementation that deploys a highly available, cost-effective data lake architecture on the AWS Cloud. The solution is intended to address common customer pain points around conceptualizing data lake architectures, and automatically configures the core AWS services necessary to easily tag, search, share, and govern specific subsets of data across a business or with other external businesses. This solution allows users to catalog new datasets, and to create data profiles for existing datasets in Amazon Simple Storage Service (Amazon S3) with minimal effort.

For the full solution overview, visit Data Lake on AWS.

What can I do with the data lake solution?

When you build a data lake, your main customers are the business users who will consume the data and use it for analysis. The most important things those customers are trying to achieve are agility and innovation. You made a bet when you decided to store data in your lake; your customers are looking to cash it in quickly when they start their next project.

Once your data lake is mature, it will undoubtedly feed several data marts, such as reporting systems or enterprise data warehouses. Using the data lake as a source for specific business systems is a recognized best practice. However, if that were all you needed to do, you wouldn't need a data lake. A data lake comes into its own when you need to implement change, whether adapting an existing system or building a new one. These projects are built on an opportunity for competitive advantage and need to move as quickly as possible. Your data lake customers need to be agile: they want their projects to either succeed quickly or fail fast and cheaply.

The data lake solution on AWS has been designed to solve these problems by managing metadata alongside the data. You can use this metadata to provide a rich description of the data you are storing. A data lake stores raw data, so the quality of the data you store will not always be perfect (if you take steps to improve the quality of your data, you are no longer storing raw data). However, if you use metadata to give visibility into where your data came from, its lineage, and its imperfections, you will have an organized data lake whose customers can quickly find the data they need for their projects.
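
As a concrete illustration of what such metadata might look like, the record below describes a hypothetical raw dataset, capturing its source, lineage, and known imperfections. The field names are illustrative assumptions, not the solution's actual schema:

```python
# Hypothetical metadata record for a raw dataset in the lake.
# Every field name here is illustrative, not the solution's schema.
dataset_metadata = {
    "name": "weather-station-readings-2017-06",
    "source": "field-sensor-network",
    "lineage": "raw upload; no transformations applied",
    "known_issues": [
        "station 042 reports intermittent null temperature readings",
        "timestamps are recorded in local time, not UTC",
    ],
}
```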

The data lake solution is designed to manage a persistent catalog of datasets in Amazon S3, along with the business-relevant tags associated with each dataset. It also configures an AWS Glue crawler within each data package and schedules a daily scan to keep track of changes. The crawlers go through your datasets and inspect portions of them to infer a data schema, persisting the output of this process as one or more metadata tables defined in your AWS Glue Data Catalog.
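
The solution provisions these crawlers for you, but the equivalent setup can be sketched with a few AWS SDK for Python (boto3) calls. The crawler name, IAM role, database name, and S3 path below are placeholder assumptions:

```python
import boto3

glue = boto3.client("glue")

# Sketch of what the solution configures for each data package:
# a crawler over the package's S3 prefix, scheduled to run daily.
glue.create_crawler(
    Name="data-lake-package-weather-stations",
    Role="arn:aws:iam::123456789012:role/data-lake-glue-crawler-role",
    DatabaseName="data_lake_catalog",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake-bucket/packages/weather-stations/"}]},
    Schedule="cron(0 2 * * ? *)",  # run once a day at 02:00 UTC
)

# After a crawl, the inferred schemas appear as tables in the Glue Data Catalog.
tables = glue.get_tables(DatabaseName="data_lake_catalog")
for table in tables["TableList"]:
    print(table["Name"], [col["Name"] for col in table["StorageDescriptor"]["Columns"]])
```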

Once a dataset is cataloged within the data lake solution, its attributes and associated tags are indexed by the data lake search engine. This enables users to search and browse the datasets available within the data lake and select the datasets they need access to. The solution keeps track of the datasets each user selects and can generate a manifest file with secure access links to the selected content. Users then use the manifest file to access and process the datasets as their particular business requirements dictate.
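
The solution documentation does not spell out the manifest format here, but a common way to provide secure, time-limited access links to S3 content is presigned URLs. The sketch below assumes that mechanism; the bucket, keys, and manifest layout are illustrative:

```python
import json
import boto3

s3 = boto3.client("s3")

# Hypothetical cart of datasets a user has selected in the search UI.
selected_datasets = [
    {"bucket": "my-data-lake-bucket", "key": "packages/weather-stations/2017/06/readings.csv"},
    {"bucket": "my-data-lake-bucket", "key": "packages/weather-stations/2017/06/stations.json"},
]

# Build a manifest of time-limited, signed links to the selected content.
manifest = {
    "entries": [
        {
            "key": item["key"],
            "url": s3.generate_presigned_url(
                "get_object",
                Params={"Bucket": item["bucket"], "Key": item["key"]},
                ExpiresIn=3600,  # links expire after one hour
            ),
        }
        for item in selected_datasets
    ]
}

print(json.dumps(manifest, indent=2))
```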

Data lake solution concepts

The central concept of this data lake solution is a package. This is a container in which you can store one or more files. You can also tag the package with metadata so you can easily find it again.

For example, the data you need to store may come from a vast network of weather stations. Perhaps each station sends several files containing sensor readings every 5 minutes. In this case, you would build a package each time a weather station sends data. The package would contain all of the sensor reading files and would be tagged with metadata, such as the location of the station and the date and time at which the readings were taken. You can configure the data lake solution to require that all packages carry certain tags, which helps ensure you maintain visibility into the data added to your lake.
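
In the solution itself, packages and their tags are managed through the data lake console and API. As a rough sketch of the same idea using plain S3 object tags (an assumption for illustration, not the solution's tagging mechanism), each reading file could be tagged like this:

```python
import boto3

s3 = boto3.client("s3")

# Tag one sensor-reading file with its station and capture time.
# Bucket, key, and tag names are placeholders for illustration only.
s3.put_object_tagging(
    Bucket="my-data-lake-bucket",
    Key="packages/station-042/2017-06-01T12-05/readings.csv",
    Tagging={
        "TagSet": [
            {"Key": "station-id", "Value": "042"},
            {"Key": "location", "Value": "Seattle-WA"},
            {"Key": "captured-at", "Value": "2017-06-01T12:05:00Z"},
        ]
    },
)
```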

We Want to Hear from You

We welcome your feedback. To contact us, visit the AWS Solutions Forum.