Working with packages

The central component of this data lake solution is a package. This is a container in which you can store one or more files. You can also tag the package with metadata so you can easily find it again. Creating a package allows businesses to store and share data of any size in its native form, together with business-relevant metadata. Once a package is created, users can search across the data lake to quickly find data and consume it in a way that fits their business needs.

Note: Organizations can require specific tags when creating packages; fields marked with an asterisk are required.

Create a Package

  1. In the provided text field, enter the Package Name.
  2. In the provided text field, enter the package Description.

    Note: The description should detail what the package contains and what business need it is best suited for.

  3. Enter any additional metadata that your business requires for all new packages to enforce conformity.

  4. Select the Visibility of the package. This sets permissions for the groups of users that can search this package. For more information, see Working with groups.
  5. Select Create Package. Once the package is created, you can add files and additional metadata to the package to build its contents.

Adding Metadata to a Package

Once the package is created, you can begin adding additional metadata tags.

  1. Select Add Metadata to add tags.
  2. In the provided text field, enter the Tag name.
  3. In the provided text field, enter the Value.

    Note: To edit an existing tag, select Add Metadata, enter the existing tag name and change the Value. This will overwrite the existing tag.

Adding Content to a Package

You can add content to a new or existing package. You can add content to your package by either uploading a local file or linking existing data you already have stored in Amazon S3.

  1. Select File Name or Manifest File.
  2. Select the folder icon to the right to begin browsing related files available for upload.

    Note: If uploaded data does not automatically populate, you can select the refresh button to update the package contents. To delete a file, select the "x" to the right of the uploaded file.

Using manifest files

If you are linking existing content in Amazon S3, you will need to create a manifest file that provides the location of the content. When importing data, you must specify a single include path for each data store entry. The syntax is bucket-name/folder-name/file-name.ext. To include all objects in a bucket, specify just the bucket name in the include path.

After you specify an include path, you can exclude objects from being inspected by the AWS Glue crawler. The solution supports the same exclude patterns as AWS Glue. For more information, see the AWS Glue service documentation. The following example shows the JSON to link existing Amazon S3 files.

{
    "dataStore": [
        {
            "includePath": "s3://<sample-bucket-name>/nyc-taxi-tlc/yellow/",
            "excludePatterns": ["2010/*", "2011/*", "2012/*", "2013/*", "2014/*"]
        },
        {
            "includePath": "s3://<sample-bucket-name>/nyc-taxi-tlc/green/",
            "excludePatterns": ["2013/*", "2014/*"]
        },
        {
            "includePath": "s3://<sample-bucket-name>/nyc-taxi-tlc/fhv/"
        }
    ]
}

Important: data-lake-packages-role and data-lake-package-crawler-role must have access to the Amazon S3 locations of your imported datasets.
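
If a linked dataset cannot be crawled, verify that both roles can read the source bucket. The following is a minimal sketch, assuming you grant that access with an inline policy attached through boto3; the bucket name and policy name are placeholders, and the same statement can be added from the IAM console instead.

# Sketch: grant the data lake roles read access to a bucket that holds
# imported datasets. "<sample-bucket-name>" and the policy name are
# placeholders; adjust them for your environment.
import json
import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::<sample-bucket-name>",
                "arn:aws:s3:::<sample-bucket-name>/*"
            ]
        }
    ]
}

for role_name in ["data-lake-packages-role", "data-lake-package-crawler-role"]:
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="imported-dataset-read-access",
        PolicyDocument=json.dumps(policy)
    )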

Integrations

This solution automatically configures an AWS Glue crawler within each data package and schedules a daily scan to keep track of changes. The crawler goes through your datasets, inspects portions of them to infer the data schema, and persists the output as one or more metadata tables defined in your AWS Glue Data Catalog.
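
If you do not want to wait for the next scheduled scan, you can also run a package's crawler on demand. The sketch below is illustrative and uses boto3; the crawler name is a placeholder for the name the solution assigned to your package, which you can look up in the AWS Glue console.

# Sketch: start a package's AWS Glue crawler outside its daily schedule
# and wait for it to finish. The crawler name is a placeholder.
import time
import boto3

glue = boto3.client("glue")
crawler_name = "<your-package-crawler-name>"

glue.start_crawler(Name=crawler_name)

# Poll until the crawler returns to the READY state.
while glue.get_crawler(Name=crawler_name)["Crawler"]["State"] != "READY":
    time.sleep(30)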

Once created, this catalog provides a unified metadata repository across a variety of data sources and formats. It integrates with Amazon Athena and Amazon Redshift Spectrum, so you can interactively query and analyze data directly in your data lake, and with Amazon EMR, AWS Glue extract, transform, and load (ETL) jobs, and any application compatible with the Apache Hive metastore, so you can categorize, clean, enrich, and move your data.

To view the generated AWS Glue tables, select the external links located after each table name. This will redirect you to the AWS Glue console.
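
You can also list the generated tables programmatically. The sketch below uses boto3; the database name is a placeholder for the AWS Glue database associated with your package.

# Sketch: list the metadata tables the crawler created for a package.
# The database name is a placeholder.
import boto3

glue = boto3.client("glue")

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="<your-package-database>"):
    for table in page["TableList"]:
        columns = [col["Name"] for col in table["StorageDescriptor"]["Columns"]]
        print(table["Name"], columns)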

To view table data, select the View Data external links. This will redirect you to the Amazon Athena console.
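
The same tables can be queried through the Athena API as well. The sketch below is illustrative; the database, table, and query result location are placeholders that depend on your deployment.

# Sketch: run an Athena query against a table in a package's database.
# Database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM yellow LIMIT 10",
    QueryExecutionContext={"Database": "<your-package-database>"},
    ResultConfiguration={"OutputLocation": "s3://<sample-bucket-name>/athena-results/"}
)
print("Query execution id:", response["QueryExecutionId"])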

History of a Package

Select the History Tab to see a log of activities associated with the data lake package.

Editing a package

Package details can only be edited by the original creator or a data lake administrator. Editing an existing data package will overwrite the current package.

Delete a package

Deleting a package will remove the entry from the data lake and delete the dataset files from Amazon S3. Note that the package's AWS Glue crawler is not deleted automatically and must be deleted manually.


See Also

Searching the data lake

Working with my cart
