
Preparing to run the AWS Glue collector

Note

The latest version of the Collector is 2.256. For details on this version and all previous versions, see the release notes.

Setting up pre-requisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item

Requirement

Hardware (for on-premise runs only)

Note: The following specs are based upon running one collector process at a time. Please adjust the hardware if you are running multiple collectors at the same time.

RAM

8 GB

CPU

2 GHz processor

Software (for on-premise runs only)

Docker

Click here to get Docker.

data.world specific objects (for both cloud and on-premise runs)

Dataset

You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector.

If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.

Network connection

Allowlist IPs and domains
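For on-premise runs, the Docker and RAM requirements in Table 1 can be checked before the first run. The following is a minimal sketch, not part of the collector: the function names are hypothetical, the RAM lookup relies on `os.sysconf` values that are only available on some Unix-like systems, and the CPU clock speed is not checked.

```python
import os
import shutil

# 8 GB, taken from Table 1. A machine sold as "8 GB" may report slightly
# less usable memory, so treat a near miss as a prompt to check manually.
REQUIRED_RAM_BYTES = 8 * 1024**3


def docker_available() -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def total_ram_bytes():
    """Return physical RAM in bytes, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf is missing (e.g. Windows) or the names are unsupported.
        return None


def preflight() -> dict:
    """Summarize the checks; ram_ok is None when RAM could not be determined."""
    ram = total_ram_bytes()
    return {
        "docker_on_path": docker_available(),
        "ram_ok": None if ram is None else ram >= REQUIRED_RAM_BYTES,
    }


if __name__ == "__main__":
    print(preflight())
```

A result of `ram_ok: None` means the check could not be performed on that platform, not that the requirement failed.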



Preparing AWS Glue for the collector

Creating a user

  • Create a user for running the collector. The user must have the following permissions.

    Table 2.

    | Permissions | AWS Glue API | Object cataloged using the permission |
    |---|---|---|
    | glue:GetDatabases | GetDatabases | All databases in the specified data catalog. |
    | glue:GetTables | GetTables | All tables in the specified database. |
    | glue:ListJobs | ListJobs | All jobs in the specified AWS account. |
    | glue:GetJob | GetJob | The details of the specified job. |
    | s3:GetObject, kms:Decrypt | GetObject, kms:Decrypt | Used to retrieve and parse the job script to catalog the source and sink for the job. |

    Note: AWS Glue scripts uploaded to S3 are encrypted, so the kms:Decrypt permission is required in addition to s3:GetObject to read them. Unless both permissions are granted, the collector cannot harvest lineage metadata from job scripts containing annotations. Jobs and tables are still harvested.
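The permissions in Table 2 can be collected into a single IAM policy document and attached to the collector user. The sketch below builds such a document as a Python dictionary. The function name, the `Sid` value, and the default `Resource` of `"*"` are assumptions made for illustration; in practice you would likely narrow the resource to the specific catalog, script bucket, and KMS key.

```python
import json

# Actions listed in Table 2.
COLLECTOR_ACTIONS = [
    "glue:GetDatabases",
    "glue:GetTables",
    "glue:ListJobs",
    "glue:GetJob",
    "s3:GetObject",
    "kms:Decrypt",
]


def build_collector_policy(resource: str = "*") -> dict:
    """Build an IAM policy document that allows the Table 2 actions.

    `resource` defaults to "*" for brevity only (an assumption of this
    sketch); restrict it to the relevant ARNs where possible.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GlueCollectorReadOnly",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": list(COLLECTOR_ACTIONS),
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM policy editor.
    print(json.dumps(build_collector_policy(), indent=2))
```

Running the script prints the policy as JSON. Omitting `s3:GetObject` or `kms:Decrypt` from the list would still allow jobs and tables to be harvested, but not lineage from job scripts.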



Setting up credentials file

  • Set up an AWS credentials file for authentication. The file contains the user profile that determines which AWS account's instance is cataloged. Typically, the AWS_CREDENTIALS_FILE is located at [user’s home directory]/.aws/credentials. For information on setting up this file, see the AWS documentation on configuration and credential file settings.
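An AWS credentials file uses an INI-style layout: each profile is a section header in square brackets followed by `aws_access_key_id` and `aws_secret_access_key` entries. The sketch below shows a sample file and a small check that a named profile has both keys. The profile name `glue-collector`, the placeholder key values, and the helper function are hypothetical and are not part of the collector.

```python
import configparser

# Keys a profile needs for access-key authentication.
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")

# Sample credentials file content. The profile name and the values are
# placeholders; substitute the keys of the user created for the collector.
SAMPLE_CREDENTIALS = """\
[glue-collector]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""


def profile_is_complete(credentials_text: str, profile: str) -> bool:
    """Return True if `profile` exists and has a non-empty value for each required key."""
    # Interpolation is disabled so that '%' characters in values are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return False
    return all(
        parser.get(profile, key, fallback="").strip() for key in REQUIRED_KEYS
    )


if __name__ == "__main__":
    print(profile_is_complete(SAMPLE_CREDENTIALS, "glue-collector"))
```

To check a real file, read its contents from the credentials path and pass the text, along with the profile name you intend to use for the collector, to `profile_is_complete`.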