Preparing to run the AWS Glue collector (legacy version)

The latest version of the Collector is 2.294. To view the release notes for this version and all previous versions, please go here.

Setting up pre-requisites for running the collector

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item	Requirement
Hardware (for on-premise runs only) Note: The following specifications assume one collector process running at a time. Increase them if you run multiple collectors at the same time.
RAM	8 GB
CPU	2 GHz processor
Software (for on-premise runs only)
Docker	Click here to get Docker.
data.world specific objects (for both cloud and on-premise runs)
Dataset	You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow these instructions to prepare the datasets for collectors.
Network connection
Allowlist IPs and domains	Follow these instructions to configure your network. Use these tools to check network connections before running the collector.
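
The on-premise requirements in Table 1 can be spot-checked before a run. The following is a minimal sketch, not part of the collector: the function names are hypothetical, the RAM lookup relies on `os.sysconf` and so only works on Unix-like systems, and a machine sold as "8 GB" may report slightly less than 8.0 here.

```python
import os
import shutil

MIN_RAM_GB = 8  # RAM requirement from Table 1


def docker_available() -> bool:
    """Return True if a `docker` executable is found on PATH."""
    return shutil.which("docker") is not None


def total_ram_gb():
    """Return total physical RAM in GiB, or None if it cannot be determined."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are unavailable on this platform
        return None
    return pages * page_size / 1024 ** 3


if __name__ == "__main__":
    print("Docker on PATH:", docker_available())
    ram = total_ram_gb()
    if ram is None:
        print("RAM: could not be determined on this platform")
    else:
        print(f"RAM: {ram:.1f} GiB (Table 1 asks for {MIN_RAM_GB} GB)")
```

The script only reports what it finds; it does not check CPU speed or network allowlisting.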

Create a user for running the collector. The user must have the following permissions.

Table 2.

Permissions	AWS Glue API	Object cataloged using the permission
glue:GetDatabases	GetDatabases	All databases in the specified data catalog.
glue:GetTables	GetTables	All tables in the specified database.
glue:ListJobs	ListJobs	All jobs in the specified AWS account.
glue:GetJob	GetJob	The details of the specified job.
s3:GetObject, kms:Decrypt	GetObject, kms:Decrypt	Used to retrieve and parse the job script to catalog the source and sink for the job. Note: AWS Glue scripts uploaded to S3 are encrypted, so the kms:Decrypt permission is required to read them. If either permission is missing, the collector cannot harvest lineage metadata from job scripts containing annotations. However, jobs and tables are still harvested.
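
The permissions in Table 2 can be expressed as an IAM policy document. The sketch below builds one as JSON. The action names come from Table 2; the `Sid` values, the `build_policy` helper, and the use of `"Resource": "*"` are illustrative assumptions, and you will likely want to restrict the resources to the specific catalog, script bucket, and KMS key in your account.

```python
import json

# Actions listed in Table 2
GLUE_ACTIONS = [
    "glue:GetDatabases",
    "glue:GetTables",
    "glue:ListJobs",
    "glue:GetJob",
]
SCRIPT_ACTIONS = ["s3:GetObject", "kms:Decrypt"]


def build_policy(include_script_access: bool = True) -> dict:
    """Build an IAM policy document granting the Table 2 permissions.

    "Resource": "*" is an assumption made for brevity; narrow it to
    specific ARNs where possible.
    """
    statements = [
        {
            "Sid": "GlueMetadataRead",  # hypothetical statement ID
            "Effect": "Allow",
            "Action": GLUE_ACTIONS,
            "Resource": "*",
        }
    ]
    if include_script_access:
        # Needed only for harvesting lineage from job scripts
        statements.append(
            {
                "Sid": "GlueScriptRead",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": SCRIPT_ACTIONS,
                "Resource": "*",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
```

Calling `build_policy(include_script_access=False)` yields a policy that still allows jobs and tables to be harvested, but not lineage from job scripts.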

Set up an AWS credentials file for authentication. The file contains the user profile that determines which AWS account's instance is cataloged. Typically the AWS_CREDENTIALS_FILE is at [user’s home directory]/.aws/credentials. See the AWS documentation on configuration and credential file settings for information on setting up this file.
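
As a quick sanity check, the credentials file can be read with Python's standard `configparser`, because it uses an INI-style layout with one section per profile. The sketch below is illustrative only: the `glue-collector` profile name and all key values are made-up placeholders, and the helper functions are hypothetical, not part of the collector or of any AWS tooling.

```python
import configparser
from pathlib import Path

# Placeholder content in the shape of an AWS credentials file.
# The "glue-collector" profile name and the values are hypothetical.
SAMPLE = """\
[default]
aws_access_key_id = EXAMPLE_ACCESS_KEY_ID
aws_secret_access_key = EXAMPLE_SECRET_ACCESS_KEY

[glue-collector]
aws_access_key_id = EXAMPLE_ACCESS_KEY_ID_2
aws_secret_access_key = EXAMPLE_SECRET_ACCESS_KEY_2
"""

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def _parse(text: str) -> configparser.RawConfigParser:
    # RawConfigParser avoids interpolation problems with "%" in secrets
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    return parser


def list_profiles(text: str) -> list:
    """Return the profile names defined in the credentials text."""
    return _parse(text).sections()


def missing_keys(text: str, profile: str) -> list:
    """Return the required keys that the given profile does not define."""
    parser = _parse(text)
    if not parser.has_section(profile):
        return list(REQUIRED_KEYS)
    return [k for k in REQUIRED_KEYS if not parser.has_option(profile, k)]


if __name__ == "__main__":
    path = Path.home() / ".aws" / "credentials"  # typical location
    text = path.read_text() if path.exists() else SAMPLE
    for name in list_profiles(text):
        print(name, "missing:", missing_keys(text, name))
```

The script prints only profile names and missing key names, never the key values themselves.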

In this section: