Preparing to run the AWS Glue collector
Note: The latest version of the Collector is 2.256. See the release notes for details on this version and all previous versions.
Setting up prerequisites for running the collector
Make sure that the machine on which you run the collector meets the following hardware and software requirements.
Item | Requirement |
---|---|
Hardware (for on-premise runs only) Note: The following specs are based upon running one collector process at a time. Please adjust the hardware if you are running multiple collectors at the same time. | |
RAM | 8 GB |
CPU | 2 GHz processor |
Software (for on-premise runs only) | |
Docker | Docker must be installed on the machine. |
data.world specific objects (for both cloud and on-premise runs) | |
Dataset | You must have a ddw-catalogs dataset set up to hold your catalog files when you are done running the collector. If you are using Catalog Toolkit, follow the Catalog Toolkit instructions to prepare the datasets for collectors. |
Network connection | |
Allowlist IPs and domains | |
Preparing AWS Glue for the collector
Creating a user
Create a user for running the collector. The user must have the following permissions.
Table 2. Permissions
AWS Glue API | Object cataloged using the permission |
---|---|
glue:GetDatabases | All databases in the specified data catalog. |
glue:GetTables | All tables in the specified database. |
glue:ListJobs | All jobs in the specified AWS account. |
glue:GetJob | The details of the specified job. |
s3:GetObject, kms:Decrypt | These permissions are used to retrieve and parse the job script to catalog the source and sink for the job. Note: AWS Glue scripts uploaded to S3 are encrypted, so the kms:Decrypt permission is required to read them. Unless both permissions are granted, the collector cannot harvest lineage metadata from job scripts containing annotations. However, jobs and tables are still harvested. |
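The permissions above could be collected into a single IAM policy document. The following is a minimal sketch, not an official policy from data.world: the helper name build_policy is hypothetical, and the wildcard Resource is an assumption made for brevity that you would normally narrow to the specific catalog, bucket, and KMS key ARNs in your account.

```python
import json

# Actions listed in the permissions table above.
COLLECTOR_ACTIONS = [
    "glue:GetDatabases",
    "glue:GetTables",
    "glue:ListJobs",
    "glue:GetJob",
    "s3:GetObject",   # read the job script from S3
    "kms:Decrypt",    # decrypt the encrypted job script
]


def build_policy(actions=COLLECTOR_ACTIONS, resource="*"):
    """Return an IAM policy document (as a dict) that allows the given actions.

    resource="*" is an illustrative assumption; restrict it in real use.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": resource,
            }
        ],
    }


# Print the JSON so it can be pasted into the IAM console or saved to a file.
print(json.dumps(build_policy(), indent=2))
```

If lineage from job scripts is not needed, the last two actions could be dropped from the list; as noted above, jobs and tables are still harvested without them.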
Setting up credentials file
Set up an AWS credentials file for authentication. The file contains the user profile that determines which AWS account's instance is cataloged. Typically the AWS_CREDENTIALS_FILE is at [user’s home directory]/.aws/credentials. See the AWS documentation on configuration and credential file settings for information on setting up this file.
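Before running the collector, it may help to confirm that the credentials file contains the profile you intend to use. The sketch below is an illustration only: the function name check_profile and the profile name "default" are hypothetical, and it assumes the profile uses static access keys (aws_access_key_id and aws_secret_access_key) rather than role-based or process-based credentials.

```python
import configparser
from pathlib import Path

# Keys expected in a profile that uses static access keys (an assumption).
REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def check_profile(credentials_path, profile):
    """Return a list of problems for `profile`; an empty list means none were found."""
    path = Path(credentials_path)
    if not path.is_file():
        return [f"credentials file not found: {path}"]

    # The credentials file is INI-formatted; interpolation is disabled so that
    # '%' characters in secret values are never interpreted.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(path)

    if not parser.has_section(profile):
        return [f"profile [{profile}] not found in {path}"]

    return [
        f"profile [{profile}] is missing {key}"
        for key in REQUIRED_KEYS
        if not parser.has_option(profile, key)
    ]


if __name__ == "__main__":
    # Typical location described above; "default" is a placeholder profile name.
    problems = check_profile(Path.home() / ".aws" / "credentials", "default")
    print("\n".join(problems) if problems else "profile looks usable")
```

The check only inspects the file's structure; it does not contact AWS or validate that the keys are active.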