About the AWS Glue collector

Use this collector to harvest tables, databases, columns, and jobs from AWS Glue.

Important

The AWS Glue collector can be run in the cloud or on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.200. For details on this version and all previous versions, see the release notes.

What is cataloged

The collector catalogs the following information.

Table 1.

Object | Information collected
Database | Name
Table | Name
Column | Name, Column index, Column type, Column size, JDBC type
Job | Name



Relationship between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page | Relationship
Database | Table that is in the database
Table | Database that contains the table



Lineage for AWS Glue

Table 3.

Object | Lineage available
Table | Table that a Job sources its data from; Table that a Job writes its data to
Job | Table that is the source or target of a job



Note: The collector identifies source and target tables based on AWS Glue annotations. AWS Glue adds these annotations automatically when you use the Visual ETL editor and Class script generation is enabled on the Script tab of the job.

If your job was converted from visual mode to script-only mode, you must add the annotations manually to your sources and targets. The following table lists the required annotations.

Table 4.

Type: Source
Annotation: ## @type: DataSource
Example:

## @type: DataSource
## @args: [format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_options(format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0")

Type: Target (Sink)
Annotation: ## @type: DataSink
Example:

## @type: DataSink
## @args: [connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1"]
## @return: DataSink1
## @inputs: [frame = Transform1]
DataSink1 = glueContext.write_dynamic_frame.from_options(frame = Transform1, connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1")
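Before running the collector against a script-only job, it can help to confirm that the script still carries both annotation types. The following is a minimal sketch in plain Python, not the collector's own parsing logic. The function names, the regular expression, and the shortened sample script are assumptions made for illustration; the sketch only looks for comment lines of the form "## @name: value" as shown in the examples above.

```python
import re

# Matches annotation comment lines such as "## @type: DataSource".
ANNOTATION = re.compile(r"^\s*##\s*@(\w+):\s*(.*)$")

# Shortened, hypothetical script; the real glueContext calls are replaced
# with placeholders so the sample runs without AWS Glue libraries.
SAMPLE_SCRIPT = """\
## @type: DataSource
## @return: DataSource0
## @inputs: []
DataSource0 = None  # placeholder for the real glueContext call

## @type: DataSink
## @return: DataSink1
## @inputs: [frame = Transform1]
DataSink1 = None  # placeholder for the real glueContext call
"""


def find_annotated_nodes(script_text):
    """Return one dict per annotation block, keyed by annotation name."""
    nodes = []
    current = None
    for line in script_text.splitlines():
        match = ANNOTATION.match(line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if key == "type":
            # Each "@type" line is treated as the start of a new block.
            current = {}
            nodes.append(current)
        if current is not None:
            current[key] = value
    return nodes


def check_lineage_annotations(script_text):
    """Report whether the script has a source and a sink annotation."""
    types = [node.get("type") for node in find_annotated_nodes(script_text)]
    return {
        "has_source": "DataSource" in types,
        "has_sink": "DataSink" in types,
    }


if __name__ == "__main__":
    print(check_lineage_annotations(SAMPLE_SCRIPT))
```

A script for which either flag is false is likely missing an annotation from the table above, and lineage for that job may be incomplete.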



AWS Glue version supported

  • The collector uses AWS SDK for Java 1.11. This is associated with AWS API version 2020-04-08. For more details, see this documentation.

Permissions

  • The user running the collector must have permissions to GetDatabases, GetTables, ListJobs, and GetJobcreds on AWS. For more information, see the Amazon API reference docs.
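One way to grant these permissions is an IAM policy attached to the collector's user. The sketch below builds such a policy document with the Python standard library. It is an illustration under assumptions, not an official policy: the function name is hypothetical, the wildcard resource is a placeholder that should be narrowed for real use, and only the three operations whose IAM action names are unambiguous (glue:GetDatabases, glue:GetTables, glue:ListJobs) are included by default. Add the remaining job-read permission listed above by passing its action name explicitly.

```python
import json

# Glue operations from the permissions list, written without the "glue:" prefix.
DEFAULT_OPERATIONS = ("GetDatabases", "GetTables", "ListJobs")


def build_glue_read_policy(operations=DEFAULT_OPERATIONS, resource="*"):
    """Build an IAM policy document that allows the given Glue operations.

    The "*" resource is an assumption for illustration; scope it down
    to specific catalog, database, table, and job ARNs where possible.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"glue:{name}" for name in operations],
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as JSON, ready to paste into the IAM console or CLI.
    print(json.dumps(build_glue_read_policy(), indent=2))
```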