About the AWS Glue collector (legacy version)

Use this collector to harvest tables, databases, columns, and jobs from AWS Glue.

Important

The AWS Glue collector can be run in the cloud or on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.294. For details on this version and all previous versions, see the release notes.

What is cataloged

The collector catalogs the following information.

Table 1.

Object	Information collected
Database	Name
Table	Name
Column	Name, Column index, Column type, Column size, JDBC type
Job	Name

Relationship between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page	Relationship
Database	Table that is in the database
Table	Database that contains the table

Lineage for AWS Glue

Table 3.

Object	Lineage available
Table	Table that a Job sources its data from; Table that a Job writes its data to
Job	Table that is the source or target of a job
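The lineage in Table 3 can be pictured as directed edges that run from each source table to the job, and from the job to each target table. The following is a minimal sketch of that idea only. The JobLineage class, the lineage_edges function, and the job and table names are hypothetical and are not part of the collector or of AWS Glue.

```python
from dataclasses import dataclass, field


@dataclass
class JobLineage:
    """Hypothetical container for the tables a single job reads and writes."""
    job: str
    sources: list = field(default_factory=list)
    targets: list = field(default_factory=list)


def lineage_edges(job_lineage):
    """Return (upstream, downstream) pairs: source -> job, then job -> target."""
    edges = [(table, job_lineage.job) for table in job_lineage.sources]
    edges += [(job_lineage.job, table) for table in job_lineage.targets]
    return edges


# Illustrative names only; they do not come from a real catalog.
example = JobLineage(job="example_job", sources=["raw_orders"], targets=["clean_orders"])

if __name__ == "__main__":
    for upstream, downstream in lineage_edges(example):
        print(upstream, "->", downstream)
```

Read this way, a table's lineage is the set of edges that touch it, and a job's lineage is the set of tables on either side of it.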

Note: The collector identifies source and target tables based on AWS Glue annotations. AWS Glue adds these annotations automatically when you build the job in the Visual ETL editor and classic script generation is enabled on the Script tab of the job.

If your job was converted from visual mode to script-only mode, you must add the annotations to your sources and targets manually. The following table lists the required annotations.

Table 4.

Type	Annotation
Source	## @type: DataSource
Target (Sink)	## @type: DataSink

Source example:

## @type: DataSource
## @args: [format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_options(format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0")

Target (Sink) example:

## @type: DataSink
## @args: [connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1"]
## @return: DataSink1
## @inputs: [frame = Transform1]
DataSink1 = glueContext.write_dynamic_frame.from_options(frame = Transform1, connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1")
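Before running the collector against a script-only job, it can help to confirm that the script still carries the annotations listed in Table 4. The following is a minimal sketch of such a check. It is not the collector's own parsing logic, and the function names count_annotations and missing_annotations are hypothetical. The sample script is a shortened copy of the source example with format_options omitted, and it has no sink on purpose.

```python
import re

# Matches annotation lines such as "## @type: DataSource".
TYPE_ANNOTATION = re.compile(r"^\s*##\s*@type:\s*(\w+)\s*$")


def count_annotations(script_text):
    """Count DataSource and DataSink type annotations in a Glue job script."""
    counts = {"DataSource": 0, "DataSink": 0}
    for line in script_text.splitlines():
        match = TYPE_ANNOTATION.match(line)
        if match and match.group(1) in counts:
            counts[match.group(1)] += 1
    return counts


def missing_annotations(script_text):
    """Return the annotation types that do not appear in the script at all."""
    return [name for name, n in count_annotations(script_text).items() if n == 0]


# Shortened sample: a source annotation is present, a sink annotation is not.
SAMPLE_SCRIPT = '''
## @type: DataSource
## @args: [connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0")
'''

if __name__ == "__main__":
    print(count_annotations(SAMPLE_SCRIPT))
    print(missing_annotations(SAMPLE_SCRIPT))
```

A result that lists DataSink or DataSource as missing suggests the job needs the corresponding annotation added by hand before lineage can be identified.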

AWS Glue version supported

The collector uses AWS SDK for Java 1.11. This is associated with AWS API version 2020-04-08. For more details, see this documentation.

Authentication supported

The AWS Glue collector supports standard AWS client authentication methods. See the AWS documentation to learn more about AWS authentication methods.
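As a rough illustration of what standard AWS client authentication can look like in practice, the sketch below reports which common credential sources appear to be configured in an environment. It is a simplified heuristic and does not reproduce the AWS SDK's credential resolution. The function name detect_credential_hints is hypothetical, and which of these methods apply to a given collector deployment is an assumption to verify against the AWS documentation.

```python
import os


def detect_credential_hints(env, home_dir):
    """Return labels for credential sources that look configured.

    env is a mapping of environment variables; home_dir is the user's home
    directory. This only checks for presence; it does not validate anything.
    """
    hints = []
    # Static or temporary keys supplied through environment variables.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        hints.append("environment variables")
    # Web identity federation, for example from a Kubernetes service account.
    if env.get("AWS_WEB_IDENTITY_TOKEN_FILE") and env.get("AWS_ROLE_ARN"):
        hints.append("web identity token")
    # Shared credentials file in its default location.
    if os.path.isfile(os.path.join(home_dir, ".aws", "credentials")):
        hints.append("shared credentials file")
    # Credentials served to a container by its runtime.
    if env.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI") or env.get(
        "AWS_CONTAINER_CREDENTIALS_FULL_URI"
    ):
        hints.append("container credentials")
    return hints


if __name__ == "__main__":
    print(detect_credential_hints(os.environ, os.path.expanduser("~")))
```

An empty result does not mean authentication will fail, because sources such as an EC2 instance profile are not visible to a check like this one.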
