About the AWS Glue collector

Use this collector to harvest tables, databases, columns, and jobs from AWS Glue.

Important

The AWS Glue collector can be run on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.292. To view the release notes for this version and all previous versions, go here.

What is cataloged

The collector catalogs the following information.

Table 1.

Object	Information collected
Glue data catalog database	Name
Glue data catalog table	Name, Average record size, Classification, Columns ordered, Columns quoted, Compressed, Compression type, Field delimiter, Last access time, Multi-dialect view, Record count, Registered with lake formation, Serialization library, Size key, Skip header line count, Storage input format, Storage output format, Stored as subdirectories, Table type, Type of data, Updated-by crawler
Column	Name, Column index, Column type, Column size
Job	Name, Arguments, AWS Role, Created at, Script (object), Last Run
Job run	Identifier, Start time, End time, Execution Elapsed time

Relationship between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page	Relationship
Database	Table that is in the database
Table	Database that contains the table; columns in the table
Column	Table that contains the column

Lineage for AWS Glue

Table 3.

Object	Lineage available
Table	S3 object that sources data to the table; table that sources data to the table via a job

Note: The collector identifies source and target tables based on AWS Glue annotations. AWS Glue adds these annotations automatically when you use the Visual ETL editor and Class script generation is enabled on the Script tab of the job.
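To make the annotation format concrete, the following Python sketch groups the `## @...` comment lines of a Glue script into one record per `## @type:` block and lists the names returned by sources and sinks. It only illustrates the annotation layout. The function name `parse_annotations` and the sample script are hypothetical, and this is not the collector's actual implementation.

```python
import re
from typing import Dict, List

# Matches annotation comment lines such as "## @type: DataSource".
ANNOTATION = re.compile(r"^\s*##\s*@(\w+):\s*(.*)$")


def parse_annotations(script: str) -> List[Dict[str, str]]:
    """Group annotation lines into one dict per "## @type:" block (hypothetical helper)."""
    blocks: List[Dict[str, str]] = []
    for line in script.splitlines():
        match = ANNOTATION.match(line)
        if not match:
            continue  # ordinary code or comment line
        key, value = match.group(1), match.group(2).strip()
        if key == "type":
            blocks.append({})  # a @type line starts a new block
        if blocks:
            blocks[-1][key] = value
    return blocks


# Shortened sample script for illustration only.
SAMPLE = '''
## @type: DataSource
## @args: [connection_type = "s3", format = "csv", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
## @type: DataSink
## @args: [connection_type = "s3", format = "csv", transformation_ctx = "DataSink1"]
## @return: DataSink1
## @inputs: [frame = Transform1]
'''

blocks = parse_annotations(SAMPLE)
sources = [b.get("return") for b in blocks if b.get("type") == "DataSource"]
sinks = [b.get("return") for b in blocks if b.get("type") == "DataSink"]
print(sources, sinks)
```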

If your job was converted from visual mode to script-only mode, you must manually add annotations to your sources and targets. The following table lists the required annotations.

Table 4.

Type	Annotation	Example
Source	## @type: DataSource	## @type: DataSource ## @args: [format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0"] ## @return: DataSource0 ## @inputs: [] DataSource0 = glueContext.create_dynamic_frame.from_options(format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0")
Target (Sink)	## @type: DataSink	## @type: DataSink ## @args: [connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1"] ## @return: DataSink1 ## @inputs: [frame = Transform1] DataSink1 = glueContext.write_dynamic_frame.from_options(frame = Transform1, connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1")
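Before running the collector against a script-only job, it can help to check that the script carries both annotation types. The sketch below is one possible way to do that with a regular expression. The helper name `missing_annotation_types` and the sample script are assumptions made for illustration, and the check only looks for the `## @type:` lines, not for complete annotation blocks.

```python
import re
from typing import List

# Annotation types listed in the table above.
REQUIRED_TYPES = ("DataSource", "DataSink")


def missing_annotation_types(script: str) -> List[str]:
    """Return the required annotation types that do not appear in the script."""
    found = set(re.findall(r"^\s*##\s*@type:\s*(\w+)", script, flags=re.MULTILINE))
    return [t for t in REQUIRED_TYPES if t not in found]


# Hypothetical script-only job: the source is annotated, the sink is not.
SCRIPT_ONLY_JOB = '''
## @type: DataSource
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", format = "csv")
DataSink1 = glueContext.write_dynamic_frame.from_options(frame = DataSource0, connection_type = "s3", format = "csv")
'''

missing = missing_annotation_types(SCRIPT_ONLY_JOB)
print(missing)
```

An empty list means both annotation types are present. Any name in the list points to an annotation that still needs to be added by hand.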

AWS Glue version supported

The collector uses AWS SDK for Java 1.11. This is associated with AWS API version 2020-04-08. For more details, see this documentation.

Authentication supported

The AWS Glue collector supports standard AWS client authentication methods. See the AWS documentation to learn more about AWS authentication methods.