
About the AWS Glue collector

Use this collector to harvest tables, databases, columns, and jobs from AWS Glue.

Important

The AWS Glue collector can be run on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.289. For details on this version and all previous versions, see the release notes.

What is cataloged

The collector catalogs the following information.

Table 1.

| Object | Information collected |
|---|---|
| Glue data catalog database | Name |
| Glue data catalog table | Name, Average record size, Classification, Columns ordered, Columns quoted, Compressed, Compression type, Field delimiter, Last access time, Multi-dialect view, Record count, Registered with lake formation, Serialization library, Size key, Skip header line count, Storage input format, Storage output format, Stored as subdirectories, Table type, Type of data, Updated-by crawler |
| Column | Name, Column index, Column type, Column size |
| Job | Name, Arguments, AWS Role, Created at, Script (object), Last Run |
| Job run | Identifier, Start time, End time, Execution Elapsed time |



Relationship between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page

Relationship

Database

  • Tables in the database

Table

  • Database that contains the table

  • Columns in the table

Column

  • Table that contains the column
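
The Database, Table, and Column relationships above mirror the nesting of the AWS Glue Data Catalog itself. As a rough illustration only (this is not the collector's implementation, which is built on the AWS SDK for Java), the following Python sketch walks that hierarchy. It assumes a boto3-style Glue client is passed in; the function name `walk_catalog` is hypothetical.

```python
from typing import Iterator, Tuple


def walk_catalog(glue_client) -> Iterator[Tuple[str, str, str]]:
    """Yield (database, table, column) name triples from a Glue Data Catalog.

    Assumption: glue_client behaves like boto3.client("glue"), i.e. it offers
    the "get_databases" and "get_tables" paginators.
    """
    db_pages = glue_client.get_paginator("get_databases").paginate()
    for db_page in db_pages:
        for database in db_page["DatabaseList"]:
            # Each database contains tables.
            table_pages = glue_client.get_paginator("get_tables").paginate(
                DatabaseName=database["Name"]
            )
            for table_page in table_pages:
                for table in table_page["TableList"]:
                    # Each table contains columns; some tables have no
                    # StorageDescriptor, so fall back to an empty list.
                    descriptor = table.get("StorageDescriptor", {})
                    for column in descriptor.get("Columns", []):
                        yield database["Name"], table["Name"], column["Name"]
```

With boto3 installed and AWS credentials configured, `for row in walk_catalog(boto3.client("glue")): print(row)` would print one line per column. Each triple corresponds to one "table that contains the column" and one "database that contains the table" relationship.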



Lineage for AWS Glue

Table 3.

Object

Lineage available

Table

  • S3 Object that sources data to the table.

  • Table that sources data to the table via a job.



Note: The collector identifies source and target tables based on AWS Glue annotations. AWS Glue adds these annotations automatically when you build the job in the Visual ETL editor and Classic script generation is enabled on the Script tab of the job.

If your job was converted from visual mode to script-only mode, you must manually add annotations to your sources and targets. The following table lists the required annotations.

Table 4.

Type

Annotation

Example

Source

## @type: DataSource

## @type: DataSource

## @args: [format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0"]

## @return: DataSource0

## @inputs: []

DataSource0 = glueContext.create_dynamic_frame.from_options(format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0")

Target (Sink)

## @type: DataSink

## @type: DataSink

## @args: [connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1"]

## @return: DataSink1

## @inputs: [frame = Transform1]

DataSink1 = glueContext.write_dynamic_frame.from_options(frame = Transform1, connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1")
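
Before running the collector against a script-only job, it can help to confirm that the script carries both annotation types listed in Table 4. The following is a minimal sketch of such a check and not part of the collector. The function names and the `SAMPLE` script (abbreviated from the examples above) are hypothetical, and the parsing rules are an assumption about the `## @key: value` comment layout shown in the table.

```python
import re
from typing import Dict, List

# Matches annotation lines such as "## @type: DataSource".
ANNOTATION = re.compile(r"^##\s*@(\w+):\s*(.*)$")


def parse_annotation_blocks(script: str) -> List[Dict[str, str]]:
    """Group consecutive '## @key: value' lines into one dict per block."""
    blocks: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw_line in script.splitlines():
        line = raw_line.strip()
        match = ANNOTATION.match(line)
        if match:
            key, value = match.groups()
            # A new @type line starts a new annotation block.
            if key == "type" and current:
                blocks.append(current)
                current = {}
            current[key] = value.strip()
        elif line and current:
            # The first non-blank, non-annotation line ends the block.
            blocks.append(current)
            current = {}
    if current:
        blocks.append(current)
    return blocks


def has_lineage_annotations(script: str) -> bool:
    """Return True if the script has a DataSource and a DataSink annotation."""
    types = {block.get("type") for block in parse_annotation_blocks(script)}
    return {"DataSource", "DataSink"} <= types


# Hypothetical script, shortened from the Table 4 examples.
SAMPLE = '''
## @type: DataSource
## @args: [connection_type = "s3", format = "csv", transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"]}, transformation_ctx = "DataSource0")
## @type: DataSink
## @args: [connection_type = "s3", format = "csv", transformation_ctx = "DataSink1"]
## @return: DataSink1
## @inputs: [frame = Transform1]
DataSink1 = glueContext.write_dynamic_frame.from_options(frame = Transform1, connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile"}, transformation_ctx = "DataSink1")
'''

if __name__ == "__main__":
    print(has_lineage_annotations(SAMPLE))
```

The check only confirms that the annotation comments exist. It does not verify that their `@args` match the actual call that follows, so a passing result is a necessary condition for lineage, not a guarantee.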



AWS Glue version supported

  • The collector uses AWS SDK for Java 1.11. This is associated with AWS API version 2020-04-08. For more details, see this documentation.

Authentication supported

  • The AWS Glue collector supports standard AWS client authentication methods. See the AWS documentation to learn more about AWS authentication methods.