About the AWS Glue collector
Use this collector to harvest tables, databases, columns, and jobs from AWS Glue.
Important
The AWS Glue collector can be run in the cloud or on-premises using Docker or JAR files.
Note
The latest version of the collector is 2.248. For details on this version and all previous versions, see the release notes.
What is cataloged
The collector catalogs the following information.
Object | Information collected |
---|---|
Database | Name |
Table | Name |
Column | Name, Column index, Column type, Column size, JDBC type |
Job | Name |
Relationship between objects
By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page is linked to its related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.
Resource page | Relationship |
---|---|
Database | Table that is in the database |
Table | Database that contains the table |
Lineage for AWS Glue
Object | Lineage available |
---|---|
Table |
|
Job |
|
Note: The collector identifies source and target tables based on AWS Glue annotations. AWS Glue adds these annotations automatically when you use the Visual ETL editor and Classic script generation is enabled on the Script tab of the job.
If your job was converted from visual mode to script-only mode, you must manually add annotations to your sources and targets. The following table lists the required annotations.
Type | Annotation | Example |
---|---|---|
Source | ## @type: DataSource | ## @type: DataSource ## @args: [format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0"] ## @return: DataSource0 ## @inputs: [] DataSource0 = glueContext.create_dynamic_frame.from_options(format_options = {"quoteChar":"\"","escaper":"","withHeader":True,"separator":","}, connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0") |
Target (Sink) | ## @type: DataSink | ## @type: DataSink ## @args: [connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1"] ## @return: DataSink1 ## @inputs: [frame = Transform1] DataSink1 = glueContext.write_dynamic_frame.from_options(frame = Transform1, connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1") |
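Before running the collector against a script-only job, it can help to confirm that the script still carries the annotations listed above. The following is a minimal sketch, not the collector's own parsing logic: the function names `find_glue_annotations` and `has_source_and_sink` are hypothetical, and the sample script is an abbreviated version of the examples in the table, shown with one annotation per line as it would appear in a job script.

```python
import re

# Matches annotation lines such as "## @type: DataSource".
# Assumption: each annotation sits on its own line, as in generated Glue scripts.
ANNOTATION_RE = re.compile(r"^[ \t]*##[ \t]*@type:[ \t]*(\w+)[ \t]*$", re.MULTILINE)


def find_glue_annotations(script_text):
    """Return the @type annotation values found in the script, in order."""
    return ANNOTATION_RE.findall(script_text)


def has_source_and_sink(script_text):
    """Return True if the script has at least one DataSource and one DataSink annotation."""
    types = find_glue_annotations(script_text)
    return "DataSource" in types and "DataSink" in types


# Abbreviated sample based on the table above (arguments shortened for readability).
sample_script = '''
## @type: DataSource
## @args: [connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0"]
## @return: DataSource0
## @inputs: []
DataSource0 = glueContext.create_dynamic_frame.from_options(connection_type = "s3", format = "csv", connection_options = {"paths": ["s3://collectors-sap-hana"], "recurse":True}, transformation_ctx = "DataSource0")
## @type: DataSink
## @args: [connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1"]
## @return: DataSink1
## @inputs: [frame = Transform1]
DataSink1 = glueContext.write_dynamic_frame.from_options(frame = Transform1, connection_type = "s3", format = "csv", connection_options = {"path": "s3://adlsgen2ddwtestingstatefile", "partitionKeys": []}, transformation_ctx = "DataSink1")
'''

if __name__ == "__main__":
    print(find_glue_annotations(sample_script))
    print(has_source_and_sink(sample_script))
```

The check only looks for the `## @type:` lines. It does not validate the `@args`, `@return`, or `@inputs` annotations, so a passing result suggests the annotations are present but does not guarantee that lineage will be harvested.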
AWS Glue version supported
The collector uses AWS SDK for Java 1.11. This is associated with AWS API version 2020-04-08. For more details, see this documentation.