About the Amazon S3 collector

Use this collector to harvest metadata about S3 buckets and objects directly from your Amazon S3 instance. If you want to harvest tables and columns from Amazon S3 objects that have been cataloged in the AWS Glue Data Catalog, use the AWS Glue collector instead.

Important

The Amazon S3 collector can be run in the Cloud or on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.200. For details on this version and all previous versions, see the collector release notes.

What is cataloged

The collector catalogs the following information.

Table 1.

Object

Information cataloged

Buckets

  • Amazon Resource Name (ARN)

  • Region

  • Name

  • Version state

  • Creation date

  • ACL owner ID

  • ACL grantee ID

  • ACL grant permission

Objects

  • Key

  • Amazon Resource Name (ARN)

  • Region

  • Size

  • Last modified date

  • ACL owner ID

  • ACL grantee ID

  • ACL grant permission

  • Metadata



Relationships between objects

By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.

Table 2.

Resource page

Relationship

S3 Bucket

Relationship to S3 Object

S3 Object

Relationship to S3 Bucket



Important note about maximum limits for S3 buckets

For an optimized experience, the collector limits how much metadata it harvests from S3 buckets.

  • By default, the collector harvests at most 10,000 objects per bucket. If a bucket contains more objects than this limit, the bucket is skipped, no metadata is harvested for it, and a warning message is logged for the bucket.

  • To override the default limit, set the --max-resources parameter in your collector command. The maximum value for this parameter is 10,000,000 (ten million). If the total contents (buckets plus objects) exceed this limit, further buckets are not cataloged, and a warning message is logged for the bucket.
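
The limit rules above can be sketched in a few lines of Python. This is only one reading of the documented behavior, not the collector's actual implementation: the function name plan_harvest and the bucket_sizes input are hypothetical, and the sketch assumes that each bucket counts as one resource toward --max-resources, that setting --max-resources replaces the per-bucket check, and that buckets are processed in the order given.

```python
DEFAULT_OBJECTS_PER_BUCKET = 10_000   # default per-bucket limit from the text above
MAX_RESOURCES_CEILING = 10_000_000    # largest value allowed for --max-resources


def plan_harvest(bucket_sizes, max_resources=None):
    """Return (harvested, skipped) bucket names under the assumed limit rules.

    bucket_sizes: dict of bucket name -> object count (hypothetical input).
    max_resources: the --max-resources value, or None to use the default limit.
    """
    if max_resources is not None and not 1 <= max_resources <= MAX_RESOURCES_CEILING:
        raise ValueError("max_resources must be between 1 and 10,000,000")

    harvested, skipped = [], []
    total = 0              # running count of buckets plus objects (assumption)
    limit_reached = False  # once set, all further buckets are skipped

    for name, count in bucket_sizes.items():
        if max_resources is None:
            # Default behavior: each bucket is judged on its own object count.
            if count > DEFAULT_OBJECTS_PER_BUCKET:
                skipped.append(name)
            else:
                harvested.append(name)
            continue

        # Override behavior: a single budget shared by all buckets and objects.
        needed = 1 + count  # the bucket itself plus its objects
        if limit_reached or total + needed > max_resources:
            limit_reached = True
            skipped.append(name)
        else:
            total += needed
            harvested.append(name)

    return harvested, skipped
```

With the default limit, a bucket holding 10,001 objects is skipped while its neighbors are still harvested. With max_resources set, the first bucket that would push the running total past the limit, and every bucket after it, is left out.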

Authentication supported

  • The S3 collector authenticates to S3 using the default credential profiles file. The collector requires a user, created in the AWS portal, that has read access to S3.
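
Before running the collector, it can help to confirm that the credential profiles file contains the profile you expect. The following is a minimal sketch using only the Python standard library. It assumes the file is at the usual AWS location (~/.aws/credentials) and that the profile uses static access keys. The helper name profile_has_keys is hypothetical and is not part of the collector, and the check does not verify that the user actually has read access to S3.

```python
import configparser
from pathlib import Path


def profile_has_keys(credentials_path, profile="default"):
    """Return True if the named profile defines both static access-key fields."""
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently ignores files that do not exist,
    # so a missing file simply yields no sections.
    parser.read(credentials_path)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    # Assumed default location of the AWS shared credentials file.
    path = Path.home() / ".aws" / "credentials"
    print(f"default profile usable: {profile_has_keys(path)}")
```

A result of False means only that no static keys were found for that profile in the file. Profiles that rely on other credential sources are outside the scope of this sketch.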