About the Amazon S3 collector
Use this collector to harvest metadata about S3 buckets and objects directly from your Amazon S3 instance. To harvest tables and columns from Amazon S3 objects that have been cataloged in the AWS Glue Data Catalog, use the AWS Glue collector instead.
Important
The Amazon S3 collector can be run in the cloud or on-premises using Docker or JAR files.
Note
The latest version of the Collector is 2.252. To view the release notes for this version and all previous versions, please go here.
What is cataloged
The collector catalogs the following information.
Object | Information cataloged |
---|---|
Buckets | |
Objects | |
Relationships between objects
By default, the harvested metadata includes catalog pages for the following resource types. Each catalog page has a relationship to the other related resource types. If the metadata presentation for this data source has been customized with the help of the data.world Solutions team, you may see other resource pages and relationships.
Resource page | Relationship |
---|---|
S3 Bucket | Relationship to S3 Object |
S3 Object | Relationship to S3 Bucket |
Important note about maximum limits for S3 buckets
For an optimized experience, the system has set limits for harvesting metadata from the buckets in S3.
By default, the collector harvests at most 10,000 objects per bucket. If a bucket contains more objects than this limit, the bucket is skipped, no metadata is harvested for it, and a warning message is logged for the bucket.
To override the default limit, set the --max-resources parameter in your collector command. The maximum value for this parameter is 10,000,000 (ten million). If the total contents (buckets plus objects) exceed this limit, further buckets are not cataloged, and a warning message is logged for the bucket.
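The limit rules above can be sketched in Python to show how a set of buckets might be treated. This is an illustration only, not the collector's actual implementation: the function name plan_harvest and its input (a mapping of bucket names to object counts) are hypothetical, and the sketch assumes that the --max-resources value serves as both the per-bucket limit and the overall limit, which the text does not state explicitly.

```python
DEFAULT_LIMIT = 10_000      # default per-bucket object limit stated in the text
MAX_ALLOWED = 10_000_000    # largest value the text allows for --max-resources


def plan_harvest(bucket_object_counts, max_resources=DEFAULT_LIMIT):
    """Hypothetical model of the limits: return (harvested, skipped, warnings).

    bucket_object_counts maps a bucket name to its object count.
    Assumption: max_resources acts as both the per-bucket and the total limit.
    """
    if not 0 < max_resources <= MAX_ALLOWED:
        raise ValueError(f"max_resources must be between 1 and {MAX_ALLOWED}")

    harvested, skipped, warnings = [], [], []
    total = 0               # buckets plus objects cataloged so far
    limit_reached = False   # once True, no further buckets are cataloged

    for bucket, object_count in bucket_object_counts.items():
        if limit_reached:
            skipped.append(bucket)
            warnings.append(f"{bucket}: skipped, total limit already reached")
        elif object_count > max_resources:
            # Oversized bucket: skip it entirely and log a warning.
            skipped.append(bucket)
            warnings.append(f"{bucket}: {object_count} objects exceed the limit")
        elif total + 1 + object_count > max_resources:
            # Counting this bucket and its objects would exceed the total limit.
            limit_reached = True
            skipped.append(bucket)
            warnings.append(f"{bucket}: skipped, total limit would be exceeded")
        else:
            total += 1 + object_count
            harvested.append(bucket)

    return harvested, skipped, warnings
```

Under these assumptions, a bucket with 25,000 objects is skipped at the default limit, while raising the limit to 100,000 lets it be harvested along with the other buckets, as long as the running total stays within the limit.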
Authentication supported
The S3 collector authenticates to S3 using the default credential profiles file. The collector needs a user created in the AWS portal with read access to S3.
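As a rough illustration, the AWS SDKs and the AWS CLI conventionally read the default credential profiles file from ~/.aws/credentials, in an INI-style format with a [default] profile. The Python sketch below checks that such a file defines the default profile with both keys. The key values are placeholders, the helper name has_default_profile is hypothetical, and the assumption that the collector reads this exact location should be verified against your environment.

```python
import configparser

# Placeholder contents in the standard AWS shared credentials format.
# The values are fake and shown only to illustrate the file layout.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = AKIAEXAMPLEKEYID
aws_secret_access_key = exampleSecretAccessKey
"""


def has_default_profile(text: str) -> bool:
    """Return True if the text defines a [default] profile with both keys."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("default"):
        return False
    required = ("aws_access_key_id", "aws_secret_access_key")
    return all(parser.has_option("default", key) for key in required)


if __name__ == "__main__":
    print(has_default_profile(SAMPLE_CREDENTIALS))
```

To check a real file, read its contents (for example with pathlib.Path.home() / ".aws" / "credentials") and pass the text to the helper. The user behind those keys still needs read access to S3, as noted above.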