About the Azure Data Lake Storage Gen2 collector
Use this collector to harvest metadata directly for storage accounts, containers, and files from your Azure Data Lake Storage Gen2 instance or Azure Blob Storage instance.
Important
The Azure Data Lake Storage Gen2 collector can run in the cloud or on-premises, using Docker or JAR files.
Note
The latest version of the collector is 2.248. For details on this version and all previous versions, see the collector release notes.
Authentication supported
Authenticate to Azure Data Lake Storage Gen2 using a service principal.
What is cataloged
The collector catalogs the following information from Azure Data Lake Storage Gen2.
Object | Information collected |
---|---|
Storage Account | Name, Description, Last Modified, Resource Group name, Region Name, Creation Time, Subscription ID, Account Status, Account Kind, Access Control, Access Tier, Provisioning State, Tags |
Container | Name, Description, Server, Last Modified, Metadata, Subscription ID, Entity Tag, Public Access, Access Control |
Blob | Name, Description, File URL, File Path, Blob Type, Content Length, Creation Time, Last Modified, Metadata, Subscription ID, Entity Tag, Access Control |
Relationships between objects
By default, the data.world catalog includes catalog pages for the resource types below. Each catalog page has relationships to other related resource types. The catalog presentation and relationships are fully configurable, so the table below lists only the default configuration.
Resource page | Relationship |
---|---|
Storage Account | |
Container | |
Blob | |
Important things to note about maximum resource limits
By default, the collector harvests metadata for up to 10,000 objects in each Storage Account. If a Storage Account in your Azure Data Lake Storage Gen2 instance contains more than 10,000 objects, set the --max-resource-limit parameter to a value large enough to cover it. The maximum value is 10 million. If the contents of a Storage Account exceed the configured limit, the collector skips that Storage Account and logs a warning message for it.
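The limit rule can be easier to reason about as a small sketch. The Python below is only an illustration of the behavior described in this section, not the collector's actual implementation. The `StorageAccount` class and the `plan_harvest` function are hypothetical names introduced for this example, and treating an account with exactly the limit as within bounds is an assumption based on the "up to 10,000 objects" wording.

```python
from dataclasses import dataclass

DEFAULT_MAX_RESOURCE_LIMIT = 10_000       # default stated in this section
MAX_ALLOWED_RESOURCE_LIMIT = 10_000_000   # upper bound stated in this section


@dataclass
class StorageAccount:
    """Hypothetical stand-in for a storage account and its object count."""
    name: str
    object_count: int


def plan_harvest(accounts, max_resource_limit=DEFAULT_MAX_RESOURCE_LIMIT):
    """Return (harvested, skipped) account names under the limit rule.

    Illustrative model only: an account is skipped when its object count
    exceeds the limit (assumption: a count equal to the limit is allowed).
    """
    if max_resource_limit > MAX_ALLOWED_RESOURCE_LIMIT:
        raise ValueError("limit cannot exceed 10 million")

    harvested, skipped = [], []
    for account in accounts:
        if account.object_count > max_resource_limit:
            # The real collector logs a warning here and moves on.
            skipped.append(account.name)
        else:
            harvested.append(account.name)
    return harvested, skipped


accounts = [
    StorageAccount("small", 10_000),
    StorageAccount("big", 10_001),
]
print(plan_harvest(accounts))
print(plan_harvest(accounts, max_resource_limit=50_000))
```

With the default limit, the sketch skips only the account holding more than 10,000 objects. Raising the limit to 50,000 brings both accounts within bounds, which mirrors what setting --max-resource-limit is meant to achieve.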