
About the Azure Data Lake Storage Gen2 collector

Use this collector to harvest metadata directly from your Azure Data Lake Storage Gen2 or Azure Blob Storage instance. It covers storage accounts, containers, and files.

Important

The Azure Data Lake Storage Gen2 collector can be run in the cloud or on-premises using Docker or JAR files.

Note

The latest version of the collector is 2.252. For details on this version and all previous versions, see the collector release notes.

Authentication supported

Authenticate to Azure Data Lake Storage Gen2 using a service principal.

What is cataloged

The collector catalogs the following information from Azure Data Lake Storage Gen2.

Table 1.

Object | Information collected
Storage Account | Name, Description, Last Modified, Resource Group name, Region Name, Creation Time, Subscription ID, Account Status, Account Kind, Access Control, Access Tier, Provisioning State, Tags
Container | Name, Description, Server, Last Modified, Metadata, Subscription ID, Entity Tag, Public Access, Access Control
Blob | Name, Description, File URL, File Path, Blob Type, Content Length, Creation Time, Last Modified, Metadata, Subscription ID, Entity Tag, Access Control



Relationships between objects

By default, the data.world catalog includes catalog pages for the resource types below. Each catalog page has relationships to the related resource types. The catalog presentation and relationships are fully configurable, so the table below lists only the default configuration.

Table 2.

Resource page | Relationship
Storage Account | Relationship to Containers contained within Storage Account
Container | Relationship to Blobs contained within Container; Relationship to Storage Account containing Container
Blob | Relationship to Container containing Blob



Important things to note about maximum resource limits

  • By default, the collector harvests metadata from Azure Data Lake Storage Gen2 for up to 10,000 objects in each Storage Account. If a Storage Account contains more than 10,000 objects, set the --max-resource-limit parameter to a value large enough to cover it. The maximum allowed value is 10 million. If the contents of a Storage Account exceed the configured limit, the collector skips that Storage Account and logs a warning message for it.
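
The limit rule above can be sketched as a small planning helper. This is an illustrative sketch only, not part of the collector: the function names are hypothetical, and it assumes the limit is inclusive (an account with exactly as many objects as the limit is still harvested). The only values taken from this page are the 10,000 default, the 10 million ceiling, and the --max-resource-limit parameter name.

```python
from typing import Optional

# Values stated in the documentation above.
DEFAULT_MAX_RESOURCE_LIMIT = 10_000
MAX_ALLOWED_RESOURCE_LIMIT = 10_000_000


def is_account_skipped(object_count: int,
                       limit: int = DEFAULT_MAX_RESOURCE_LIMIT) -> bool:
    """Model the documented behavior: an account is skipped when its
    object count exceeds the limit (inclusive boundary is an assumption)."""
    return object_count > limit


def choose_max_resource_limit(largest_object_count: int) -> Optional[int]:
    """Return the smallest --max-resource-limit value that covers the largest
    Storage Account, or None when the default is already sufficient."""
    if largest_object_count < 0:
        raise ValueError("object count cannot be negative")
    if largest_object_count <= DEFAULT_MAX_RESOURCE_LIMIT:
        return None
    if largest_object_count > MAX_ALLOWED_RESOURCE_LIMIT:
        raise ValueError(
            "account exceeds the 10 million ceiling and would be skipped "
            "regardless of the configured limit"
        )
    return largest_object_count


if __name__ == "__main__":
    # Hypothetical object counts for two Storage Accounts.
    for count in (8_500, 250_000):
        limit = choose_max_resource_limit(count)
        if limit is None:
            print(f"{count} objects: default limit is sufficient")
        else:
            print(f"{count} objects: set --max-resource-limit to {limit}")
```

In practice you would likely choose a value with some headroom above the current object count, so that normal growth in the Storage Account does not cause it to be skipped on a later run.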