The data.world Collector 2.0

Introduction

Version 2.0 of the data.world Collector fundamentally changes how we catalog your data. Starting with Collector v2.X, v2 URIs are the official locator IDs for metadata resources. This is an intentional, structural breaking change: v2 URIs are not backward compatible with v1 URIs. This article covers what the change means for migration and how to determine your best course of action.

Context

We devised v2 URIs to improve how customers use the data.world Collector and how they store metadata. We also wanted to guarantee uniqueness and to accommodate MANTA, AWS Glue, and future use cases (such as third-party data quality tools) where data may come from many different sources and must be intelligently merged in data.world.

Technical details and caveats

The new v2 URIs use server+port+databaseName+organization+context as the five key factors that are hashed together. These factors allow for a deterministic and unique identifier. The “context” component is hardcoded to “default” for now, but it gives us an option for handling both future requirements and one-off customer situations where it is necessary to disambiguate data sources. For non-database systems, a similar but system-appropriate pattern will be applied for v2 URIs.
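
To illustrate why hashing a fixed set of factors yields a deterministic and unique identifier, here is a minimal Python sketch. It is not the Collector's actual implementation: the function name `v2_style_id`, the choice of SHA-256, and the length-prefixed joining of the factors are all assumptions made for the example. Only the five factors and the "default" context come from the description above.

```python
import hashlib


def v2_style_id(server, port, database_name, organization, context="default"):
    """Hash the five factors into one deterministic identifier.

    Hypothetical sketch: the real Collector's hash function and
    canonical form are not documented here, so SHA-256 and the
    length-prefixed encoding below are assumptions.
    """
    parts = [server, str(port), database_name, organization, context]
    # Prefix each part with its length so that different splits of the
    # same characters (e.g. "ab" + "c" vs "a" + "bc") cannot collide.
    canonical = "".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same inputs always give the same identifier.
first = v2_style_id("db.example.com", 5432, "sales", "acme")
second = v2_style_id("db.example.com", 5432, "sales", "acme")

# Changing only the context yields a different identifier, which is how
# a non-default context could disambiguate otherwise identical sources.
other = v2_style_id("db.example.com", 5432, "sales", "acme", context="replica")
```

Because every factor feeds the hash, two sources that differ in any one of them (for example, the same database name on a different server) receive different identifiers, while re-running against the same source reproduces the same one.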

New installations and upgrades

All new data.world Collector implementations should use v2 URIs. When you use data.world Collector 2.0 or newer, v2 URIs are used automatically. Existing customers who either made no changes at the user edit layer (that is, metadata edits made through the API or UI) or do not need to keep those changes (for example, the edits were only for testing purposes) should delete or overwrite the v1 catalog files in ddw-catalogs and then run a data.world Collector 2.X collector. Running the 2.X collector generates the latest catalog with v2 URIs.

Note

The user edit layer is not automatically linked to the new v2 URIs, so existing edits will effectively be lost.

We recommend that all customers shift to v2 URIs as soon as possible, because every important change made at the user edit layer under v1 URIs will later have to be migrated. If you have made significant changes and would still like to migrate now, please contact us for assistance.

If, however, you have already made important changes at the user edit layer, you can continue to use the v1 URIs. To do so, include the -use-v1-uris flag when you run a data.world Collector 2.X collector. Our development roadmap includes an API endpoint that will allow the user edit layer to be mass updated. We will also provide a SPARQL template to migrate the user edit layer from v1 URIs to v2 URIs. More details on this migration process will be provided as the API endpoint gets closer to being available.
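
The options described in this section can be summarized as a small decision helper. This is only a sketch of the guidance above, not a data.world tool: the function name `recommend_path` and the returned labels are hypothetical.

```python
def recommend_path(has_important_edits: bool, migrate_now: bool = False) -> str:
    """Return a label for the course of action described in this article.

    Hypothetical helper: the labels are illustrative names, not
    Collector options or API values.
    """
    if not has_important_edits:
        # No edits worth keeping: delete or overwrite the v1 catalog
        # files in ddw-catalogs and run a 2.X collector.
        return "replace-v1-catalog-and-run-2x"
    if migrate_now:
        # Significant edits that must move to v2 URIs right away.
        return "contact-data.world-for-assistance"
    # Important edits, migration deferred: keep v1 URIs via the
    # -use-v1-uris flag until the migration tooling is available.
    return "run-2x-with-use-v1-uris"


choice_no_edits = recommend_path(has_important_edits=False)
choice_wait = recommend_path(has_important_edits=True)
choice_now = recommend_path(has_important_edits=True, migrate_now=True)
```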