Community docs

Reltio and DWCC

Introduction

The steps to catalog the metadata from a data source using DWCC are as follows:

  1. Read over the DWCC FAQ to familiarize yourself with the data.world DWCC collector.

  2. Verify that your installation meets the prerequisites below.

  3. Validate that you have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector.

  4. Write the collector command and run it.

  5. Upload the resulting file to your ddw-catalogs dataset (or other dataset as configured for your organization).

Prerequisites
  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2G memory, and a 2Ghz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/.

  • The user defined to run DWCC must have read access to all resources being cataloged.

  • The computer running DWCC needs a Java Runtime Environment (JRE), version 11 or higher. (OpenJDK available here)

Permissions

We develop using admin credentials.

Collected from Reltio

The Reltio metadata collector runs against a static configuration file which contains the following:

  • Classes (what Reltio calls “entity types”) and properties (what Reltio calls “attributes”). Essentially, a type structure that expresses key business objects in your domain.

  • Association classes (what Reltio calls “relation types”) that are really just classes that express relationships between other entities. For example, Employee entity has a relationship to Company entity, and the relationship has start date and end date attributes. Employee entity has a relationship to Company entity, and the relationship has start date and end date attributes.

  • Inheritance relationships between entity types.

  • Special objects called “survivorship groups” that express how conflicting data for a given attribute is resolved (i.e., if two sources of data conflict, take the one that’s most recent. or take the value from System A over System B.) Basically, a directed graph of objects that is associated with an attribute on an entity.

  • Roles that are groupings of Entities to express transient typing. For instance, a Person and an Organization can both be in the role of Customer. But a given person can be a customer today and an employee tomorrow.

  • Attributes (which can be defined apart from entities, allowing them to be reused) that have properties like type, name, label, hidden, faceted, required, searchable.

  • Data cleansers that can transform the values of attributes from one form to another. A common example is a cleanser for street address that translates “ST”, “St.”, and “Str” to “Street”. An Entity Type is configured with zero to many cleansers that can be arranged in “chains”.

  • Matchers for entities that determine if two entities are the same. So two person entities might be the same if their SSN attributes are equal or if their cleansed names are fuzzy-matched within a certain threshold.

  • Survivorship strategies that express how to “merge” data from two entities that have been judged to be the same. Basically: which attribute values from the merged entities “win” as the value of the attribute for the final, combined entity.

  • Graph Types that appear to define specific hierarchies of Relationships with distinct semantic meaning.

  • Interaction Types that define events in which member Entity Types participate. Each interaction/event can have its own attributes.

This version of the Reltio collector focuses on Entity Types, Attributes, Roles, and Relationships. Additionally:

  • We identify the matchers and cleansers assigned to each class by name.

  • The collector works on a static configuration

  • We will only identify the first cleanser in a chain (if a multi-level chain).

  • We can identify the survivorship strategy for each attribute (property) by name.

  • Currently we do not produce glossary terms.

Writing the DWCC command

The easiest way to create your DWCC command is to:

  1. Copy the example command below

  2. Edit it for your organization and data source

  3. Open a terminal window in any Unix environment that uses a Bash shell and paste your command into it.

The example command includes the minimal parameters required to run the collector (described below)--your instance may require more. A description of all the available parameters is at the end of this article. Edit the command by adding any other parameters you wish to use, and by replacing the values for all your parameters with your information as appropriate. Parameters required by the collector are in bold.

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of DWCC you want to use (e.g., datadotworld/dwcc:2.80).

Basic parameters

These are the basic parameters needed to run the collector. A list of all parameter options is at the end of this article. Where available, either short (e.g., -a) or long (--account) forms can be used.

Docker and DWCC

Detailed information about the Docker portion of the command can be found here. When you run the command, run will attempt to find the image locally, and if it doesn't find it, it will go to Dockerhub and download it automatically:

dwcc_and_cli.png

Collector runtime and troubleshooting

The catalog collector may run in several seconds to many minutes depending on the size and complexity of the system being crawled. If the catalog collector runs without issues, you should see no output on the terminal, but a new file that matching *.dwec.ttl should be in the directory you specified for the output. If there was an issue connecting or running the catalog collector, there will be either a stack trace or a *.log file. Both of those can be sent to support to investigate if the errors are not clear. A list of common issues and problems encountered when running the collectors is available here.

Upload the .ttl file generated from running the collector

When the DWCC collector runs successfully, it creates a .ttl file in the directory you specified as the dwcc-output directory. The automatically-generated file name is databaseName.catalogName.dwec.ttl. You can rename the file or leave the default, and then upload it to your ddw-catalogs dataset (or wherever you store your catalogs.

Caution

If there is already a .ttl catalog file with the same name in your ddw-catalogs dataset, when you add the new one it will overwrite the existing one.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.