Running collectors using JAR files

Important

This topic applies only to on-premises collector runs.

Overview

The data.world catalog collector is a Java application packaged as a single Java Archive (JAR) file. It is distributed in a Docker image, a lightweight, standalone package that includes all the prerequisites and dependencies needed to run the JAR file, including the Java Runtime Environment. Most users run the collector in a Docker container created from this image, because Docker makes it easy to upgrade to the latest version of dwcc and supports common deployment scenarios such as container orchestration services (for example, Kubernetes or Amazon Elastic Container Service).

Some organizations may elect to run the collector directly from the JAR file rather than in a Docker container. This document describes the prerequisites for running the collector from the JAR file, how to set up the run command, and how to set up the JVM trust store when the collector connects to a source system that requires a custom SSL certificate.

Setting up prerequisites

Make sure that the machine on which you run the collector meets the following hardware and software requirements.

Table 1.

Item: Requirement

Hardware

  • RAM: 8 GB

  • CPU: 2 GHz processor

Software

  • Java Runtime Environment: OpenJDK 17 is supported and available here.

data.world specific objects

  • Dataset: You must have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector.



Getting the JAR file

Installing OpenJDK 17

  1. Follow the instructions listed here to install the Microsoft Build of OpenJDK, using the installation steps for your operating system.

  2. Run the following command to confirm that OpenJDK is installed.

    java -version
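The check in step 2 can also be scripted. The following is a minimal sketch, assuming a POSIX shell. The function names java_major and check_java are hypothetical, and treating any version other than 17 as a failure is an assumption based on the requirement listed in Table 1, not documented collector behavior.

```shell
# Print the major Java version found in the first line of `java -version` output.
# Example input: openjdk version "17.0.9" 2023-10-17 LTS
java_major() {
  v=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  # Legacy releases report 1.x (for example, 1.8.0_292 for Java 8).
  case "$v" in
    1.*) v=${v#1.} ;;
  esac
  # Keep only the part before the first '.', '_' or '-'.
  printf '%s\n' "${v%%[._-]*}"
}

# Report whether the java on PATH is version 17 (hypothetical helper, not called automatically).
check_java() {
  first_line=$(java -version 2>&1 | head -n 1)
  major=$(java_major "$first_line")
  if [ "$major" = "17" ]; then
    echo "Java 17 found"
  else
    echo "Expected Java 17, found: ${major:-unknown}" >&2
    return 1
  fi
}
```

Calling check_java after installation gives a quick pass or fail result. Note that java -version writes to standard error, which is why the sketch redirects it with 2>&1.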

Creating a directory for log files

  • On the machine where you will run the collector, create a directory to store the collector log files.
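As an illustration, the directory can be created from a POSIX shell as sketched below. The helper name make_log_dir and the example path are hypothetical; any directory that the user running the collector can write to should work.

```shell
# Create the log directory (and any missing parents), then print its absolute path.
# Usage: make_log_dir <directory>
make_log_dir() {
  mkdir -p "$1" && (cd "$1" && pwd)
}

# Example call with a hypothetical location:
#   LOG_DIR=$(make_log_dir "$HOME/dwcc/logs")
#   echo "Pass -Dlogdir=$LOG_DIR to the java command"
```

Printing the absolute path is a convenience, so that the same value can be reused later for the -Dlogdir option.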

Completing collector-specific prerequisite tasks

Generating and modifying the collector run command

  1. Follow the instructions on the collector pages and generate the command for running the collector.

  2. If you are using a YAML file for running the collectors, make the following change to the YAML config file.

    1. Replace output: /dwcc-output with output: <path to output directory>

  3. Make the following edits to the command to adapt it for running the collectors using JAR files.

    1. Replace the Docker-specific part of the command with the command that runs the JAR file:

      Remove docker run -it --rm and the Docker image reference, such as datadotworld/dwcc:2.163.

      Replace them with java -jar <path to jar file>

    2. Remove any Docker volume mount (--mount) options.

    3. When using the JAR file, the collector automatically picks up environment variables from the host machine. For the YAML file command, remove any -e parameters that set environment variables in Docker.

    4. Specify the directory for log files with a JVM option placed before -jar: -Dlogdir=<path to log directory>

    5. Specify the directory for your output files with a collector option placed after the JAR file path: -o=<path to output directory>

    6. If the collector requires any database (JDBC) drivers for the source system, download those drivers into a directory, and provide it to the collector using the following parameter:

      -Djdbc.driver-directory=<path to drivers directory>

      For instance:

      java -Djdbc.driver-directory=<path to drivers directory> -jar <path to jar file>

      When using PowerShell, enclose the whole option in double quotes: "-Djdbc.driver-directory=<path to drivers directory>".

      For instance:

      java "-Djdbc.driver-directory=<path to drivers directory>" -jar <path to jar file>

  4. The following sample Windows commands for Power BI demonstrate these changes:

    Table 2.

    Example command for YAML file

    With Docker:

    docker run -it --rm --mount type=bind,source=$Env:HOMEPATH\dwcc,target=/dwcc-output `
    --mount type=bind,source=$Env:HOMEPATH\dwcc,target=$Env:HOMEPATH\dwcc `
    -e DW_AUTH_TOKEN=${DW_AUTH_TOKEN} -e DW_AZURE_SECRET=${DW_AZURE_SECRET} `
    datadotworld/dwcc:2.163 `
    --config-file=/dwcc-output/config-power_bi.yml

    With JAR file:

    java -Dlogdir=<path to log directory> -jar <path to jar file> `
    --config-file=/<path to output directory>/config-power_bi.yml

    Example command without YAML file

    With Docker:

    docker run -it --rm --mount type=bind,source=$Env:HOMEPATH\dwcc,target=/dwcc-output `
    --mount type=bind,source=$Env:HOMEPATH\dwcc,target=/app/log datadotworld/dwcc:2.163 `
    catalog-powerbi --collector-metadata=<config> ...rest of the command follows...

    With JAR file:

    java -Dlogdir=<path to log directory> -jar <path to jar file> `
    catalog-powerbi --collector-metadata=config-id=<config> -o=<path to output directory> ...rest of the command follows...
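The edits in steps 2 and 3 can be sketched as two small helpers for a POSIX shell. This is an illustration under stated assumptions, not a data.world tool: the function names set_yaml_output and build_java_cmd and all paths in the example call are hypothetical. The sketch relies on how the java launcher parses its command line: options before -jar (such as -Dlogdir) are JVM options, and everything after the JAR file path is passed to the collector.

```shell
# Step 2: point the YAML "output:" setting at a host directory.
# Assumes the directory path contains no '|', '&' or '\' characters.
# Usage: set_yaml_output <config file> <output directory>
set_yaml_output() {
  sed "s|output: /dwcc-output|output: $2|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Step 3: print (not run) the java command so it can be reviewed first.
# Pass "" as the drivers directory when no JDBC drivers are needed.
# The printed line is for inspection only; it does not quote spaces in paths.
# Usage: build_java_cmd <jar> <log dir> <drivers dir or ""> <collector args...>
build_java_cmd() {
  jar=$1; logdir=$2; drivers=$3
  shift 3
  # Collector arguments follow the JAR file path.
  set -- -jar "$jar" "$@"
  # JVM options are prepended so that they appear before -jar.
  if [ -n "$drivers" ]; then
    set -- "-Djdbc.driver-directory=$drivers" "$@"
  fi
  set -- java "-Dlogdir=$logdir" "$@"
  printf '%s\n' "$*"
}

# Example call with hypothetical paths:
build_java_cmd /opt/dwcc/dwcc.jar /var/log/dwcc "" \
  --config-file=/opt/dwcc/output/config-power_bi.yml
```

The example call prints java -Dlogdir=/var/log/dwcc -jar /opt/dwcc/dwcc.jar --config-file=/opt/dwcc/output/config-power_bi.yml. Printing the command instead of executing it keeps the sketch safe to try on a machine where the collector is not yet installed.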



Handling custom certificates for source systems

  • If your source technology requires a custom SSL certificate to connect, follow the steps listed here to add the certificate to the JVM trust store.
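The linked steps are the authoritative procedure. As a rough sketch of what adding a certificate to the default JVM trust store typically involves, the helper below prints a command for keytool, the utility that ships with the JDK. The function name, alias, and certificate file name are hypothetical, and the sketch assumes the certificate has already been exported to a file. The command is printed rather than executed so that it can be reviewed and then run with sufficient privileges.

```shell
# Print (not run) a keytool command that imports a certificate into the
# JDK's default cacerts trust store. keytool prompts for the store password.
# Usage: print_import_cmd <alias> <certificate file>
print_import_cmd() {
  printf 'keytool -importcert -trustcacerts -cacerts -alias %s -file %s\n' "$1" "$2"
}

# Hypothetical alias and certificate file name.
print_import_cmd my-source-ca ./my-source-ca.pem
```

The -cacerts option targets the trust store of the JDK that provides keytool, so the command should be run with the same Java installation that runs the collector.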

Setting log levels to generate debug and trace level logs

  1. By default, the collector produces INFO, WARN, and ERROR log messages. data.world support may ask you to provide debug-level log messages for additional information about a collector run. To enable debug-level logging, add the following to the java command:

    -Dlog_level=DEBUG

    For instance:

    java -Dlog_level=DEBUG -jar <path to jar file>

  2. Trace-level logs may sometimes be required for further troubleshooting. To enable trace-level logging, add the following to the java command:

    -Dlog_level=TRACE

    For instance:

    java -Dlog_level=TRACE -jar <path to jar file>
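When the log level changes from run to run, a small wrapper can produce the option described above. This is a minimal sketch for a POSIX shell. The function name log_level_opt is hypothetical, and it accepts only DEBUG and TRACE because those are the two values this page documents; whether other values are valid is not assumed.

```shell
# Map a requested level to the -Dlog_level JVM option.
# An empty argument prints nothing, which leaves the collector's default logging in place.
# Usage: log_level_opt <DEBUG|TRACE|"">
log_level_opt() {
  case "$1" in
    DEBUG|TRACE) printf '%s\n' "-Dlog_level=$1" ;;
    "") ;;
    *) echo "unsupported log level: $1" >&2; return 1 ;;
  esac
}

# Example: prints -Dlog_level=DEBUG
log_level_opt DEBUG
```

The printed option belongs before -jar in the java command, alongside the other -D options.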