Running collectors using JAR files
Important
This topic only applies to on-premise collector runs.
Overview
The data.world catalog collector (dwcc) is a Java application packaged as a single Java Archive (JAR) file. It is distributed in a Docker image, a lightweight, standalone package that includes everything needed to run the JAR file, including the Java Runtime Environment. Most users run the collector in a Docker container created from this image, because Docker makes it easy to upgrade to the latest version of dwcc and supports common deployment scenarios such as container orchestration services (for example, Kubernetes or Amazon Elastic Container Service).
Some organizations may prefer to run the collector directly from the JAR file rather than in a Docker container. This document describes the prerequisites for running the collector from the JAR file, how to set up the run command, and how to set up the JVM trust store when the collector connects to a source system that requires a custom SSL certificate.
Setting up prerequisites
Make sure that the machine on which you run the collector meets the following hardware and software requirements.
Item | Requirement |
---|---|
Hardware | |
RAM | 8 GB |
CPU | 2 GHz processor |
Software | |
Java Runtime Environment | OpenJDK 17 is supported and available here. |
data.world specific objects | |
Dataset | You must have a ddw-catalogs (or other) dataset set up to hold your catalog files when you are done running the collector. |
Getting the JAR file
From Release 2.182 onwards, the release notes for each collector version include the link to download the JAR file of that specific version.
Note
If you need the JAR file for an older version, contact the data.world support team.
Installing OpenJDK 17
Follow the instructions listed here to install the Microsoft Build of OpenJDK, using the installation steps for your operating system.
Run the following command to confirm that OpenJDK is installed.
java -version
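As a rough illustration, the check above can be scripted so that it also confirms the major version is 17. This is a minimal sketch for a POSIX shell on Linux or macOS. The helper name `java_major_version` is hypothetical, and the parsing assumes the usual first line of OpenJDK output, for example `openjdk version "17.0.9" 2023-10-17 LTS`.

```shell
# Extract the major version number from the text printed by `java -version`.
# Assumes the first line looks like: openjdk version "17.0.9" 2023-10-17 LTS
java_major_version() {
  printf '%s\n' "$1" | sed -n '1s/.*version "\([0-9][0-9]*\).*/\1/p'
}

# `java -version` writes to stderr, so redirect it to stdout before capturing.
if command -v java >/dev/null 2>&1; then
  major=$(java_major_version "$(java -version 2>&1)")
  if [ "$major" = "17" ]; then
    echo "Java 17 detected"
  else
    echo "Expected Java 17, found major version: ${major:-unknown}" >&2
  fi
else
  echo "java not found on PATH" >&2
fi
```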
Creating a directory for log files
On the machine where you run the collector, create a directory to store the collector log files.
Completing collector-specific prerequisite tasks
Follow the instructions on the collector pages to complete the collector-specific prerequisite tasks.
Generating and modifying the collector run command
Follow the instructions on the collector pages to generate the command for running the collector.
If you are using a YAML file for running the collectors, make the following change to the YAML config file.
Replace output: /dwcc-output with output: <path to output directory>
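If you prefer to script this change, the following sketch shows one possible way to do it with sed in a POSIX shell. The function name `set_output_dir` and the example directory are hypothetical. The sketch assumes the config file contains a line of the form `output: /dwcc-output` and that the new directory contains no `|` or `&` characters, which sed would treat specially here.

```shell
# Rewrite the `output:` entry of a collector YAML config file in place.
# Usage: set_output_dir <config file> <output directory>
set_output_dir() {
  config_file=$1
  output_dir=$2
  # '|' is used as the sed delimiter because directory paths contain '/'.
  # Only a line whose value is exactly /dwcc-output is changed; indentation is kept.
  sed "s|^\([[:space:]]*output:\)[[:space:]]*/dwcc-output[[:space:]]*\$|\1 ${output_dir}|" \
    "$config_file" > "${config_file}.tmp" && mv "${config_file}.tmp" "$config_file"
}
```

For example, `set_output_dir config-power_bi.yml /opt/dwcc/output` would change the output entry to `output: /opt/dwcc/output`, where `/opt/dwcc/output` is a placeholder path.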
Make the following edits to the command to adapt it for running the collectors using JAR files.
Replace the Docker-specific part of the command with the java command for the JAR file:
Remove docker run -it --rm and the Docker image reference, such as datadotworld/dwcc:2.163.
Replace them with java -jar <path to jar file>.
Remove any Docker volume mount --mount options.
When using the JAR file, the collector automatically picks up the environment variables from the host machine. For the YAML file command, remove any -e parameters that are used to set the environment variables in Docker.
Specify the directory for log files: -Dlogdir=<path to log directory>
Specify the directory for your output files: -o=<path to output directory>
If the collector requires any database (JDBC) drivers for the source system, download those drivers into a directory, and provide it to the collector using the following parameter:
-Djdbc.driver-directory=<path to drivers directory>
For instance:
java -Djdbc.driver-directory=<path to drivers directory> -jar <path to jar file>
When using PowerShell, you must enclose the driver parameter in double quotes: "-Djdbc.driver-directory=<your driver directory>".
For instance:
java "-Djdbc.driver-directory=<path to drivers directory>" -jar <path to jar file>
Let us edit a sample Windows command for Power BI to demonstrate these changes:
Table 2. Sample Power BI commands
Command type | With Docker | With JAR file |
---|---|---|
Example command for YAML file | docker run -it --rm --mount type=bind,source=$Env:HOMEPATH\dwcc,target=/dwcc-output --mount type=bind,source=$Env:HOMEPATH\dwcc,target=$Env:HOMEPATH\dwcc -e DW_AUTH_TOKEN=${DW_AUTH_TOKEN} -e DW_AZURE_SECRET=${DW_AZURE_SECRET} datadotworld/dwcc:2.163 --config-file=/dwcc-output/config-power_bi.yml | java -Dlogdir=<path to log directory> -jar <path to jar file> --config-file=/<path to output directory>/config-power_bi.yml |
Example command without YAML file | docker run -it --rm --mount type=bind,source=$Env:HOMEPATH\dwcc,target=/dwcc-output --mount type=bind,source=$Env:HOMEPATH\dwcc,target=/app/log datadotworld/dwcc:2.163 catalog-powerbi --collector-metadata=<config> ...rest of the command follows... | java -Dlogdir=<path to log directory> -jar <path to jar file> catalog-powerbi --collector-metadata=config-id=<config> -o=<path to output directory> ...rest of the command follows... |
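The edits described in this section can also be collected into a small wrapper. The following is a minimal sketch for a POSIX shell on Linux or macOS, not an official script. The function name `run_collector`, the `DRY_RUN` variable, and all paths in the usage example are hypothetical. It places the JVM options (-Dlogdir and, if needed, -Djdbc.driver-directory) before -jar, and passes the collector's own arguments after the JAR path.

```shell
# Assemble and run (or print) the java command for a collector run.
# Usage: run_collector <jar path> <log dir> <driver dir or ""> [collector args...]
run_collector() {
  jar_path=$1
  log_dir=$2
  driver_dir=$3
  shift 3   # what remains are the collector's own arguments

  # JVM options (-D...) go before -jar; collector arguments go after the JAR path.
  set -- "-Dlogdir=$log_dir" -jar "$jar_path" "$@"
  if [ -n "$driver_dir" ]; then
    set -- "-Djdbc.driver-directory=$driver_dir" "$@"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it (display only; arguments are not re-quoted).
    printf 'java'
    printf ' %s' "$@"
    printf '\n'
  else
    java "$@"
  fi
}
```

For example, `DRY_RUN=1 run_collector /opt/dwcc/dwcc.jar /var/log/dwcc "" --config-file=/opt/dwcc/config-power_bi.yml` prints the command without running it. Because the collector reads environment variables from the host, values such as DW_AUTH_TOKEN need to be exported in the shell before a real run.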
Handling custom certificates for source systems
If your source technology requires a custom SSL certificate to connect, follow the steps listed here to add the certificate to the JVM trust store.
Setting log levels to generate debug and trace level logs
By default, the collector produces INFO, WARN, and ERROR log messages when it runs. data.world support may ask you to obtain debug-level log messages for additional information about a collector run. To enable debug-level logging, add the following to the java command:
-Dlog_level=DEBUG
For instance:
java -Dlog_level=DEBUG -jar [path]
Trace-level logs may sometimes be required for further troubleshooting. To enable trace-level logging, add the following to the java command:
-Dlog_level=TRACE
For instance:
java -Dlog_level=TRACE -jar [path]
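If the log level needs to be switched often, one option is to derive the JVM option from a variable. The following is a small sketch in a POSIX shell. The helper name `log_level_option` and the `LOG_LEVEL` variable are hypothetical, and the helper accepts only the two values documented above (DEBUG and TRACE). An empty value leaves the collector's default logging in place.

```shell
# Map an optional log level setting to the matching JVM option.
# Prints nothing for an empty value so the collector's default logging applies.
log_level_option() {
  case "$1" in
    DEBUG|TRACE) printf '%s\n' "-Dlog_level=$1" ;;
    "")          ;;   # no option: default INFO, WARN, and ERROR messages
    *)           echo "Unsupported log level: $1" >&2; return 1 ;;
  esac
}
```

A possible usage is `java $(log_level_option "$LOG_LEVEL") -jar [path]`, where the command substitution is left unquoted so that an empty result adds no argument.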