Catalog collector change log

5-24-22 DWCCv2.80

Hash: be5a85c754d54328accabe332dec55ce507baddbe68d2fe9e29a211e9ea1420f

  • With this release, DWCC requires Java 17. If you run the collector from within Docker, this change will not affect you. If you run DWCC from a .jar file, you will need to upgrade your JRE to 17 to run DWCC v2.80 and later.
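
Each release in this log lists a hash or digest. The following is a minimal sketch of checking a downloaded file against such a value, assuming the published value is a SHA-256 digest of that file (the log does not state which artifact each hash covers, and a Docker image digest is not a plain file hash). The function names are illustrative and not part of DWCC.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large .jar is not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, published: str) -> bool:
    """Compare a file's digest with a value copied from the change log.

    Accepts both the bare form and the "sha256:"-prefixed form, since
    both appear in this log. Assumption: the value is a SHA-256 file hash.
    """
    expected = published.strip().lower()
    if expected.startswith("sha256:"):
        expected = expected[len("sha256:"):]
    return sha256_of_file(path) == expected
```

A mismatch only tells you the file differs from whatever was hashed; it does not by itself say which side is wrong.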

5-13-22 DWCCv2.79

Hash: 5b548c82b96ad5e5dbd4770adff205c9d07cac3c5f949882d7d9381240366ddb

  • The Manta collector can now accept OAuth tokens for MANTA authentication (for harvesting metadata from MANTA version R35 and above).

  • We have released a new collector, powerbigov, that connects to the government PowerBI API URLs. It authenticates with a tenant ID only, not a username and password.

5-11-22 DWCCv2.78

Hash: 71edd8ff7a4c3ed8a91eaf36d59c8e2745b7a76f8666b5750cbee8205021c9c6

  • Added some small Tableau collector enhancements.

  • New PowerBiGov collector with specific endpoints for .gov customers. This collector does not accept a username or password.

  • For PowerBI, a new way to authenticate is available. A user can now enter a tenant ID with a client id and a client secret to authenticate, in addition to using a username and password.

  • For both PowerBI and PowerBiGov, when using the tenant ID, client ID, and client secret authentication method, the collector no longer emits information about PowerBI Apps.

4-27-22 DWCCv2.77

hash: 4bed848791cfa9e46c9db4a78c7a593bb1c986900dc6fcfcd4255ddce1528579

  • Fixes an issue with the Snowflake collector that prevented the bundled JDBC driver from being found. Any users working with DWCC 2.76 should update.

4-22-22 DWCCv2.76

hash: 30e60a4434ee64d2981b40eb2dc92506da3d367eab22bc0bca0c61bdd44a3f02

  • The Snowflake collector harvests some intra-database lineage information from database views.

  • Improved the host mapping in the Manta collector.

4-7-22 DWCCv2.75

hash: 1a59dbb3ff8679fb6ee22eadaeb04ccdb28c5660be029e78fbc96403ae33096f

  • The Manta collector now emits resources for file sources and targets and their directory structure. It also emits sources and targets as files.

4-1-22 DWCCv2.74

hash: 219428f6a72be91205408d5cb3f8cc8b27e1a9a4df0208e4cacb8fbaa1352f90

  • The Tableau collector now emits “column-level lineage”.

  • Improved styling of DWCC command-line errors

  • Updated command-line options for Datakin and Marquez.

3-16-22 dbt collector v.05

This version adds a third command-line argument to specify an output file name.

3-8-22 DWCCv2.73

hash: 119daf987dcfad25db599e1c1affedf17a35ff2aa002d0618d642eb309cebaaf

  • Permalinks to Looker explores included via externalUrl

  • Improvements to datakin/marquez collectors

  • Tableau collector now emits resources for Tableau Projects, allowing us to establish full relationships between projects and the workbooks and views that they contain

  • Monte Carlo data collector now emits data quality information using enhanced dwec ontology concepts

  • Looker collector now emits descriptions for measures and dimensions

  • MANTA collector now emits Snowflake resources found in MANTA scans

3-1-22 DWCCv2.72

Hash: 62d156aca58ec92513e8d6490f00fd10ee52dfb7a65f71c20c6a988c938dfddd

  • [BUGFIX] Invalid prefix when using --base option

  • Update dwcc transform to add catalog events to specific collectors

  • Added a Snowflake Sensitive Data Discovery collector

  • Sync CLI options between collector types

  • Validation of CLI options for DWCC

  • Improvements to the DWCC CLI

  • Update the MonteCarlo Collector to use the new Data Quality Ontology

2-17-22 DWCCv2.71

Digest: 03fc3df90ae63896d62ea22e00688f42cacf5b76d0f47691c06c104736680b2a

  • Bug fix for Marquez collector

  • Bug fix for Manta collector

2-9-22 DWCCv2.70

Digest: 06bb747c4d7705c1e44664de7854158d87468316bab549ec5604b0a075380c69

  • Preview images for Tableau assets are now harvested much more efficiently, and the resulting image data in the catalog graph are much smaller, reducing catalog harvest run time and enabling image objects to remain within platform constraints during ingest.

  • Fix for unexpected column type errors in BigQuery collector

2-8-22 DWCCv2.69

Digest: 5ab9b97d5f8f4568613438a9e52b0bdc12974f8d6edd0dab374a281c4982c737

  • Created new collectors for Marquez and Datakin

  • Added schema information to the Tableau collector outputs

2-4-22 DWCCv2.68

Digest: 23674ee02a6b725d5f9a453615dc507286da2ee606dca83c386472f3aa36d118

  • The Tableau collector now accepts Tableau “Personal Access Tokens” for authentication, via new cli options --tableau-pat-name and --tableau-pat-secret.

  • Fixed an issue with mis-identification of views as tables in BigQuery.

2-2-22 DWCCv2.67

Digest: 032867c9c52c8d46dc0b90a61a128be65ecec1440bb0adccb8b0d1b249b4e351

  • Fixed an issue with server name identification in Manta.

1-26-22 DWCCv2.66

Digest: fa9ae2eb3d68375a3ff01ac7bde98fd36f372b84dce0d411444146ea9566b47b

  • With this release, the Athena collector is no longer a JDBC collector. We harvest metadata by accessing the Athena API directly, rather than going through a JDBC driver. This means that it is no longer necessary to provide a JDBC driver when running the collector.

1-10-22 DWCCv2.65

Digest: ed08cdd21a374c30456de0989076f5180bc4187ca998358b051807e521fd44e6

  • This release adds a new option for the MANTA collector, --manta-max-parallel-scenarios. Specifying this option and passing an integer value will configure the MANTA API to export the specified number of scenarios in the MANTA graph in parallel. The default value is 4; adjusting this up or down can improve performance.

1-5-22 DWCCv2.64

Digest: 45b72798b0602885790388331a75db1f4286b15bf57b21f30f416eda79041571

This release upgrades dwcc’s dependency on the Apache Jena RDF library to version 4.2.0, which addresses security vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-39239.

12-23-21 DWCCv2.63

Digest: sha256:eb4208c914269c793a5e2143d59a9982e7b087c5da1c17dd075e02a326e64a3e

  • The Athena JDBC driver is no longer bundled with DWCC as we have discovered that the Athena driver itself has a dependency on a vulnerable version of log4j. Customers that use the DWCC Athena collector will now need to supply their own driver and put it in the jdbc driver directory (as is done with other collectors for which we don’t distribute a driver).

12-15-21 DWCCv2.62

Digest: sha256:2cd579e09f4eee94e141e8cf7e4e40e9a9b8803029df1be7112d67d62ef33b9e

  • The Oracle collector now supports connecting to the database via SID (instance ID) or Service Name. Service Name is the default. If a connection via SID is desired, pass the SID as the value of the -d / --database option and add the --oracle-sid-mode option (flag).
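
The two connection modes can be sketched as a small argument builder. This is a hedged illustration only: --database and --oracle-sid-mode come from the note above, while the function name and its parameters are hypothetical, and every other option a real run needs is left out.

```python
from typing import List


def oracle_connection_args(identifier: str, use_sid: bool = False) -> List[str]:
    """Build the database-related arguments for the Oracle collector.

    `identifier` is treated as a Service Name by default, or as a SID
    when `use_sid` is True. Hypothetical helper, not part of DWCC.
    """
    if not identifier:
        raise ValueError("a Service Name or SID is required")
    args = ["--database", identifier]
    if use_sid:
        # Flag with no value: --database is then interpreted as a SID.
        args.append("--oracle-sid-mode")
    return args
```

Keeping the flag off by default mirrors the note: Service Name is the default mode.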

12-13-21 DWCCv2.61

Digest: sha256:bd0ba96208d714ecef4131867cf5d16372be0a33f416c1d6bd01f132c8517323

  • The information schema collector has been modified so that the files table_constraints.csv and constraint_column_usage are now optional, not required.

12-10-21 DWCCv2.60

Digest: sha256:7fd825bfe7d2f99c9a1298ad26bc1934c9657cc7c5868dd093844344d18fc7b7

  • Updated the BigQuery collector to support current Google Cloud API enhancements.

  • Added a new Information Schema Collector. This collector runs via the catalog-information-schema command. Rather than connecting to a database, it catalogs four CSV files that are provided to the collector via the --csv-file-directory parameter. This collector is an option for customers whose database setups do not allow them to authenticate or establish connections to their database via our normal DWCC collectors.

12-2-21 DWCCv2.59

Digest: sha256:051f76748be1c6cf2c7557600dde71a39e1b822c9e49120881ce938f1c8c2b80

  • Verified the Manta collector works with MANTA R34.

  • Released the config file command.

  • Modified the Tableau collector to remove schema and database names from table names.

  • Updated the BigQuery collector to catalog all datasets in a project at once by default, with CLI options available to select specific datasets in a project. With this change, the --dataset param is no longer required. The help text has been updated to reflect these changes.

11-10-21 DWCCv2.58

Digest: sha256:82ebc1cec46f70de000aa94695359bd28d65c2782afc362c9ce14fadc04eae07

  • Added a new collector for Hive (as an alternative to catalog-hive) that uses only the Hive metastore. It does not connect to the Hive server directly.

  • The PowerBI collector now harvests workspaces and identifies other assets as being in workspaces

  • DWCC now emits “catalog events” into the catalog graph. These capture details about the cataloging process itself, including selected configuration options with which DWCC was run, and summary statistics about the catalog. The ingest process will soon extract this information from catalogs at ingest time and send it to Segment for downstream analysis.

11-1-21 DWCCv2.57

Digest: sha256:606f7cfbe60bf56b4c2ecd5fb3902d4de621e31ae76ad78e68c56c788f81e5e6

  • Fixed an issue in the Tableau collector in which Custom SQL Table objects without an associated database were not handled correctly.

10-27-21 DWCCv2.56

Digest: sha256:335f7e110a9506d95dff05971492e6509fb8537e74f9275d04dcf9e2427df0f0

  • Added new CLI options to the Salesforce collector so that it can handle sandbox environments and custom login domains that customers might have.

10-25-21 DWCCv2.55

Digest: sha256:c60ae69edc88b8801be833d578ef5dca73b6302646be9b30d31ccdfd7444288a

  • This release updates the BigQuery collector to handle fields in BigQuery tables for which the BigQuery API returns null type.

10-5-21 DWCCv2.53

Digest: sha256:59c960d525e66e77d08dd34fd58c9b5027334a4bd2271f1f059370ae006a4b0b

  • Enhancements to the MANTA collector to harvest additional lineage information from MANTA scans (lineage from Informatica PowerCenter in particular)

  • Tableau collector enhancement to provide a better warning to the user when an obsolete version of the Tableau API is specified

9-29-21 DWCCv2.52

Digest: sha256:915e4e91841001f80a84a65fcd76350b9a1d53f4e31678bb0e628d32beab94a1

  • Fixed an issue with the handling of certain fields and database information when the Tableau collector was run with a non-admin credential.

9-28-21 DWCCv2.51 (internal)

Digest: sha256:261c5bf33b2ae38cbda35a346fcb37c56bbf8ebfb773f328deb9140efba1c8bf

  • Fixed an issue in the Tableau collector with handling views and workbooks that exist outside of a project.

9-28-21 DWCCv2.50 (internal)

Digest: sha256:b407c629247f36afac3869eb8320464fce8caeb2865dd79811882b54ef94d1b5

  • Fixed an issue with the Tableau collector to handle workbooks that exist outside of projects.

9-24-21 DWCCv2.49

Digest: sha256:397e78867f41aaa393ff69f42b0fa524fdcad662ddd027925cf27f80497b24ce

  • Added a collector for Salesforce (catalog-salesforce)

  • Fixed an IRI mismatch issue for the Tableau collector when running on Tableau instances with a Snowflake datasource.

9-18-21 DWCCv2.48

Digest: sha256:c36755489b6235408aa4e639e6e184cab027a32a34e3b8ca369c3c6b3c4bff96

  • Made internal improvements to the Tableau collector to enable more efficient querying of the Tableau Metadata API.

  • Fixed an issue in the MANTA collector in which certain missing data in the MANTA lineage graph caused an exception.

9-10-21 DWCCv2.47

Digest: sha256:219edfa247929e15d7c4e2be99ef890b2487c398abc1a23b2f85b3de11812be3

  • Fixed an issue in the Reltio collector that occurred when a Reltio configuration was missing certain objects.

  • Added a collector for Databricks (catalog-databricks)

9-8-21 DWCCv2.46

Digest: sha256:e48cba45b457e076714d94d3a83d1164cb892864213732b3b2b334c041ff178a

  • Fixed an issue with creation of resource IRIs by certain collectors when the user chooses version 1 minting

  • Updated BigQuery collector to enable integration with data.world platform / connection manager

  • Fixed an issue with the MANTA collector in which certain large MANTA scans caused a numeric overflow during json de-serialization

  • Updated Reltio collector to include information about survivorship groups in the emitted catalog

8-24-21 DWCCv2.45

Digest: sha256:77f4c784b1d0166cf3bb87903696528f712fbe6aee1d4cb7e60097a0f494c7de

  • This release fixed an issue with JDBC drivers not being loaded by the Athena collector.

  • Added a collector for Reltio configurations (catalog-reltio).


Digest: sha256:47c1bb38b88c25801adf1f765e23c63637d15a60ae11fca8d63b53a8cd4755b2

  • Fixes an issue with URLs for sheets and dashboards that exist in Tableau Online or in Tableau Server within a site other than the default site.




  • Additional datetime fields added for Looker objects and typed as xsd:dateTime.

  • Fixed an issue caused by an undocumented change in Tableau Online’s REST API when using the Tableau collector to harvest metadata from Tableau Online.


Digest: sha256:e6bc353ea4b2ec3486b54d4e9280856d328d93f5d406e367c0c50303cde93704

  • The generic jdbc collector harvests database name when cataloging Intersystems Cache databases

  • Running the Snowflake collector with the -A / --all-schemas option harvests metadata from all available schemas, as with other collectors


Digest: sha256:bb79aa8afd19bf35b4b7e75840c21598702ec1d74b5f8640cc72a6758a3a0bc9

  • Fixed an issue with permalinks to objects in the MANTA collector.


Digest: sha256:44dd710a49a1500863f49e2f2e4ef261a45cdc6c7354702fe8e764210c27293b

  • Added support for Looker folders and additional attributes to the Looker metadata collector.

  • Added the ability to preview images to the Tableau metadata collector.


Digest: sha256:992671530f7483bfeb8a2aab52880a524b7df79caf427b373bd825115d71f4dc

  • Fixed an issue with the handling of certain special characters in catalog resource IRIs.

  • The --schema option for JDBC collectors can now be specified multiple times to enable the cataloging of multiple schemas in a single catalog.
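
To illustrate what "specified multiple times" means for an option such as --schema, here is a minimal sketch that uses Python's argparse as a stand-in. It is not DWCC's own parser (DWCC is a Java tool), and the function name and the `schemas` destination are hypothetical.

```python
import argparse
from typing import List, Sequence


def parse_schemas(argv: Sequence[str]) -> List[str]:
    """Collect every --schema occurrence into a single list, in order."""
    parser = argparse.ArgumentParser(prog="schema-demo")
    # action="append" accumulates one list entry per occurrence of the option.
    parser.add_argument("--schema", action="append", default=[], dest="schemas")
    return parser.parse_args(list(argv)).schemas
```

With this shape, passing --schema twice yields two schema names for a single catalog run, and omitting the option yields an empty list.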


Internal release


Digest: sha256:6a84217fa33df75d67ce51c486a90a802a8313a3432835abb55fffb5f1d3afc7

  • Updated Tableau collector to paginate additional graphql queries to avoid hitting Tableau Metadata API limits.

  • Updated the Hive2 collector to capture table-level metadata from the hive metastore

  • Updated the Tableau collector to allow the user to exclude specified Tableau objects from the catalog


Digest: sha256:8dd9793f3b0e74adcd7e7bc153f06b8c3098470217fb07af4336dde611269671

  • Improvements to error messages produced when using a config-file to run DWCC

  • We disallow running catalog-postgres and catalog-redshift in the same config file as the two collectors use incompatible JDBC drivers

  • Improved error handling throughout DWCC

  • Improvements in representation of Tableau data source names in tableau catalogs

  • Improvements to the MANTA collector

DWCC v2.35 Changes in this release:

  • Upgrade of Denodo collector to Denodo 8

  • Handle edge case of very large field values embedded in MANTA’s exported artifacts

  • Support for sites

  • Handle edge case of stored procedure columns in MANTA

DWCC v2.34 This release includes:

  • Enhancements to domo collector output

  • Testing improvements

  • A minor tableau collector enhancement

  • Fix for an issue in the tableau collector in which column fields were sometimes not properly identifying the Tableau Table from which they sourced their data

  • Improvement to the presentation of Domo catalogs in the platform UI.

  • Changes to the dockerhub repository where we house images containing non-released versions of dwcc. Previously we were calling these “beta” releases; we now call them “release candidates”. The new repository is datadotworld/dwcc-rc and the image tags are x.y-rc-z where x.y is the next expected dwcc release, and z is an increment.
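
The x.y-rc-z tag format described above can be parsed with a short regular expression. This is a minimal sketch that assumes x, y, and z are non-negative integers; the class and function names are illustrative and not part of DWCC.

```python
import re
from typing import NamedTuple, Optional


class ReleaseCandidateTag(NamedTuple):
    major: int      # x in x.y-rc-z
    minor: int      # y in x.y-rc-z
    increment: int  # z in x.y-rc-z


_RC_TAG = re.compile(r"(\d+)\.(\d+)-rc-(\d+)")


def parse_rc_tag(tag: str) -> Optional[ReleaseCandidateTag]:
    """Return the parsed tag, or None if `tag` is not of the form x.y-rc-z."""
    match = _RC_TAG.fullmatch(tag)
    if match is None:
        return None
    return ReleaseCandidateTag(*(int(part) for part in match.groups()))
```

Because the parts are stored as integers, parsed tags sort numerically, so 2.35-rc-10 orders after 2.35-rc-9.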

DWCC v2.33 Adds support for harvesting intra-database lineage from MANTA scans, and accommodates changes in MANTA R32 (aka 1.32). We no longer support MANTA versions earlier than MANTA R32.

DWCC v2.32 This release adds in collector support for Vertica db.

DWCC v2.31 Issued fix to ensure alignment of identifiers for databases referenced by Tableau and Looker collectors.

DWCC v2.30 Installed a config file-driven configuration (as a hidden feature for now). Issued a fix for handling empty PowerBI objects returned by the API.

DWCC v2.29 The data.world catalog collector now supports Tableau Online! Additionally there was a bugfix for PowerBi.

DWCC v2.28 Bugfix release

DWCC v2.27 Added the optional CLI option tableau-graphql-page-size to the Tableau collector, which allows the user to set the number of objects to be included in each page of paginated queries.

DWCC v2.26 Updated the PowerBi collector so that if a report is unavailable via the API it will be logged, and cataloging will continue on the rest of the repository.

DWCC v2.25 This release includes better and more user-friendly error handling and reporting. We have also added an enhanced collection of Tableau metadata via the Tableau Metadata API (graphql endpoint). New metadata includes data sources, databases, fields, metrics, and many more inter-object relationships.

DWCC v2.24 DWCC is now distributed via Dockerhub. Additionally, there are changes to the Tableau and PowerBI collectors, the ability to change the level of error messages written to the console and log file, and a new subcommand to display the DWCC license text.

For Tableau:

  • The Tableau collector now emits RDF in which the object of `dct:creator` is a `dwec:Agent` instead of a string literal. This means we write additional details about the Tableau account that created the dashboard, via properties of the `dwec:Agent` resource. These details include: account name, account “full name”, and account email address (if they are populated in Tableau).

For PowerBI:

  • The PowerBI collector writes resources representing powerbi “data sources” that are now of a PowerBI-specific class, rather than `dwec:DataArtifact`.

Logging changes:

  • It is now possible for users to set the level (severity) of log messages written to the console and log file. By default, we write “info” level messages; users can choose to write only errors (level=“ERROR”), errors+warnings (level=“WARN”), or all messages including debug trace (level=“DEBUG”). This is useful if we want to have customers run DWCC with debug logging turned on, for troubleshooting problems etc.
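
The level semantics described above can be sketched as a simple threshold check. This is an illustration of the ordering only, assuming the four level names from the note; how the level is passed to DWCC is not shown here, and the names below are hypothetical.

```python
# Ordered from most verbose to least verbose.
LEVELS = ["DEBUG", "INFO", "WARN", "ERROR"]


def is_written(message_level: str, configured_level: str = "INFO") -> bool:
    """Return True if a message at `message_level` passes the configured level.

    A message is written when its severity is at or above the configured
    threshold. Unknown level names raise ValueError via list.index.
    """
    return LEVELS.index(message_level.upper()) >= LEVELS.index(configured_level.upper())
```

For example, with the level set to WARN, both warnings and errors are written, while info and debug messages are dropped.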

Display DWCC license information:

  • License information for DWCC is now available as a subcommand of DWCC. To get all licensing information, run the command docker run -it --rm datadotworld/dwcc:X.XX display-license where X.XX is a version of DWCC greater than or equal to 2.24.
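
As a hedged sketch, the command above can be assembled programmatically with the version requirement checked first. It assumes version strings of the form "major.minor"; the helper names are hypothetical. Comparing versions as integer tuples matters here, because a plain string comparison would rank "2.9" above "2.24".

```python
from typing import List, Tuple

MINIMUM_VERSION = (2, 24)  # display-license is available from 2.24 onward


def parse_version(version: str) -> Tuple[int, int]:
    """Split a "major.minor" string into a tuple of integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def display_license_command(version: str) -> List[str]:
    """Build the docker invocation from the note for a given DWCC version."""
    if parse_version(version) < MINIMUM_VERSION:
        raise ValueError(f"display-license requires DWCC 2.24 or later, got {version}")
    return ["docker", "run", "-it", "--rm",
            f"datadotworld/dwcc:{version}", "display-license"]
```

The returned list can be passed to a process runner such as subprocess.run without shell quoting concerns.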

DWCC v2.23 Internal release

DWCC v2.22 Internal release

DWCC v2.21 Fixed some timeout issues with the Looker collector when fetching images from the Looker API. Fixed an issue with cataloging reports and dashboards based on user workspace permissions in PowerBi.

DWCC v2.20 With this release our Tableau collector now supports cataloging of workbooks and non-dashboard views as well as harvesting tags on workbooks and views. Fixed an issue in the Looker collector where preview images returned from the Looker API were missing.

DWCC v2.19 Includes a clean-up of the embedded help commands for several collectors and:

  • Fixes an issue with the Tableau Server collector when cataloging multi-site server instances.

  • Adds --tableau-site parameter to enable user to restrict cataloging to a single site (not required, by default all sites in the instance are scanned). Value provided to --tableau-site can be a site ID or name.

DWCC v2.18 The tableau collector now has a flag option --tableau-skip-images which skips the harvesting of preview images for views. Usage is like this:

... catalog-tableau --tableau-api-base-url=http://ec2-44-192-86-11.compute-1.amazonaws.com/api/3.10/ --tableau-username=admin --tableau-password=password -a sc-test3 -n tableau-test --tableau-skip-images

DWCC v2.17 Adds a collector for Presto

DWCC v2.16 This release:

  • Adds the parameter --all-databases to the Athena collector so that it can catalog all the databases accessible from the logged-in account.

  • Fixes some issues with datatypes for dwec:externalUrl predicates.

DWCC v2.15 This release contains the following:

  • The Tableau collector formerly had a CLI parameter --tableau-project-id which could be used to catalog only assets in the project with the specified ID. The parameter is now --tableau-project and takes either a project ID or project name

  • Update to the MANTA collector to accommodate a minor change in the MANTA API with v 1.31. Customers who have updated their MANTA instance to v 1.31+ will want to use DWCC 2.15+.

  • The Looker collector now works for non-admin Looker users; however, when DWCC is run by a non-admin, the emitted catalog will not contain any information about databases used by Looker analysis assets (access to database information in Looker requires admin permissions).

  • All JDBC collectors now populate two new properties for dwec:DatabaseColumn: dwec:columnDefaultValue and dwec:columnIsNullable, which contain the default value for that column in newly inserted rows, and whether the column can be null, respectively. (Note that only some databases/drivers provide this metadata; we put it in the catalog if it’s there.)

DWCC v2.14 Adds a collector for Looker. Minor update to the docker-save.sh script that includes available versions in the error message if you don’t supply a version.

DWCC v2.13 Adds CLI parameters so that it is now possible to pass arbitrary driver properties through to the connection.

DWCC v2.12 Adds a metadata collector for SAP (formerly Sybase) SQL Anywhere.

DWCC v2.11 Improves the Dremio collector’s handling of data sources nested within multiple layers of folders, and fixes a minor issue with the Dremio collector’s harvesting of lineage metadata from the Dremio graph API.

DWCC v2.10 Adds a collector for Domo. JDBC database collectors can now catalog all schemas in the database at once (the default remains to catalog only the user's default schema).

DWCC v2.9 Adds a Tableau Server collector and extends the OpenAPI collector to include a few additional schema property metadata properties.

DWCC v2.8 Adds Infor ION data lake collector. Optimized collection of JDBC metadata (performance improvement).

DWCC v2.7 Adds a collector for PowerBI.

DWCC v2.6 Adds the Manta collector.

DWCC v2.5 Upgrades the Java runtime.

DWCC v2.4 Extends handling of OpenAPI collector parameters and responses.

DWCC v2.3 Adds support for OpenAPI (fka Swagger) collector.

DWCC v2.2 A refactoring release.

DWCC v2.1 Fixes an issue with the Denodo cataloger jdbc url port.

DWCC v2.0 We now use v2 URIs as the official locator IDs for metadata resources. This is a breaking change (for structural, intentional reasons) which is not backwards compatible with v1 URIs. For more information see the article on DWCC v2.X.

DWCC v1.20 Addresses some memory issues and open-cursor leaks.

DWCC v.1.19 Adds statements to the catalog graph indicating that the catalog was generated by DWCC (with a version). We also added the ability to write database schema objects to the catalog graph.

DWCC v1.18 allows you to specify alternate organization permissions and upload locations when performing an automatic upload of the metadata.

DWCC v.1.16 and DWCC v.1.17 Address issues with the SQL Server cataloger.

DWCC v.1.15 Adds Dremio support with optional Catalog API lineage fetching.

DWCC v1.14 enables you to change the amount of memory that gets allocated to a DWCC docker process. See our article on allocating additional memory to Docker for more information.

DWCC v.1.13 Adds support for Microsoft SQL Server and enables the JVM to use available memory in the container (useful for creating large catalogs). Additionally, we improved data type recognition in the AWS Glue cataloger.

As of DWCC v1.12 we can support not only Glue ETL jobs, but also Glue Data Catalog tables and columns.

With DWCC v.1.11 you can:

  • Upload generated catalogs via the --upload / -U command-line parameters

  • Upload the DWCC log when uploading generated catalogs with --upload

  • Fetch an organization's current catalog with the fetch-catalog command

In DWCC v1.10 we added support for AWS Glue and AWS Athena including cataloging ETL jobs associated with an AWS account. There is no need to mount in a jdbc drivers directory as the Glue cataloger uses the Glue API, not JDBC.

DWCC v.1.9 is a bug cleanup release.

It is now possible with DWCC v.1.8 to use JDBC drivers on the classpath as well as those found in a user-specified JDBC driver directory (drivers in the directory have higher precedence than classpath drivers).
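
The precedence rule can be sketched as a two-level lookup. This is an illustration of the ordering only, not DWCC's actual driver loading: the function name is hypothetical, and it assumes each location is represented as a mapping from a driver class name to the jar that provides it.

```python
from collections import ChainMap
from typing import Mapping, Optional


def resolve_driver(driver_class: str,
                   directory_drivers: Mapping[str, str],
                   classpath_drivers: Mapping[str, str]) -> Optional[str]:
    """Return the jar providing `driver_class`, preferring the driver directory.

    ChainMap searches its maps left to right, so a driver found in the
    user-specified directory shadows one with the same name on the classpath.
    """
    return ChainMap(directory_drivers, classpath_drivers).get(driver_class)
```

If the same driver class is present in both places, the directory copy wins; a driver that exists only on the classpath is still found.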

DWCC v.1.7 is a bug-fix release

DWCC v.1.6 adds support for arbitrary JDBC data sources and the ability to build one-off Docker images for testing, demos, etc.

With DWCC v.1.5 we add support for Oracle.

In DWCC v.1.4 we add support for Google BigQuery.

DWCC v.1.3 brings much new functionality including:

  • Support for Denodo and Snowflake

  • Compatibility of JDBC catalogs with tables imported through data.world integrations

  • Ability to differentiate source information for databases cataloged from localhost

  • Cataloging of REMARKS fields into dct:description

With DWCC v.1.2 we support Redshift databases.

DWCC v.1.1 contains documentation clarification and expansion for the documents to streamline tags on customer docker hosts.

The initial release of DWCC v.1.0 provides support for metadata catalog extraction for DB2, Hive, MySQL, Postgres.