Security

Security is a paramount concern for enterprise customers evaluating new systems. The data.world service is not only cloud-first but also security-first. We designed it from the ground up to support a combination of internal and external compliance needs. The following articles document how we keep both your data and your connections safe, and how we enable you to do the same.

SOC 2 Type II Compliance

data.world has been designed for the secure handling of data and uses practices proven to be secure in the cloud. We regularly audit our configurations against the latest CIS Benchmark for AWS, which was created by a large community of cybersecurity experts. We also engage a third-party security firm to perform penetration testing annually.

SOCII.png

Our comprehensive security program applies current tools and practices to network connectivity, access control (such as identity management and role controls), data handling (such as encryption), incident management, and more.

How our data connections interact with your data

There are three main ways that data.world can interact with data from your environment:

  1. Metadata collection, including optional Technical Lineage Server

  2. Virtual query capability

  3. Data extract/import capability

For the vast majority of security and compliance needs, (1) metadata collection and (2) virtual query are sufficient, and (3) data extract/import can optionally be disabled. If only (1) and (2) are used, data.world does NOT store data in our platform, only metadata.

Metadata collection, including optional Technical Lineage Server

Docker microservices (or Java application in the case of the Technical Lineage Server) are deployed in the customer environment by the customer. These services only transmit metadata to data.world:

  • Source system descriptive information

  • Schema information such as tables and columns

  • Object-oriented descriptive information such as the titles/names of dashboards or reports and when they were created

Transmission of this information is over HTTPS with optional custom SSL certificate support. This information is non-sensitive in nature, and can be managed by data stewards, data governance, and data security professionals in your organization for further curation and access control.

Data extract/import

Virtual connections can be used to extract/import data into data.world. Imported data often benefits from improved query performance when it is kept in a small or medium-sized dataset (i.e., under 3 GB). We recommend that larger amounts of data be separated into multiple datasets, or accessed via virtualization instead. Data in data.world datasets is kept securely in encrypted data stores.
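The sizing guideline above can be turned into a quick planning calculation. The following is a minimal sketch, not part of data.world's tooling: the function name is hypothetical, and it assumes "3 GB" means 3 * 1024**3 bytes, which the guideline does not specify.

```python
# Assumption: "3 GB" is read as 3 * 1024**3 bytes; the guideline itself does
# not say whether decimal or binary gigabytes are meant.
SIZE_GUIDELINE_BYTES = 3 * 1024**3


def datasets_needed(total_bytes: int, limit_bytes: int = SIZE_GUIDELINE_BYTES) -> int:
    """Return how many datasets are needed so that no part exceeds the limit.

    Hypothetical helper for planning an import; it only does the arithmetic.
    """
    if total_bytes < 0 or limit_bytes <= 0:
        raise ValueError("sizes must be non-negative and the limit positive")
    # Integer ceiling division; an empty import still needs one dataset.
    return max(1, (total_bytes + limit_bytes - 1) // limit_bytes)
```

A result greater than 1 suggests either splitting the import into that many datasets or using a virtual connection instead.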

Note

Some customers choose to disable the ability to extract and import data to prevent any data from being persisted in data.world for compliance or other reasons. The decision to allow or disallow extract/import is fully in the hands of the customer.

Virtual query

Datasets in data.world are built on top of the connections, but no data is stored in them. When a user performs a data query, a short-lived connection fetches only the query results. These results are NOT persisted, neither in storage nor in memory. Additionally, by default, queries have a 5-minute timeout for performance and security reasons. Finally, all queries are comprehensively tracked and audited via query audit logs that customers can access and monitor.

A virtual appliance (or optionally a hardware appliance or reverse SSH tunnel service) is implemented in the customer environment by the customer. The virtual and hardware appliances are powered by our partner Trustgrid, a secure connection bridge technology vendor trusted by many of the largest banks and financial institutions. Using the Trustgrid technology, a network bridge is configured on the data.world side in conjunction with the customer. At that point, read-only system user credentials can be securely stored in data.world as connections, and encrypted connections are initiated as outbound only.

Understanding permissions

Permissions on a dataset or project are initially set when the resource is created. If an organization is set as the owner, then permission options are:

  • No one

  • Everyone in the organization

  • Public to the data.world community

New_dataset_permissions_org.png

Note

One safeguard against users accidentally publishing enterprise data out to the wider community is our standard enterprise team publication configuration: By default ‘Create public datasets’ is turned off for our Enterprise customers.

Owners of datasets and projects can invite specific users to contribute, or approve incoming requests from users who want to contribute. Either way, the owner controls what each contributor can do by granting three levels of permissions:

  • View only

  • View + edit

  • View + edit + manage

Datasets have another layer of access permission, as they can be flagged as Discoverable. More about this kind of access is in the section Discoverable datasets.

Here's what each permission level allows a contributor to do:

View only: primarily used for private datasets and projects, this allows the user to simply view the dataset or project. As part of that, the contributor can:

  • Download any of the files.

  • Query the data and export results.

  • View and comment in either public or private discussions.

  • Create new discussion topics.

View + edit: in addition to the view-only permissions, the contributor can:

  • Make edits to descriptions and summaries.

  • Add and remove tags.

  • Add and remove files.

  • Replace files by uploading new versions with the same name.

  • Modify file and column descriptions.

  • Modify license type.

  • Switch the dataset or project between open and private.

  • Publish queries for others to use.

View + edit + manage: The contributor will have full admin controls to the dataset or project. In addition to the view + edit permissions, they can:

  • Delete the dataset or project.

  • Add, remove, and modify contributors.
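Because each permission level includes everything from the level below it, the three levels can be modeled as a cumulative hierarchy. The sketch below is only an illustration of that structure. The enum, the action names, and the helper functions are hypothetical and are not part of any data.world API.

```python
from enum import IntEnum


class Permission(IntEnum):
    VIEW = 1      # "View only"
    EDIT = 2      # "View + edit"
    MANAGE = 3    # "View + edit + manage"


# Actions first unlocked at each level, abbreviated from the lists above.
# The action names are illustrative labels, not real API identifiers.
ACTIONS_BY_LEVEL = {
    Permission.VIEW: {"download_files", "query_data", "comment", "create_discussion"},
    Permission.EDIT: {
        "edit_descriptions", "edit_tags", "add_remove_files", "replace_files",
        "modify_license", "toggle_open_private", "publish_queries",
    },
    Permission.MANAGE: {"delete_resource", "manage_contributors"},
}


def allowed_actions(level: Permission) -> set:
    """Return every action granted at `level`, including lower levels."""
    allowed = set()
    for lvl, actions in ACTIONS_BY_LEVEL.items():
        if lvl <= level:
            allowed |= actions
    return allowed


def can(level: Permission, action: str) -> bool:
    """Return True if a contributor at `level` may perform `action`."""
    return action in allowed_actions(level)
```

For example, `can(Permission.EDIT, "delete_resource")` is false, while `can(Permission.MANAGE, "query_data")` is true because manage inherits the view-level actions.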

Manage your organizations, roles, and users

Organizations are the central “group” unit in data.world, and can be nested for ultimate control and flexibility over access control. An organization is made up of several different types of users:

  • Organization admins are your data domain stewards / governance managers or other personnel who should have direct management over all content in that org.

  • Individual resources can have owners assigned which may be more specific, “ground-level” data stewards.

  • “Discover” level access provides high-level metadata, but requires that a user request access to the resource before seeing additional information or being able to query the data. For more details on discoverable access, a powerful feature that enables agile data governance, see the section Discoverable datasets.

Note

Organization admins can see all org-owned datasets regardless of sharing configuration.

How link sharing works

One of the powerful features of our platform is that results from queries in a project can be reused or embedded. These links are not discoverable.

When a link to the results of a query is created, it is encoded with the user token information for the user who originally ran the query. Every subsequent run from that link also uses the original user's permissions and token. As a further safeguard, even with the link, access is scoped and limited to the specific results of the query. Finally, in VPC deployments share URLs expire after 12 hours.
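The 12-hour expiry for VPC deployments comes down to simple time arithmetic. The sketch below only illustrates that window. How data.world enforces expiry internally is not described here, and the constant and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# 12 hours, per the VPC rule described above.
SHARE_URL_LIFETIME = timedelta(hours=12)


def share_url_expired(created_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True once a share URL created at `created_at` is 12 or more hours old.

    Both datetimes must be timezone-aware. Hypothetical helper for illustration.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at >= SHARE_URL_LIFETIME
```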

Connection security

There are a variety of connection solutions available for configuring data.world access to remote data sources, and there are several factors to consider when determining the best way to connect. Some connection types are simpler to set up, while others offer enhanced security characteristics but require the involvement of internal IT to deploy software and/or hardware. The connection types available are:

  • Direct connection (inbound)

  • SSH tunnel (inbound, preferred)

  • Bridged connection appliance (outbound)

Direct connection (inbound)

When a database is accessible on the open internet from a known hostname or IP, you can simply connect to it directly from data.world. You can configure the connection either with the Connection manager, or through the Integrations gallery. This solution is well suited to online or SaaS products like Snowflake, or to cases where you want to test virtualization capabilities.

Note

When a database is located inside of an organization's network, this is not the preferred solution. Opening database servers to the open web can pose a security risk and should be avoided. If you are considering this solution, you should block traffic from all hosts except for those from a specified allowlist.

SSH tunnel (inbound, preferred)

When connecting to a database server, data.world allows you to optionally configure an SSH tunnel to connect through. This solution requires some setup on the part of the network administrator, but is more secure than a direct connection.

This is the preferred method for connecting database servers to data.world. It is easy, flexible and secure. No additional hardware is required beyond a bastion server. Many organizations have these deployed as a normal part of their infrastructure.

For additional security, data.world provides user-specific SSH public keys, which should be added to the authorized_keys file on the bastion server (typically ~/.ssh/authorized_keys) to ensure traffic is from data.world.
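Installing the public key on the bastion server can be scripted. The following is a minimal sketch that assumes a standard OpenSSH layout, with the key file at `.ssh/authorized_keys` under the tunnel user's home directory. The function name is hypothetical, and the key string you pass in is whatever data.world shows for your user.

```python
import os
from pathlib import Path


def install_public_key(public_key: str, home: Path) -> Path:
    """Append `public_key` to <home>/.ssh/authorized_keys unless already present.

    Hypothetical helper; assumes the standard OpenSSH file layout.
    """
    ssh_dir = Path(home) / ".ssh"
    ssh_dir.mkdir(mode=0o700, parents=True, exist_ok=True)
    auth_file = ssh_dir / "authorized_keys"

    text = auth_file.read_text() if auth_file.exists() else ""
    key = public_key.strip()
    if key not in text.splitlines():
        # Make sure the new key starts on its own line.
        separator = "" if text == "" or text.endswith("\n") else "\n"
        with auth_file.open("a") as fh:
            fh.write(separator + key + "\n")

    # OpenSSH ignores key files with loose permissions under its default settings.
    os.chmod(ssh_dir, 0o700)
    os.chmod(auth_file, 0o600)
    return auth_file
```

Running it twice with the same key leaves a single entry, so it is safe to include in repeatable provisioning scripts.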

advanced_connection_screen.png

Bridged connection appliance (outbound)

With some organizations, any incoming connection at all is considered insecure. This is especially true in industries which deal with health or financial data. In these cases, consider an alternative architecture which does not require the organization to accept connections from the open web.

A bridged connection involves deploying an appliance inside of the organization's network—where the target database servers reside. The appliance makes outbound connections at startup, and maintains these connections over time.

data.world has partnered with Trustgrid (http://trustgrid.io) to provide this capability. While Trustgrid helps configure and maintain the connection to the data.world network, data is visible only to data.world. No data enters Trustgrid's network.

This solution requires a commitment of time and resources on the part of the organization's IT department. They must work with data.world to deploy and configure the appliance. Ongoing maintenance of the appliance should be minimal, requiring time only in exceptional cases.

The appliance runs inside the organization's network and makes outbound connections to:

  • data.world "Data Plane" – data transfer only

  • Trustgrid "Control Plane" – configuration only

  • The target database instance(s)

Note

There are no inbound connections to the organization from the open internet with this solution.

Appliance requirements for bridged connections

A hardware or virtual appliance may be deployed. If you meet minimum requirements outlined below, virtual appliances are preferred, as hardware deployments require additional time and physical (rack) space.

  • Hardware: 2 vCPU, 4 GB RAM, 32 GB storage (supports 250 Mbps of throughput)

  • Hypervisor/Virtualization: VMware vSphere 5.5 (or greater), Amazon Web Services, Microsoft Azure Cloud, and Google Cloud Platform

Configuration

Trustgrid appliances are stateless network devices. They require specific vSphere configuration settings to maximize uptime:

  • Set the DRS level for a VM to either PartiallyAutomated or Disabled

  • Create an anti-affinity rule

  • Do not back up Trustgrid virtual appliances

  • Deploy Trustgrid secondary high-availability appliances on separate physical hosts

Note

High availability deployments require two instances to be deployed and running.

Integrations and security

Integrations in data.world are primarily used either to bring data in for querying or for metadata analysis, or to take data out and work with it in a third-party application. Security for both types is comprehensive. In structure, integrations are stored as datasets in data.world.

Integration access

At the core, every integration uses some form of a user token to ensure that users only have access to the data that they should have access to. In the case of integrations used to download data into third-party applications, this token can be created via an OAuth flow or may involve the user copy/pasting a token that they copied from their advanced settings page.

In the case of database integrations, where we connect out to the system, we use the credentials entered by the data engineer. For Athena, the data engineer must also configure their AWS instance to allow us to connect.

Integration visibility and permissions

The integrations available to users are presented on an integrations web page:

Integrations_page.png

Organizations in the multi-tenant environment have all the publicly available integrations as well as any private integrations that they have specifically created for their organizations. Access to private integrations can be set by an organization admin.

Organizations currently using a VPC have implementations that come default with NO integrations on their integrations page. This configuration enables them to set visibility and access very flexibly. All permission levels and access can be set within the platform by the data administrators in the organization.

Integration architecture

data.world is a cloud-first, SaaS solution. However some use cases and capabilities either can be or must be implemented in your own compute environment, dependent on security or infrastructure operations requirements.

This document outlines the architecture design of those customer hosted components, and how to implement them.

Servers and permissions

Metadata collectors can be run on the same server. If you do so, ensure you have enough combined compute, storage, and networking resources to meet peak usage based on your scheduling. For example, if you plan to schedule 10 complex collection tasks to run simultaneously from the same instance, it is recommended that you increase available resources 3-5X, and monitor the resource usage on that instance for further optimization.

We recommend the data.world Bridge and technical lineage server be implemented on separate instances from metadata collectors to minimize resource interference and maximize network security control. Specifically, for the technical lineage server, MANTA states that a “dedicated machine is recommended for MANTA to avoid collision for resources and limit access of MANTA to other data and applications for security reasons.”

Make sure the services have the proper permissions to allow network connectivity:

  • Metadata collectors must be able to connect to the source systems like databases and BI tools that you intend to collect metadata from.

  • The metadata collector you are using for lineage collection must be able to connect to the technical lineage server (MANTA) in order to pass some lineage information to data.world.

  • If you intend to automate the upload of metadata to data.world, metadata collectors must have permission to do an outbound connection to api.data.world via HTTPS on port 443.

  • The technical lineage server also must be able to connect to the source systems like databases and BI tools that you intend to collect metadata from.

  • The technical lineage server does not need to have outbound internet access, unless you intend to make the technical lineage server UI available via the public internet. However, the technical lineage server does have its own UI, which your technical users will benefit from being able to use directly. The UI port should be made accessible to others in your enterprise, and it optionally supports LDAP/AD integration for simplified, consolidated user administration.

  • The data.world Bridge can be a hardware appliance, but most often is leveraged as a virtual appliance. These appliances are provided by data.world. The data.world Bridge does not need to be able to connect with the on-premise metadata collectors nor the technical lineage server. It is only focused on secure brokering of data.world hosted metadata collectors (noted in DWCC hosted by data.world below) and data.world data virtualization / federated query capabilities.

More details are available in the supplemental technical lineage server (Manta) and data.world Bridge documentation articles.
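Before scheduling collectors, it can help to confirm the network paths listed above from the host that will run them. The sketch below is a generic TCP reachability check using only the Python standard library. The function name is hypothetical, and the only target taken from this article is api.data.world on port 443. Source-system hosts and ports are yours to supply.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    This checks network reachability only; it does not validate TLS
    certificates or credentials.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

On a metadata collector host, `can_connect("api.data.world", 443)` returning False usually points to an outbound firewall rule that still needs to be opened. The same call with your database host and port checks the path to a source system.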

Metadata collectors

Metadata collectors are used to connect to source systems such as databases and BI tools, and collect and generate useful information that should be in the data.world catalog. The most commonly used one that we provide is called DWCC (the data.world catalog collector).

If the technical lineage server (Manta) is in scope of the customer solution, DWCC is used to collect metadata from Manta and pass it to data.world. This is the method in which data.world receives information from Manta (Manta does not connect to data.world directly).

DWCC hosted by data.world

By default DWCC is run in the cloud, fully managed by data.world as part of the Connection Manager using tasks. If you have the data.world Bridge set up, it will leverage your bridge connection just like the data virtualization / federated query engine can. See our article on tasks for more details.

Connection_manager_bridge_connections.png

Create_a_task_dwcc_hosted.png

DWCC self-hosted by customer

For security or infrastructure operations reasons, you may opt to run DWCC in your on-premise compute environment behind your own firewall. If you do, keep in mind that you cannot use the Connection Manager UI. Instead, you use the DWCC command line experience, which is documented in detail in our help documentation.

DWCC ships as a Docker image which can be loaded and run with a series of command line (CLI) options. It outputs a file with the extension *.dwec.ttl that you can upload to data.world manually, or you can have the catalog collector upload it automatically using an API token.
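If you upload the output manually, a small helper can locate the most recent collector output first. This is a minimal sketch: the `*.dwec.ttl` pattern comes from the description above, while the function name and the idea of a single output directory are assumptions for illustration. The upload step itself is not shown.

```python
from pathlib import Path
from typing import Optional


def latest_catalog_file(output_dir: Path) -> Optional[Path]:
    """Return the most recently modified *.dwec.ttl file in `output_dir`, or None.

    Hypothetical helper; `output_dir` is whatever directory you mount
    for the collector's output.
    """
    candidates = list(Path(output_dir).glob("*.dwec.ttl"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```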

Prerequisites

  • The computer running the catalog collector should have connectivity to the internet or access to the source instance, a minimum of 2 GB of memory, and a 2 GHz processor.

  • Docker must be installed. For more information see https://docs.docker.com/get-docker/. If you can't use Docker, we have a Java version available as well. Contact us for more details.

  • The user defined to run DWCC must have read access to all resources being cataloged.

Installing the collector

  1. For the latest version of DWCC, open a command line interface (e.g., Terminal on Mac) and enter: docker pull datadotworld/dwcc

  2. For a specific version of DWCC, enter: docker pull datadotworld/dwcc:X.XX where X.XX is the version number you want to pull.

Handling custom certificates

If the target data instance has a custom SSL certificate, we recommend extending our Docker image and installing the custom cert like this (where ./ca.der is the name and location of the cert file).

Dockerfile:

FROM datadotworld/dwcc:x.y
ADD ./ca.der ca.der 
RUN keytool -importcert -alias startssl -cacerts -storepass changeit -noprompt -file ca.der

Important

Do not forget to replace x.y in datadotworld/dwcc:x.y with the version of DWCC you want to use. E.g., datadotworld/dwcc:2.345.

Then, in the directory with that Dockerfile:

docker build -t dwcc-cert .

Finally, change the docker run command to use dwcc-cert instead of datadotworld/dwcc.

Automatic updates to your metadata catalog

Keep your metadata catalog up to date using cron, your Docker container, or your automation tool of choice to run the catalog collector on a regular basis. Considerations for how often to schedule include:

  • Frequency of changes to the schema

  • Business criticality of up-to-date data

For organizations with schemas that change often and where surfacing the latest data is business critical, daily may be appropriate. For those with schemas that do not change often and which are less critical, weekly or even monthly may make sense. Consult your data.world representative for more tailored recommendations on how best to optimize your catalog collector processes.
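If you schedule with cron, the daily, weekly, and monthly options above map to standard five-field cron expressions. The sketch below generates a crontab line for each. The run times are arbitrary choices, and the wrapper script path in the usage note is hypothetical. It stands in for whatever script holds your collector invocation.

```python
# Standard five-field cron expressions:
# minute hour day-of-month month day-of-week
SCHEDULES = {
    "daily": "0 2 * * *",    # every day at 02:00
    "weekly": "0 2 * * 0",   # Sundays at 02:00
    "monthly": "0 2 1 * *",  # first day of each month at 02:00
}


def crontab_line(frequency: str, command: str) -> str:
    """Return a crontab entry that runs `command` at the chosen frequency."""
    try:
        schedule = SCHEDULES[frequency]
    except KeyError:
        raise ValueError(f"frequency must be one of {sorted(SCHEDULES)}") from None
    return f"{schedule} {command}"
```

For example, `crontab_line("weekly", "/opt/dwcc/run-collector.sh")` produces `0 2 * * 0 /opt/dwcc/run-collector.sh`, which can be added with `crontab -e` for the user that runs the collector.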

Technical lineage server MANTA

The technical lineage server, powered by MANTA, connects to source systems like databases, ETL tools, and BI tools, analyzing SQL and other code and configuration assets where applicable, and generates holistic, cross-system technical data lineage. The technical lineage server is always self-hosted by the customer, similar to self-hosted DWCC.

This article provides a high-level summary of prerequisite information. For official information and additional details, see the supplemental documentation “MANTA Flow Installation and Usage Manual”.

Technical lineage server (MANTA), also known as MANTA Flow, is made up of three major components:

  • MANTA Flow CLI—Java command line application that extracts all scripts from source databases and repositories, analyzes them, sends all gathered metadata to the MANTA Flow Server, and optionally, processes and uploads the generated export to a target metadata database

  • MANTA Flow Server—Java server application that stores all gathered metadata inside its metadata repository, transforms it to a form suitable for import to a target metadata database, and provides it to its visualization or third-party applications via API

  • MANTA Admin UI (runs on the MANTA Flow Service Utility application server)—Java server application providing a graphical and programming interface for installation, configuration, updating, and overall maintenance of MANTA Flow (detailed documentation can be found on the page MANTA Admin UI)

A dedicated machine is recommended for MANTA to avoid collision for resources and limit access of MANTA to other data and applications for security reasons.

Minimum Configuration

CPU - 4 cores at 2.5 GHz

RAM - 12 GB

HDD - 500 MB for MANTA installation + 50 GB space for metadata; minimum 150 IOPS

Recommended Configuration

CPU - 8 cores at 3 GHz

RAM - 24–256 GB

HDD - 500 MB for MANTA installation + 100–300 GB space for metadata; SSD, minimum 2000 IOPS

Software Requirements

OS - Windows 7/Server 2008 or newer, Linux or Solaris, Mac (without installer)

Java - JRE (Java Runtime Environment) version 8 (update 151) or higher (for a 64-bit architecture) is a prerequisite for MANTA. JRE is NOT part of the MANTA package. It is the sole responsibility of the customer to obtain JRE. The versions tested by MANTA are OpenJDK and Oracle. Each distributor has its own licence conditions, and it is up to the customer to fulfill those conditions; MANTA is not in any way responsible for the customer's compliance with the required conditions.

User Access Requirements - Dedicated OS user with limited privileges under which MANTA will run. This is not required but highly recommended to limit access of MANTA to other data and applications for security reasons.

Java Installation Instructions - Every MANTA product requires Java installation, specifically Java Runtime Environment (JRE). JRE version 8 update 151 or higher (for a 64-bit architecture) is a prerequisite for MANTA.

Installing the technical lineage server and configuring and running your first lineage analysis

For detailed instructions on how to install the technical lineage server (MANTA) and run and schedule your lineage analysis, see the supplemental documentation file “MANTA Flow Installation and Usage Manual”.

data.world bridge

The data.world Bridge provides a customer-hosted connectivity hub where connections are outbound-only, short-lived, and highly secure. It is powered by Trustgrid, and trusted by banks, healthcare companies, and many other organizations with very strict security and compliance needs.

Customer-hosted metadata collectors and the technical lineage server do not require inbound connectivity; however, data virtualization / federated query and data.world-hosted metadata collection do. The data.world Bridge enables a secure, outbound-only method for brokering those connections.

The data.world Bridge is provided by data.world, and can be implemented as a hardware appliance or a software virtual appliance (image support for AWS, vSphere, Microsoft Azure, and Google Cloud Platform).

High-level architecture

data.world handles the Application, Gateway Node, and Cloud Management aspects. The customer is responsible only for the Edge Node(s) and the connection to the Database (which represents the databases, BI tools, etc. that you want to connect to data.world).

We recommend a high availability (HA) implementation, and will provide up to 2 nodes for customers if they are using the data.world Bridge.

dataworld_bridge.png

Network

If you firewall outbound traffic to the internet, or if there will be a firewall between the nodes and your servers, please be sure that the necessary firewall rules are in place to allow traffic from the data.world Bridge. Please contact support with any questions regarding the necessary firewall rules.

If your environment involves multiple VLANs, prepare the desired switch ports to allow access to the correct VLAN.

Standard (one node)

1 Public IP Address

1 Private IP Address

2 DNS Servers

HA (two nodes)

2 Public IP Addresses

2 Private IP Addresses

1 Private IP Address for Cluster IP

2 DNS Servers

Recommended Software Requirements

4 vCPU (or equivalent)

4 GB RAM

30 GB disk space

Installing the data.world bridge

For detailed instructions on how to install the data.world Bridge, see the supplemental documentation materials.

Enhanced security and privacy features

In addition to requiring a single sign-on (SSO) to access your organization’s data, organizations on data.world can have additional enhanced security and privacy features enabled to prevent the accidental sharing of confidential information. In this article we’ll cover:

  • Enhanced security and privacy features

  • How to get the features enabled or disabled

  • Support implications

Features

The following features are available to be enabled per organization on data.world:

  • Disable Open Datasets and Projects - This feature disables the option to make a resource on data.world “Open” for our wider community. The option will not be presented as a setting for any resource, and the default privacy will be “Private” for all resources.

  • Disable Guest Authorizations - Requires any contributor to a dataset or project to be an existing member of your organization. This prevents a dataset admin from adding the wrong “John Smith” to a resource, for example. Combined with SSO, this feature makes it so only proven organization members can receive invites to resources.

  • Disable Share URLs - Removes the ability to create “Share URLs” of resources or data on data.world. Typically these URLs are used for the easy sharing of data without requiring the recipient of a share URL to be logged into data.world.

  • Disable Organization Member Invites - Requires that all organization members be provisioned through SSO. This feature disables the ability to invite community members directly to your organization. All organization members would have to join the organization through your SSO provider, for example by clicking on the application link or tile on the home page of most SSO providers, like Okta or Azure ADFS.

How to get features enabled or disabled

Email support@data.world to enable any of the above listed features for your organization.

Support implications

Enabling any or all of these features can change the typical experience of using data.world. For example, Disable Guest Authorizations will prevent our support and solutions staff from having direct access to a resource, and Disable Open Datasets and Projects will prevent you from publishing an Open dataset on data.world under your organization’s namespace, even if the data was meant to be Open.

Security best practices

There are several best practices you can follow to improve the security of your data and manage access to it on data.world.

Use organization-owned connections

The Connection Manager on your organization page allows for connections to be managed by only organization administrators. All database and dataset connections are audited and reportable.

Leverage identity integration

Integrate with your identity management system, such as Okta or Ping. data.world supports pre-provisioned accounts with SAML authentication, or just-in-time (JIT) SAML provisioning. Your identity management system will provide you the ability to manage token expiration, password policy, multi-factor authentication, conditional access restrictions, and more in conjunction with your data.world solution.

SAML

Use an SSO application with your provider to authorize access to your organization's data.

Turn off organization visibility

By default, organizations in a VPC environment do not show up in a list of data.world organizations. This feature is also available for multi-tenant clients: any organization can be configured so that it does not show up in the publicly visible list of data.world organizations.

Never share keys or tokens

Some third-party applications may require an API token or key to work with data.world. If you have such a key or token, or one for data.world's metadata catalog collector, you should never share it with anyone else. These tokens run as your user with your permission levels. Every user who needs an API token should have their own for security and accountability.
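One practical way to avoid sharing a token by accident is to keep it out of scripts and source control and read it from the environment at run time. The sketch below is an illustration under stated assumptions: the environment variable name is hypothetical, and the bearer-style Authorization header is assumed rather than confirmed here, so check the API documentation for the scheme your integration expects.

```python
import os

# Hypothetical variable name; use whatever your tooling expects.
TOKEN_ENV_VAR = "DW_AUTH_TOKEN"


def auth_headers(environ=os.environ) -> dict:
    """Build request headers from a per-user token held in the environment.

    Raises RuntimeError if the token is missing, so a script fails loudly
    instead of falling back to someone else's credentials.
    """
    token = environ.get(TOKEN_ENV_VAR, "").strip()
    if not token:
        raise RuntimeError(f"Set {TOKEN_ENV_VAR} to your own API token")
    # Assumption: the API accepts a bearer token in the Authorization header.
    return {"Authorization": f"Bearer {token}"}
```

Each user exports their own token in their shell or secrets manager, so scripts can be shared freely while the credentials stay personal.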

Upload restrictions

Uploads can be restricted, including to 0 GB (uploads disabled), to prevent users from manually adding data to the platform.

Provide masked/limited file previews on discoverable datasets

When evaluating data, users often need to understand not only the column names and other descriptive metadata but also some example rows. Masking or limiting the sample rows lets you provide previews while still meeting sensitive-data or compliance requirements.