Enterprise docs

3rd party connectors

dw-jdbc is a JDBC driver for connecting to datasets hosted on data.world. It can be used to provide read-only access to any dataset provided by data.world from any JVM language. dw-jdbc supports query access both in SQL (data.world's dialect - see details in our data.world SQL tutorial) and in SPARQL 1.1, the native query language for semantic web data sources.

JDBC URLs

JDBC connects to data source based on a provided JDBC url. data.world JDBC urls have the form

jdbc:data:world:[language]:[user id]:[dataset id]

where [language] is either "sql" or "sparql",[user id] is the data.world id of the dataset owner, and [dataset id] is the data.world identifier for the dataset.

Sample code (Java 8)
final String QUERY = "select * from HallOfFame where playerID = ? order by yearid, playerID limit 10"
final String URL = "jdbc:data:world:sql:dave:lahman-sabremetrics-dataset";
try (final Connection connection = // get a connection to the database, which will automatically be closed when done
             DriverManager.getConnection(URL, "<your user name>", "<your API token>");
     final PreparedStatement statement = // get a connection to the database, which will automatically be closed when done
             connection.prepareStatement(QUERY)) {
    statement.setString(1, "alexape01"); //bind a query parameter
    try (final ResultSet resultSet = statement.executeQuery()) { //execute the query
        ResultSetMetaData rsmd = resultSet.getMetaData();  //print out the column headers
        int columnsNumber = rsmd.getColumnCount();
        for (int i = 1; i <= columnsNumber; i++) {
            if (i > 1) System.out.print(",  ");
            System.out.print(rsmd.getColumnName(i));
        }
        System.out.println("");
        while (resultSet.next()) { //loop through the query results
            for (int i = 1; i <= columnsNumber; i++) { //print out the column headers
                if (i > 1) System.out.print(",  ");
                String columnValue = resultSet.getString(i);
                System.out.print(columnValue);
            }
            System.out.println("");
        }
    }
}
Building dw-jdbc

dw-jdbc can be built from the command-line using mvn clean install.

Finding your Token

More resources for using Java/JVM/JDBC-enabled tools:

If you have a database hosted on Amazon Web Services(AWS) that is not publicly accessible, you can allow data.world to connect to it by using an SSH tunnel

In this scenario, you will launch a publicly accessible SSH server (sometimes called a bastion server) in the same AWS Virtual Private Cloud (VPC) as your AWS database.

You will then configure data.world to connect to your SSH server instead of directly to the database. The public SSH server will forward data.world's requests to the private database.

By using this type of connection, you can keep your database hidden from the public internet and instead rely on your SSH server to handle the security and access control for connections to that database.

Find your database's VPC and Port

Login to your AWS Management Console.

For RDS databases(including SQL Server, MySQL, PostgreSQL, or Oracle), click on Database>RDS in the center of the page.

Click on Databases on the left side of the Amazon RDS page.

Click the DB Instances link in the middle of the page to see a list of active databases.

Click on the name of the database in the DB identifier column to open its details page.

Within the Connectivity and Security section at the bottom of the page, take note of the following information:

  • Endpoint & port > Endpoint

  • Endpoint & port > Port

  • Networking > VPC

Create an SSH server on an Amazon EC2 instance

Go to the AWS Management Console main page and click Compute>EC2 from the center of the page.

In the middle of the page, select Launch Instance.

The Instance configuration requires the following steps:

  1. Choose AMI (Amazon Machine Image): Any Linux based AMI will be appropriate for this setup - SSH is the only program required.

  2. Choose Instance Type: This will be a low memory and storage application. Depending on your usage requirements, a free tier t2.micro instance may be sufficient.

  3. Configure Instance: ensure the following settings are configured:

    • Network: select your database's VPC from Task 1

    • Subnet: Choose a public subnet

    • Auto-assign Public IP: Use subnet setting(Enable)

  4. Add Storage: Accept defaults

  5. Add Tags: Accept defaults

  6. Configure Security Group:

    • Assign a security group: Select Create a new security group

    • Security group name: data.world to SSH

    • Description: Bastion server for forwarding requests from data.world to private database

      Modify the first line of the security rules to show the following values:

    • Type: SSH

    • Protocol: TCP

    • Port Range: 22

    • Source: Enter your public IP address followed by /32. You can use a Google search to find your IP

    • Description: My IP for configuration

      Click Add Rule and enter:

    • Type: SSH

    • Protocol: TCP

    • Port Range: 22

    • Source: 52.3.83.134/32

    • Description: data.world inbound connection

      Click Add Rule and enter:

    • Type: SSH

    • Protocol: TCP

    • Port Range: 22

    • Source: 52.205.195.10/32

    • Description: data.world inbound connection

      Click Add Rule and enter:

    • Type: SSH

    • Protocol: TCP

    • Port Range: 22

    • Source: 52.205.207.86/32

    • Description: data.world inbound connection

      AWS_task2_1.png
  7. Review: Verify the above settings were entered correctly and create the instance.

  8. When launching the instance, you will be prompted to select an existing key pair or create a new key pair. If you are using a previous key pair, you will need to have a copy of the key .pem file stored on your local computer from when you created the keys. If you're creating a new key pair, download that .pem file now.

    aws_keys.png
Enable forwarding from the SSH server to the database

Go to the AWS Management Console and select Compute>EC2.

Select Instance>Instances on the left side.

Find the instance you just created from the list in the main section of the page. Click on it to load its details in the lower part of the page.

You'll need to know the Public DNS and Private IP addresses shown for the next step - so keep them handy by creating a new, duplicate browser tab to complete the next set of steps.

AWS_task3_1.png
Create a new security group
  1. In your new tab, select Network & Security>Security Groups on the left side of the EC2 page.

  2. Click Create Security Group

  3. Enter the following values:

    • Security group name: SSH to your database name

    • Description: "allows traffic from the SSH server to database"

    • VPC: this is the same VPC used for the SSH server and database

  4. With the Inbound tab selected at the bottom of the window, click Add Rule

  5. A new row will populate in the list of rules. Enter the following values:

    • Type: Custom TCP Rule

    • Protocol: TCP

    • Port Range: the port number of the database

    • Source: Custom; in the blank box to the right enter the Private IP address with /32 added to the end

Add the security group to your database
  1. From the AWS Management Console, click on Database>RDS

  2. Select Databases on the left side

  3. Click on the link to your database in the DB identifier column

  4. On the database details page, click on the Modify button on the top right

  5. Scroll down to the Network & Security section; from the Security group drop down menu, add the security group that you created in the previous section (e.g. "SSH to your database name")

  6. Save the changes by scrolling to the bottom of the page and clicking the Continue button

  7. On the following page, choose when to apply the changes, then click Modify DB instance

Configure an SSH user for data.world to use to connect
  1. On MacOS or Linux, open a new Terminal window. For Windows, use an SSH client such as Putty or OpenSSH

  2. Within the terminal, navigate to the location where you downloaded the .pem key file you generated

  3. Set the permissions for the key file to be not publicly viewable, using the appropriate name if your key pair is different than our example of ssh_tunnel:

    chmod 400 ssh_tunnel.pem
  4. Connect to your AWS SSH server from the terminal using the default user for your EC2 instance. AWS will generate the command tailored to your specific instance - to find that command, you can navigate to your EC2 Instance details page and click the Connect button at the top.

    AWS_task4_1.png

    The general form will be:

    ssh -i "your key file.pem" <Your default EC2 user>@<Your EC2 Public DNS>

    For other options, please see additional guidance from Amazon at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstances.html

  5. Once you're connected to your server, create a user group named datadotworld:

    sudo groupadd datadotworld 
  6. Create a user named ddw:

    sudo useradd -m -g datadotworld ddw
  7. Switch to the user named ddw

    sudo su - ddw
  8. Create a hidden directory called .ssh to upload your public key, setting its permissions appropriately:

    mkdir ~/.ssh
    chmod 700 ~/.ssh
  9. Enable read and write permissions for the owner on that file

  10. Create an empty file called authorized_keys and provide read and write access to its owner:

    touch authorized_keys
    chmod 600 authorized_keys
  11. Get the public key value from data.world by doing the following:

    • Open a new browser tab

    • While logged into data.world, go to https://data.world/integrations/categories/database and click on the tile of the database type you'd like to connect

    • If that integration has not yet been enabled, click the Enable Integration button (otherwise go the Manage tab and choose Add new connection)

    • In the new window that opens, click the Advanced Settings tab and check the Use SSH Tunnel box

    • Copy the SSH public key

      AWS_task4_2.png
    • Leave this window open as you'll return to it to complete the configuration

  12. Back in the terminal, add the public key to your authorized_keys file with the following command:

    echo "<Your Public Key>" >> ~/.ssh/authorized_keys

    Include the quotation marks but replace <Your Public Key> with the key you just copied from data.world in the previous step

Configure the database connection in data.world
  1. Return to the database configuration tab on data.world that you opened in the previous step

  2. Within the Advanced Settings, enter the following values:

    • SSH host: this is the Public DNS(IPv4) value in your EC2 instance created in Task 2

    • SSH user: the name of the SSH user you created in Task 4 - that's "ddw" if you followed our suggestion

  3. Go to the General Settings tab and enter the following values:

    • Display name: your choice - this is how your database connection will appear in data.world

    • Host/IP: the Endpoint value

    • Port: the Port value

    • Connection username: a valid user in your database instance

    • Connection password: the database password for that user

  4. Test the configuration and save.

With your database configured, you can now use any of the Add data mechanisms to import data into a dataset or project on data.world.

KNOTS is a quick and intuitive visual ETL tool that allows you to do complex data replication with ease. Using the visual interface, you can now easily bring your data into data.world with the power of Singer taps and targets. Taps and targets are applications that can be combined together to create simple data pipelines. Taps extract data from a source while targets consume them.

Using KNOTS, you can import data from a number of datastores on an ad-hoc basis or you can export a fully-configured knot and run it with a job scheduler to make sure your data is always up-to-date. With the intuitive interface, you can configure robust data replication processes in minutes, and KNOTS is always free.

With KNOTS and data.world, you can:

  • Leverage ETL without complex manual configuration

  • Quickly bring data from various datastores into data.world

  • Pause and resume replication without any loss of progress

KNOTS is currently only available on MacOS. A Windows version is in the works; if you'd like to be notified when it's complete, please create a new ticket.

Getting Started

Download and install the latest version. For Mac, you'll want to use the DMG installer.

KNOTS currently support the following data sources:

  • Adwords

  • Amazon S3

  • Facebook

  • MySQL

  • Postgres

  • Redshift

  • Salesforce

Please let us know if there’s another source that you would love to see added.

Prerequisites

KNOTS depends on Docker being installed and running. Docker is a tool designed to make it easier to create, deploy, and run containers. Containers allow us to package up an application or library with all of its dependencies. Each individual tap and target is packaged into its own container with the correct set of dependencies to ensure they’re all easy to use.

The installer for Docker for Mac is available here.

NOTE: Check Docker file sharing preferences and make sure that "/Users" is a shared directory.

Running the app
  1. From the home screen, click on "Get Started", or "New knot" on the upper right-hand corner

  2. Select a tap to use from the list of available taps

taps.png
  • Provide the configuration values required by the tap

configure_tap.png
  1. Click on "Continue" to run the tap in Discovery mode and determine which tables/streams are available

  2. Select the tables/streams that you would like to sync

  3. Select the data.world target from the list. To configure it, select the dataset/project that will be used to contain your data, and include your API token.

  4. Enter a name for the new knot, and click on "Save & Run" to execute it

Once the process has finished, click on "Done" to return to the home screen. You should now see a list of your saved knots and the various actions that can be taken on them.

Knot actions:

  • Sync new data: Sync from the point of last run

  • Sync all data: Sync from the beginning

  • Edit: Modify the configurations for the tap and the target

  • Export: Downloads a ZIP file with the tap and target, and their configurations

  • Delete: Removes the knot

Run with a scheduler

As of today, KNOTS allows you to update the data by manually clicking on the Sync new data action. By making use of the Export action though, and the resources that it provides, it's possible to set up a job that will update the data automatically on a schedule.

The exported package is a ZIP file that includes the tap and target for your knot, as well as their configurations, and a Makefile. You can read more about Makefiles here, but the gist of it is that it contains all the necessary commands to run the knot and keep your data updated. Through the use of a tool like crontab, you can run the knot from your local computer, or it can be set up to be executed from a cloud service like Amazon Web Services or Heroku.

Support

If you run into issues or have questions, please submit a ticket to the data.world team.