Store data: datasets

What is a dataset?

A dataset is the basic repository for data files and associated metadata, documentation, scripts, and any other supporting resources that should be stored alongside the data. Datasets are where all data is stored and documented for later sharing and use in projects. They are the building blocks for projects, and each contains data and metadata related to a topic. The files and tabular data in a dataset are meant to be used--queried and analyzed--in one or more projects. Datasets are meant to be reusable resources: they can be combined with other datasets in projects, or they can be a single source for querying and analysis in a project.

Datasets can be owned by an individual or an organization, and a dataset provides an additional layer of access permissions to the data in a project. Because permissions are assigned at both the dataset and the project level, an individual can create a project available to the public, but if the individual adds any datasets owned by an organization to the project, those datasets won't be visible to the public--only to the other people in the organization. Not only is the dataset not visible, but any queries in the project written against that dataset are also not visible except to members of the organization.
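
The two-level check described above can be sketched in Python. This is a minimal model under stated assumptions, not data.world's implementation: the Resource class, its fields, and the visible_datasets function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A dataset or project with an owner and a simple visibility rule (hypothetical model)."""
    name: str
    owner: str                                  # user or organization name
    public: bool = False
    members: set = field(default_factory=set)   # people allowed to see a non-public resource

    def visible_to(self, user: str) -> bool:
        return self.public or user == self.owner or user in self.members


def visible_datasets(user: str, project: Resource, datasets: list) -> list:
    """Return the names of linked datasets the user can see inside the project.

    A dataset is visible only if the user passes BOTH the project check
    and the dataset's own check.
    """
    if not project.visible_to(user):
        return []
    return [d.name for d in datasets if d.visible_to(user)]


project = Resource("public-analysis", owner="alice", public=True)
org_data = Resource("org-sales", owner="acme", members={"alice", "bob"})
open_data = Resource("city-census", owner="alice", public=True)

print(visible_datasets("bob", project, [org_data, open_data]))    # organization member
print(visible_datasets("carol", project, [org_data, open_data]))  # member of the public
```

In this sketch the public project is open to everyone, but the organization-owned dataset inside it is listed only for organization members.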

Because datasets are linked to projects, any changes to the data or the metadata in the dataset show up automatically in the linked project. Linking data to a project instead of copying it into the project means that everything is kept up to date throughout your organization.

When to use a dataset and when to use a project

Generally, if you are adding data to share, or private data that you might conceivably want to reuse in other projects, it's better to add the data to a dataset. If the data is in a dataset, all of its metadata will automatically show up in your project because the dataset is linked instead of copied. All changes to the original dataset--including automatic updates from the source and manual updates by the dataset owner to the metadata--will also be conveyed.

The table below summarizes the differences between adding data files to a dataset vs. to a project:

Dataset vs. Project

Capability | Dataset | Project
--- | --- | ---
Can run and save queries against | X | X
Can have charts/visualizations | | X
Can incorporate different file types | X | X
Can contain multiple files | X | X
Can be shared/have contributors | X | X
Can have a discussion thread | X | X
Can include insights | | X
Can use existing data.world datasets without having to download and reimport them and having to recreate the associated metadata | | X
Can be included in a project | X |
Can be shared for others to use in their own datasets and projects | X |

Create a dataset

When you create a dataset, it might be because you have a database or other tabular data that you want to analyze and share. But data from a database isn't the only kind of data you can put in a dataset. Any file type can be saved there. Check out our article on supported file types for information about various file types and the ways they are handled.

There are several ways datasets can be created:

  • Manually - we'll walk through that here

  • Via our API - instructions available in our API docs

  • Through super connectors like Stitch, KNIME, Knots, and Singer - instructions can be found in our integration documentation under super connectors

  • With Sparklebot - for data portals or enterprise companies, contact data.world to find out more about our tools to automate creation and syncing of your data resources. This can be full data and metadata mirroring, or simply a catalog of your data sources with metadata and sample data where you'd like.
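
For the API route, a request to create a dataset might be assembled as in the sketch below. The base URL, the path, the JSON field names, and the visibility value are assumptions made for illustration, and build_create_dataset_request is a hypothetical helper; check the API docs for the actual endpoint and fields. The request is only built here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.data.world/v0"  # assumed base URL; confirm in the API docs


def build_create_dataset_request(owner: str, title: str, visibility: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an HTTP request that would create a dataset.

    The path and JSON field names are assumptions for illustration.
    """
    body = json.dumps({"title": title, "visibility": visibility}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/datasets/{owner}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


request = build_create_dataset_request("my-org", "Quarterly sales", "PRIVATE", token="YOUR_API_TOKEN")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the owner, title, and visibility before anything is created.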

+New dataset

While logged in to data.world, click +New in the upper right corner of your window to create a new dataset. You'll be prompted to choose either a dataset or a project:

Image_2019-09-24_at_10.18.15_AM.png

Choose Create new dataset and you'll be prompted to name the dataset and set its ownership and accessibility. If you are in one or more organizations, by default the owner field will contain the name of one of the organizations you are in. You can also set the owner to be yourself or any of the other organizations you are in by selecting the dropdown on the owner field:

Image_2019-09-24_at_10.22.58_AM.png

Dataset owner and permissions

If the dataset is intended to be used in the organization, it should typically be created with the organization as the owner. In this way the dataset benefits from the organization's service tier, permissions can be easily set based on the members of the organization, and datasets remain available within the organization even as individuals and permissions change. Permissions on a dataset owned by an organization can be set either to No one or to everyone in the organization:

Image_2019-12-02_at_11.21.20_AM.png

If you are not in any data.world organizations, you will automatically be set as the owner of the dataset, and you can choose to keep the dataset private or to share it with the data.world community:

Image_2019-09-24_at_10.27.25_AM.png

The number of private datasets you are allowed is determined by your user license--you can create as many public datasets as you would like. More information on account types and pricing can be found on our pricing page. There are several factors to consider when deciding whether to make your dataset public or private:

  • When you make a dataset public you allow others to use that dataset in their own projects and build from it. They can't change your dataset in any way or even save queries to it, but they can use and share it.

  • Data that is public on data.world can be downloaded from data.world and used externally. If your data is proprietary or sensitive, it shouldn't be shared.

  • Publicly shared datasets add to the amount of information that is available to everyone for analyzing, visualizing, and learning from.

More information on permissions can be found in the article Understanding permissions.

Setting permissions

Whatever permissions are set for the dataset also pass through to any projects that use the dataset. So if the dataset is shared with no one, only you will be able to use it in a project, and even if the project in which you include it is open to everyone, no one else will be able to see that dataset. Permissions can always be edited later. After you create your dataset you can document your objective for it, add data to it, or continue on to the overview.

Screen_Shot_2018-12-06_at_2.10.18_PM.png

Crowdsourced datasets

In data.world, datasets and projects can be owned by individuals or organizations. They can be private, shared with an organization, or shared with the public. With the crowdsourcing feature, individuals can even set the ownership of a dataset or project to an organization that they don't belong to, as long as that organization is configured to accept ownership proposals. In this article we'll cover:

  • Configuring an organization you administer to accept dataset proposals

  • Setting ownership of a resource to an organization you don't belong to

  • What happens after an ownership proposal is made

Configuring an organization you administer to accept dataset proposals

On an organization's page in data.world there is a settings tab where administrators of the organization can set preferences for membership in the organization and whether it accepts datasets proposed by individuals outside the organization:

Screen_Shot_2020-03-05_at_11.39.20_AM.png

If the organization is configured to accept dataset proposals, any community member can create projects and datasets for the organization, subject to admin approval. Proposed resources count towards the organization’s resource limit. Once the organization hits the limit, users will not be able to submit proposals and will see a message saying, “This organization is not accepting datasets or projects right now.”

Setting ownership of a resource to an organization you don't belong to

When you create a new dataset or project, one of the ownership options you can choose is any organization that is accepting proposals. On the Create a new dataset dialog there is an option to see what organizations are accepting proposals:

Screen_Shot_2020-03-05_at_11.54.53_AM.png

When you follow the link you will be able to search for an organization by name or select from a list:

Screen_Shot_2020-03-05_at_11.55.30_AM.png

After you select the owner it's a good idea to enter a description and upload or link data so that the organization admin

Screen_Shot_2020-03-05_at_11.56.57_AM.png
Screen_Shot_2020-03-05_at_11.57.17_AM.png

What happens after an ownership proposal is made

When you propose a dataset or project to an organization it is only visible to you and the org admin(s) until it has been approved by the admin(s) and shared. If the admin(s) choose not to share it, it will remain visible only to you and them--no one else in the organization. You will also no longer be able to change the access privileges (make it discoverable, or publicly available) or invite anyone else to contribute to it.

How to get your data into data.world

There are several ways to get your data or metadata into data.world and there is no one right choice. There are benefits to each method, and which you choose will depend on several factors:

  • What format is your data in?

  • Where is your data currently located?

  • What is the size of your data?

  • How often does it update?

  • What are you going to do with it?

Databases or data warehouses

If you work with a JDBC-compatible database or warehouse, you can use our API to directly sync your data into a dataset. Below is a view of the databases supported by our API, and a current list can be found on our database integrations page.

Screen_Shot_2019-04-03_at_4.33.40_PM.png

If you wish to leave your data at rest in your existing database or data warehouse, whether it’s on-premises or in the cloud, our Enterprise Tier supports virtualized access to that data source.

Local files

The best way to get files from your computer onto data.world is to upload them directly into a dataset. Files uploaded from a computer cannot be automatically synced or updated, but you can manually push new versions up to your dataset, replacing the previous version, as needed. When a new version is uploaded, the older version is still available for auditability and versioning. More about uploading data from local files and versioning can be found in our article Adding data files.
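
The versioning behavior described above (a new upload under the same name replaces the current file while earlier versions stay available) can be modeled with a small sketch. VersionedFiles is a hypothetical in-memory class for illustration only and does not reflect how data.world stores files.

```python
from collections import defaultdict


class VersionedFiles:
    """Toy in-memory model: uploading under an existing name adds a new version."""

    def __init__(self):
        self._versions = defaultdict(list)  # file name -> list of contents, oldest first

    def upload(self, name: str, content: bytes) -> int:
        """Store content as the newest version and return its 1-based version number."""
        self._versions[name].append(content)
        return len(self._versions[name])

    def current(self, name: str) -> bytes:
        """Return the newest version of the file."""
        return self._versions[name][-1]

    def history(self, name: str) -> list:
        """Return every stored version, oldest first."""
        return list(self._versions[name])


files = VersionedFiles()
files.upload("sales.csv", b"region,total\nwest,10\n")
files.upload("sales.csv", b"region,total\nwest,12\n")     # same name: becomes the current version
files.upload("sales_v2.csv", b"region,total\nwest,12\n")  # different name: a separate file
```

Uploading under a different name creates a second file with its own history, which is why keeping the same name matters when you want to replace a file.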

Cloud-based storage

Documents that are stored in cloud-based storage services (e.g., in Google Drive, Box, Dropbox or Amazon S3) can be easily added to data.world with one of our integrations and set to sync so that they update automatically:

Screen_Shot_2019-04-03_at_9.47.02_AM.png

As with manual updating, versions of files that are automatically updated are also kept for reference. More information about adding cloud-based files can also be found in the article Adding data files.

Excel spreadsheets

For Excel spreadsheets, data.world has created a specific add-in that's available on AppSource or from within Excel. The add-in allows you to work with your data in Excel while at the same time sharing it in a dataset with others who may not have or use Excel:

Screen_Shot_2019-04-03_at_10.06.17_PM.png

See our Excel integration page for more information. Of course, if you prefer, you can always upload your Excel spreadsheet into a dataset like you would any other file type, or put it in a cloud service like Google Drive, Box or Dropbox and add it to the dataset from there so that it syncs automatically. Versions of Excel files that are uploaded or synced are also kept for future reference.

Data from real-time sources via streaming

You might have data that updates in real time that you would like to put on data.world. This data could be something like log files, test metrics or tracking data. The best way to integrate this data into a dataset is to use data.world's streaming API. Unlike the methods previously mentioned, which pull data from the source on a regularly scheduled basis, data brought in through the streaming API can be pushed into a dataset based on a change to the original data. Because it's triggered by data events and not by arbitrary time intervals, using the streaming API is the best way to manage real-time data. You can read more about streaming in our API Quickstart guide.
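
The difference between scheduled pulls and event-driven pushes can be illustrated with a small sketch. It does not call the streaming API; ChangeTriggeredPusher is a hypothetical helper, and the push callback is a stand-in for whatever call sends records to a stream.

```python
import hashlib
from typing import Callable


class ChangeTriggeredPusher:
    """Calls `push` only when the observed data differs from what was last pushed."""

    def __init__(self, push: Callable[[bytes], None]):
        self._push = push
        self._last_digest = None

    def observe(self, data: bytes) -> bool:
        """Return True if the data changed and was pushed, False otherwise."""
        digest = hashlib.sha256(data).hexdigest()
        if digest == self._last_digest:
            return False
        self._push(data)
        self._last_digest = digest
        return True


sent = []  # stand-in for a call to a streaming endpoint
pusher = ChangeTriggeredPusher(sent.append)
results = [pusher.observe(b"temp=20"), pusher.observe(b"temp=20"), pusher.observe(b"temp=21")]
```

Only the first and third observations are pushed, because the second carries no change. A scheduled pull, by contrast, would run three times regardless.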

For those less comfortable working directly with an API, data.world also integrates with several superconnectors like IFTTT, Knots, Singer or Stitch. While easier to use, they are less flexible and versatile than our own streaming API. You can see a full list of our superconnector integrations on our superconnector integrations page.

Screen_Shot_2019-04-03_at_3.47.52_PM.png

Data via a URL or RESTful API

Another common source of data is a URL or RESTful API available on the internet. If you have a Google Sheets doc, for example, you can add it to a data.world dataset. With data.world's option to add from a URL, you can sync data from any publicly accessible site, and even from a password-protected site. Detailed instructions for adding and syncing data from a URL can be found in the article Adding files from a URL. If you do not own the data from the web that you'd like to bring into data.world, you can find out more about licensing and data in the article Licensing and data you found.

If you have data that is behind an API that you'd like to put on data.world--e.g., data from Salesforce, Facebook Ads, Google Ads, etc.--the best way to get it into a dataset is to use one of the superconnectors shown above. More information about our sales and marketing app integrations can be found here.

On-premises data

In addition to data that is available to data.world via cloud sources or APIs, some data that you might want to make accessible on data.world might only be available on your corporate network or behind a firewall. For customers with a need to catalog data behind a firewall, we make our Virtual Data Connector available as an appliance that can be hosted at your site and communicates with data.world via a secure bridge protocol. This option is available to our enterprise tier customers. If you have this need, please contact our sales team at sales@data.world and they will help you with your options.
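
The guidance in this section can be condensed into a simple lookup. The sketch below paraphrases the recommendations above; the dictionary keys and the recommend function are hypothetical labels chosen for illustration, and the right choice still depends on the size, update frequency, and intended use of your data.

```python
# Recommendations paraphrased from the sections above; keys are hypothetical labels.
RECOMMENDED_METHOD = {
    "database": "sync through the API (JDBC-compatible sources) or use virtualized access",
    "local_file": "upload directly into a dataset and re-upload new versions manually",
    "cloud_storage": "add through a cloud storage integration and set it to sync",
    "excel": "use the Excel add-in, or upload or sync the file like any other",
    "real_time": "push through the streaming API or a superconnector",
    "url": "add from a URL and sync",
    "behind_firewall": "Virtual Data Connector (enterprise tier)",
}


def recommend(source_kind: str) -> str:
    """Return the suggested way to bring a kind of source into a dataset."""
    try:
        return RECOMMENDED_METHOD[source_kind]
    except KeyError:
        raise ValueError(f"unknown source kind: {source_kind!r}") from None


print(recommend("cloud_storage"))
```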

Verifying your data with data inspectors

When you ingest a tabular data file on data.world it is run through a series of inspections to validate both the structure and content of the data in the file. If issues are found, the file is flagged with a warning. Warnings are indicated by either a yellow triangle or a red circle, depending on the severity. The warning flag can be found on the dataset overview page under the name of the file:

Screen_Shot_2018-12-30_at_4.25.37_PM.png

or in the About this file section, under Inspections, in the dataset or project workspace for the file:

Screen_Shot_2018-12-30_at_5.39.17_PM.png

The number of warnings is listed to the right of the flag. Yellow triangles are by far the most common. They alert you to potential problems with the data that might affect your ability to query it, or warn you that sensitive data (social security numbers, phone numbers, email addresses, etc.) was detected.
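
To give a feel for what a sensitive-data inspection might look for, here is a minimal sketch. It is not data.world's inspector: the patterns are deliberately simple and US-centric, and find_sensitive is a hypothetical function written for this example.

```python
import re

# Deliberately simple, US-centric patterns; real detectors are more thorough.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_sensitive(rows):
    """Return (row_index, column_index, kind) for every cell matching a pattern."""
    warnings = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            for kind, pattern in PATTERNS.items():
                if pattern.search(cell):
                    warnings.append((r, c, kind))
    return warnings


rows = [
    ["name", "contact", "id"],
    ["Ann", "ann@example.com", "123-45-6789"],
    ["Bob", "555-867-5309", "n/a"],
]
warnings = find_sensitive(rows)
```

Each warning records the location and kind of the match, mirroring the way the warnings dialog reports the type and location of each issue.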

Very occasionally you will get a red flag which indicates that there was an error on ingest and data from the original file was lost. Possible reasons for the loss of data include:

  • The original file was corrupt.

  • There was a data type mismatch between the data type identified for the column and the data stored in it.

  • Data that you choose to connect to a specified linked data class had values that didn't match the linked data.

For a list of all the inspection warnings and errors, see the article Data Inspectors.

Whether you get a yellow warning or a red error, you have the option to correct it or ignore it. If you get yellow warnings, click on the flag to open the warning dialog box and view the warning types and locations. The dialog groups the warnings by type so you can review them one kind at a time. Each type of warning is labeled with what kind of issue it is, how many were found, and the location of each. Some flags are for issues you already know about and don't wish to fix. Those warnings you can simply dismiss:

Screen_Shot_2018-12-30_at_4.30.22_PM.png

Note: Once you have dismissed a set of warnings, it will not show up for the file again--even if you delete and reimport the file or update it. The ONLY way to get a list of all the warnings back is to delete the file and ingest it again with a different name.

If you wish to correct the issues with files that were originally added to data.world by a direct add, you can:

  1. download the file from data.world

  2. make the corrections (the locations in the warnings will help you find them)

  3. re-upload the file using the same name - by using the same name, you'll overwrite the file on data.world (as opposed to creating a new file, which would occur if you changed the name)

For files that are synchronized from external services (such as cloud storage services), you will need to:

  1. update the file in the source system

  2. either select the Sync now button from the details window:

    verifying-your-data-1.png

    or

    from the workspace, choose the Sync now button on the right sidebar:

    verifying-your-data-2.png

Sometimes changes that you make to the data dictionary will cause error warnings in the data. In the example below, one of the columns in the file being ingested holds the ages of shark-attack victims. Some of the values in the column are "20's", "30 or 40", etc. If I wanted to restrict the data being imported to only integers so I could use arithmetic functions on it, I could go into the data dictionary for the file after import and set the column type from string to integer. Doing this would immediately cause a red flag on the inspections, as some of the data would be left out on re-ingest due to a data type mismatch:

Screen_Shot_2018-12-30_at_6.11.05_PM.png
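
Before changing a column's type, it can help to check locally which values would fail the conversion. The sketch below uses the age values from the example above; non_integer_values is a hypothetical helper, not part of data.world.

```python
def non_integer_values(column):
    """Return (row_index, value) pairs for values that cannot be read as integers."""
    problems = []
    for index, value in enumerate(column):
        try:
            int(value.strip())
        except ValueError:
            problems.append((index, value))
    return problems


ages = ["25", "20's", "30 or 40", "41"]
problems = non_integer_values(ages)
print(problems)
```

Any value reported here would be dropped if the column were re-ingested as an integer, so you can decide whether to clean the source file first or keep the column as a string.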

Document your data

Once you have created a dataset or project and added your files to it, you can make it easier to find and more useful to others by describing, or documenting, it. Documenting consists of creating the metadata for your dataset or project, and it helps others to trust your data and work. Searches on data.world also look at titles, descriptions, summaries, and tags to match search strings, so the more completely you describe your data, the better its chance of being found.

The starting point for describing your data is the dataset or project overview page. From here you can edit the description, create the summary, assign tags, set the licensing, and complete the data dictionary:

Screen_Shot_2018-09-24_at_5.35.11_PM.png

Description

Datasets, projects, all the files in each, and all the columns in any structured data files have description fields associated with them. Descriptions are very short and serve as a quick reference for the item they describe. To edit the description for a dataset you can select Edit next to the description, Edit next to About this dataset, or navigate to the Settings tab:

Screen_Shot_2018-09-24_at_5.35.11_PM.png

Summary

The summary is one of two documents created with a dataset or project. The summary is where all of the information about the origin of the data, why you created the dataset, further documentation of your work, etc. is found. Use the Summary section to tell your data's story. For example:

  • Where did the data come from? Cite and link to your sources or include your details for a 'citation request'. Not only does this give credit where credit is due, but it helps other people evaluate the data's suitability for their needs.

  • If you think a particular piece of context will be useful to others, add it.

  • The best summaries cover the "who, what, where, when, why, and how" of the data.

  • What's the data telling you? What would others be interested to know about it? What have others found using this data?

  • If the data has associated data dictionaries or other documentation, upload it and then link to it from your Summary.

  • Summaries are created and edited in either the data.world Simple Editor or in Markdown.

Organizing a dataset with file labels

When a dataset contains many files, determining the purpose of each file can be difficult without examining it more thoroughly. By adding file labels, you can see each file's category at a glance. The following file labels are available:

  • raw data

  • clean data

  • documentation

  • script

  • visualization

  • report

You can add file labels from a dataset overview page by clicking the three dots icon on the right side of the file preview and choosing Edit file metadata:

file-labels-01.png

That will bring up a new window where you can add file labels and edit other file metadata:

file-labels-02.png

To update those labels later, just follow the same steps and add or remove them as needed.

Tagging

Tags are a powerful feature that you can use in a variety of ways to facilitate access to your data. For example, tags can be used to organize and group your dataset or project by topic, category, source, department, or team. They can be searched for explicitly with the tag search operator, and can also help to filter down more generic search results.

You can add or remove tags from a dataset or project's Overview page with either the Edit or Add tags links on the right side:

tagging-1.png

There is no limit to the number of tags you can use for a dataset, and there is an autofill feature on the tag field. If the dataset is owned by an organization, the tags displayed for autofill are chosen from all the tags used by the organization. If the dataset is not owned by an organization, the autofill suggestions are from a generic list of tags as well as from tags you have recently created.
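
The autofill rule described above can be sketched as follows. This is an illustrative model only: suggest_tags and its parameters are hypothetical, and the actual matching and ordering used by data.world may differ.

```python
def suggest_tags(prefix, org_tags=None, generic_tags=(), recent_tags=()):
    """Return sorted tag suggestions that start with `prefix` (case-insensitive).

    If `org_tags` is given, the dataset is treated as organization-owned and
    only the organization's tags are offered; otherwise generic and recently
    created tags are combined.
    """
    pool = set(org_tags) if org_tags is not None else set(generic_tags) | set(recent_tags)
    prefix = prefix.lower()
    return sorted(tag for tag in pool if tag.lower().startswith(prefix))


org = suggest_tags("fin", org_tags=["finance", "FinOps", "hr"])
personal = suggest_tags("he", generic_tags=["health", "housing"], recent_tags=["heatmap"])
```

Drawing suggestions from the organization's existing tags encourages consistent tagging across its datasets, which in turn makes tag searches more reliable.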

Data dictionary

The data dictionary contains all the metadata (data about the data) for the files, tables and columns in a dataset. For all files it contains:

  • The names of all the files in the dataset

  • A place to add descriptions for each file

  • The labels for each file

and for tabular files it has:

  • The column names

  • The format of the data in each column

  • A place to add a description for each column
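
The contents listed above can be pictured as a small data structure. The class and field names below are hypothetical and chosen for illustration; they do not describe data.world's internal schema or any export format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ColumnEntry:
    name: str
    data_format: str            # e.g. "string" or "integer"
    description: str = ""


@dataclass
class FileEntry:
    name: str
    description: str = ""
    labels: List[str] = field(default_factory=list)
    columns: List[ColumnEntry] = field(default_factory=list)  # empty for non-tabular files

    @property
    def is_tabular(self) -> bool:
        return bool(self.columns)


dictionary = [
    FileEntry("readme.pdf", "Collection notes", labels=["documentation"]),
    FileEntry(
        "attacks.csv",
        "One row per incident",
        labels=["raw data"],
        columns=[
            ColumnEntry("age", "string", "Victim age as reported"),
            ColumnEntry("activity", "string"),
        ],
    ),
]

# Columns that still need a description
undocumented = [c.name for f in dictionary for c in f.columns if not c.description]
```

A structure like this makes it straightforward to spot gaps, such as columns that have no description yet.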

You can get to the data dictionary either from the Overview tab (right below the Summary) or from the Documents section in the left pane of the workspace:

Screen_Shot_2018-12-07_at_3.48.48_PM.png
Screen_Shot_2018-12-08_at_2.47.46_PM.png

Data dictionary entries for each file are edited separately by selecting the Edit link next to the filename in the data dictionary document. Every file--no matter what type--has a data dictionary entry which contains the file metadata for the file:

Screen_Shot_2018-09-27_at_4.58.18_PM.png

Tabular files also have optional Advanced settings and CSV settings sections in their file metadata:

Screen_Shot_2019-04-15_at_12.22.41_PM.png

The Authentication setting allows you to specify password, token, or OAuth parameters if the source URL requires authentication. The Headers setting is to specify options to modify the response from the URL, e.g., to specify a file content type. The Post body setting enables you to switch the request method from GET to POST if the source URL requires it.
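
In HTTP terms, these three settings correspond to familiar parts of a request. The sketch below shows the idea with Python's standard library; build_source_request is a hypothetical helper, HTTP Basic authentication is used as one example of password authentication, and the URLs are placeholders. The requests are built but not sent.

```python
import base64
import urllib.request


def build_source_request(url, username=None, password=None, headers=None, post_body=None):
    """Build (without sending) a request for a source URL.

    Supplying `post_body` switches the method from GET to POST.
    """
    all_headers = dict(headers or {})
    if username is not None and password is not None:
        credentials = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
        all_headers["Authorization"] = f"Basic {credentials}"
    data = post_body.encode("utf-8") if post_body is not None else None
    return urllib.request.Request(url, data=data, headers=all_headers)


plain = build_source_request("https://example.com/data.csv", headers={"Accept": "text/csv"})
query = build_source_request(
    "https://example.com/export",
    username="reader",
    password="secret",
    post_body='{"format": "csv"}',
)
print(plain.get_method(), query.get_method())
```

The first request is a plain GET with an extra header; the second carries credentials and a body, so it is sent as a POST.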

The CSV settings section manages how your comma separated value format files are handled. To access it, select Show to the right of the section:

Screen_Shot_2019-04-15_at_12.37.52_PM.png

Tabular files also have a tab for columnar metadata in their data dictionary where you can rename the columns, change their format, and add descriptions for them:

Screen_Shot_2018-09-27_at_5.04.59_PM.png

Changing column names and adding a description is a great way to avoid the ambiguity that comes from having multiple columns with the same name. It also renders obscure column names understandable.

Changes to column names, descriptions, and data types propagate throughout data.world to every project that references the dataset, and the changes remain even if the data is updated from an external source.

Setting a license type

Setting a license type for a dataset is important to explicitly define how others may use the data. Licensing is determined by two factors:

  • The licensing of the source documents in the dataset

  • The wishes of the dataset owner

The general rule of thumb is that the most restrictive license among the source materials is the least restrictive license that can be used for the dataset. However--existing source licensing again being taken into account--the owner of the dataset can choose even more stringent licensing for others who wish to use the dataset. To set the license type:

  • Go to the dataset overview page

  • Click Edit next to About this dataset on the right side of the screen

licensing-01.png

  • Choose the appropriate license from the Public license dropdown menu and save

licensing-02.png
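
The rule of thumb for choosing a license (the most restrictive source license sets the floor for the dataset) can be sketched as below. The license labels and their numeric ranks are hypothetical and exist only to illustrate the rule; real licenses do not always fall on a single scale, so treat the ranking as a simplification.

```python
# Hypothetical restrictiveness ranks, for illustration only: higher means more restrictive.
RESTRICTIVENESS = {
    "public-domain": 0,
    "attribution": 1,
    "attribution-sharealike": 2,
    "attribution-noncommercial": 3,
}


def least_restrictive_allowed(source_licenses):
    """The most restrictive source license sets the floor for the dataset."""
    return max(source_licenses, key=RESTRICTIVENESS.__getitem__)


def is_allowed(dataset_license, source_licenses):
    """A dataset license is acceptable if it is at least as restrictive as the floor."""
    floor = least_restrictive_allowed(source_licenses)
    return RESTRICTIVENESS[dataset_license] >= RESTRICTIVENESS[floor]


sources = ["public-domain", "attribution-sharealike"]
floor = least_restrictive_allowed(sources)
print(floor, is_allowed("attribution", sources), is_allowed("attribution-noncommercial", sources))
```

In this example the owner could not publish under the looser "attribution" label, but could choose a stricter one than the floor.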

When creating a dataset on data.world, use the following articles to help determine the license type you should use: