
Store data: datasets

What is a dataset?

A dataset is the basic repository for data files and their associated metadata, documentation, scripts, and any other supporting resources that should be stored alongside the data. Datasets are where data on a given topic is stored and documented for later sharing, and they are the building blocks for projects. The files and tabular data in a dataset are meant to be used (queried and analyzed) in one or more projects. Because datasets are reusable resources, a dataset can be combined with other datasets in a project, or it can serve as the single source for querying and analysis in a project.

Datasets can be owned by an individual or an organization, and a dataset provides an additional layer of access permissions to the data in a project. Because permissions are assigned at both the dataset and the project level, an individual can create a project that is available to the public, but any datasets owned by an organization that are added to the project won't be visible to the public, only to the other people in the organization. Queries in the project written against such a dataset are likewise visible only to members of the organization.
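The two permission layers can be pictured with a small model. This is only an illustrative sketch, not how data.world implements access control: the Dataset and Project classes, their fields, and the visible_datasets function are all hypothetical names made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    owner: str                                   # an individual or an organization
    members: set = field(default_factory=set)    # accounts allowed to see it (assumed model)
    public: bool = False


@dataclass
class Project:
    name: str
    public: bool
    datasets: list = field(default_factory=list)  # linked datasets


def visible_datasets(project, viewer):
    """Return the names of linked datasets that the viewer can see.

    viewer is an account name, or None for an anonymous visitor.
    """
    if not project.public and viewer is None:
        return []
    return [d.name for d in project.datasets
            if d.public or viewer in d.members]


org_data = Dataset("sales", owner="acme", members={"ana", "ben"})
open_data = Dataset("census", owner="ana", public=True)
project = Project("public-analysis", public=True, datasets=[org_data, open_data])

print(visible_datasets(project, viewer=None))    # ['census']
print(visible_datasets(project, viewer="ben"))   # ['sales', 'census']
```

Even though the project itself is public, the anonymous visitor sees only the public dataset, while a member of the organization sees both.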

Because datasets are linked to projects, any changes to the data or the metadata in the dataset show up automatically in the linked project. Linking data to a project instead of copying it into the project means that everything is kept up to date throughout your organization.

When to use a dataset and when to use a project

Generally, if you are uploading data to share, or data that is private but that you might want to reuse in other projects, it's better to add the data to a dataset. If the data is in a dataset, all of its metadata automatically shows up in your project because the dataset is linked rather than copied. All changes to the original dataset, including automatic updates from the source and manual updates to the metadata by the dataset owner, are also carried over to the project.

The table below summarizes the differences between adding data files to a dataset vs. to a project:

Dataset vs. Project

| Capability | dataset | project |
|---|---|---|
| Can run and save queries against | X | X |
| Can have charts/visualizations |  | X |
| Can incorporate different file types | X | X |
| Can contain multiple files | X | X |
| Can be shared/have contributors | X | X |
| Can have a discussion thread | X | X |
| Can include insights |  | X |
| Can use existing data.world datasets without having to download and reimport them and having to recreate the associated metadata |  | X |
| Can be included in a project | X |  |
| Can be shared for others to use in their own datasets and projects | X |  |

Verifying your data with data inspectors

When you ingest a tabular data file on data.world it is run through a series of inspections to validate both the structure and content of the data in the file. If issues are found, the file is flagged with a warning. Warnings are indicated by either a yellow triangle or a red circle, depending on the severity. The warning flag can be found on the dataset overview page under the name of the file:

Screen_Shot_2018-12-30_at_4.25.37_PM.png

or in the About this file section, under Inspections, in the dataset or project workspace for the file:

Screen_Shot_2018-12-30_at_5.39.17_PM.png

The number of warnings is listed to the right of the flag. Yellow triangles are by far the most common. They alert you to potential problems with the data that might affect your ability to query it, or warn you that sensitive data (social security numbers, phone numbers, email addresses, etc.) was detected.
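To get a feel for what a sensitive-data check involves, here is a minimal sketch that scans rows for the three kinds of values mentioned above. It is not the inspector that data.world runs; the patterns are deliberately simple, US-centric assumptions, and the names PATTERNS and find_sensitive are made up for this example.

```python
import re

# Simplified, assumed patterns; real-world detection needs to be far more careful.
PATTERNS = {
    "social security number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone number": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_sensitive(rows):
    """Yield (row_number, column_name, kind) for each suspicious cell."""
    for row_number, row in enumerate(rows, start=1):
        for column, value in row.items():
            for kind, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    yield row_number, column, kind


rows = [
    {"name": "Ana", "contact": "ana@example.com"},
    {"name": "Ben", "contact": "555-867-5309"},
    {"name": "Cy", "contact": "123-45-6789"},
]
for row_number, column, kind in find_sensitive(rows):
    print(f"row {row_number}, column {column!r}: possible {kind}")
```

Running a check like this on a file before you upload it can tell you in advance which columns are likely to be flagged.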

Very occasionally you will get a red flag, which indicates that there was an error on ingest and that data from the original file was lost. Possible reasons for the loss of data include:

  • The original file was corrupt.

  • There was a data type mismatch between the data type identified for the column and the data stored in it.

  • Data that you chose to connect to a specified linked data class had values that didn't match the linked data.

For a list of all the inspection warnings and errors, see the article Data Inspectors.

Whether you get a yellow warning or a red error, you have the option to correct it or ignore it. If you get yellow warnings, click the flag to open the warning dialog box and view the warning types and locations. The dialog groups the warnings by type so you can review them one kind at a time. Each type of warning is labeled with the kind of issue, how many were found, and the location of each. Some flags are for issues you already know about and don't wish to fix. You can simply dismiss those warnings:

Screen_Shot_2018-12-30_at_4.30.22_PM.png

Note: Once you have dismissed a set of warnings, it will not show up for the file again, even if you delete and reimport the file or update it. The ONLY way to get the full list of warnings back is to delete the file and ingest it again under a different name.

If you wish to correct the issues with files that were originally added to data.world by a direct add, you can:

  1. download the file from data.world

  2. make the corrections (the locations in the warnings will help you find them)

  3. re-upload the file using the same name; doing so overwrites the file on data.world, whereas a different name would create a new file
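If a file has many flagged cells, step 2 above can be scripted. The sketch below assumes you have copied the warning locations by hand into a dictionary keyed by data-row number and column name; that format, and the apply_corrections function, are assumptions for illustration and not something data.world exports. It works on CSV text in memory, so reading the downloaded file and saving the result under the same name is left to you.

```python
import csv
import io


def apply_corrections(csv_text, corrections):
    """Return corrected CSV text.

    corrections maps (row_number, column_name) to the new value, where
    row_number counts data rows from 1 (the header row is not counted).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row_number, row in enumerate(reader, start=1):
        for column in reader.fieldnames:
            key = (row_number, column)
            if key in corrections:
                row[column] = corrections[key]
        writer.writerow(row)
    return out.getvalue()


original = "id,age\n1,34\n2,thirty\n"
fixed = apply_corrections(original, {(2, "age"): "30"})
print(fixed)
```

Here the value "thirty" in the second data row is replaced with "30", and every other cell is written back unchanged.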

For files that are synchronized from external services (such as cloud storage services), you will need to:

  1. update the file in the source system

  2. either select the Sync now button from the details window:

    verifying-your-data-1.png

    or

    from the workspace, choose the Sync now button on the right sidebar:

    verifying-your-data-2.png

Sometimes changes that you make to the data dictionary will cause errors to be flagged in the data. In the example below, one of the columns in the file being ingested holds the ages of shark-attack victims. Some of the values in the column are "20's", "30 or 40", etc. If I wanted to restrict the imported data to integers so that I could use arithmetic functions on it, I could go into the data dictionary for the file after import and change the column type from string to integer. Doing this would immediately cause a red flag on the inspections, because some of the data would be left out on re-ingest due to a datatype mismatch:

Screen_Shot_2018-12-30_at_6.11.05_PM.png
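The effect of the type change can be previewed outside data.world with a few lines of code. This is a rough sketch that uses Python's own int() parsing as a stand-in; how data.world actually converts values on re-ingest is not described here, and split_by_integer and the sample ages are made up for this example.

```python
def split_by_integer(values):
    """Separate values that parse as whole numbers from those that do not."""
    kept, rejected = [], []
    for value in values:
        try:
            kept.append(int(value.strip()))
        except ValueError:
            rejected.append(value)
    return kept, rejected


ages = ["25", "20's", "41", "30 or 40", "17"]
kept, rejected = split_by_integer(ages)
print(kept)      # [25, 41, 17]
print(rejected)  # ["20's", '30 or 40']
```

The rejected values are the ones that would be dropped by a string-to-integer change, so a check like this shows how much data is at stake before you edit the data dictionary.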
