When to link a dataset and when to download and reimport it

There are many great projects and datasets on data.world, and at some point you will likely want to use data from them in your own work. There are two ways to reuse data on data.world: linking to it, or downloading and re-uploading it. Which option you choose depends on a few factors:

  • Is the source data in a project or a dataset?

  • How well does the source data meet your needs?

  • Is the data either streamed or regularly updated?

If the data is in a dataset (as opposed to a project) and is well-documented, concise, and clean, you may well want to link to it. However, if you need to make changes to it, you'll need to download it, edit it, and re-upload it.
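
The rule of thumb above can be sketched as a small decision helper. This is only an illustration in Python, not part of data.world: the `SourceData` fields and the `choose_reuse_method` function are hypothetical names that restate the questions listed earlier, and the fallback at the end is an assumption about how to treat sources that are not clean or well-documented.

```python
from dataclasses import dataclass


@dataclass
class SourceData:
    """Answers to the questions above for one source (hypothetical fields)."""
    in_dataset: bool       # False if the files only exist in a project
    well_documented: bool  # good summary, source references, data dictionary
    clean: bool            # no cleanup needed before use
    needs_changes: bool    # you want to edit, extend, or subset the data


def choose_reuse_method(source: SourceData) -> str:
    """Return "link" or "download and re-upload" for the given source."""
    # Files that only live in a project have to be downloaded and re-uploaded.
    if not source.in_dataset:
        return "download and re-upload"
    # Any change to the data requires your own copy.
    if source.needs_changes:
        return "download and re-upload"
    # A clean, well-documented dataset is a good candidate for linking.
    if source.well_documented and source.clean:
        return "link"
    # Assumption: anything else is easier to fix in your own copy.
    return "download and re-upload"


print(choose_reuse_method(SourceData(True, True, True, False)))
```

A clean, well-documented dataset that you do not need to change comes back as a candidate for linking; every other combination falls through to downloading and re-uploading.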

Some reasons you might choose to link to the dataset include:

  • You don't need to make any additions to the source dataset (e.g., adding extra columns with data.world linked-data fact tables)

  • The source dataset is already clean, so you don't need to clean it up

  • The dataset is well-documented with a good dataset summary, references to the original source, and a complete data dictionary

  • The dataset is automatically updated from an external source

Some reasons you might choose to download and re-upload data include:

  • You want to add columns from data.world linked-data fact tables (e.g., US census region, currencies, or ICD10 medical codes)

  • You only want to use a subset of the files in a dataset and don't want the rest of the files adding unnecessary complexity to your dataset or project

  • The data files would benefit from cleaning for clarity (e.g., removing blank columns, removing columns containing a single value, or changing file or column names)

  • The data files only exist in a project and not in a dataset

  • The data dictionary and/or dataset summary are incomplete and you do not have write privileges to the dataset
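
As an illustration of the cleanup step mentioned in the list above, here is a minimal sketch that you might run on a downloaded CSV file before re-uploading it. It uses only the Python standard library; the function name `clean_csv`, the sample data, and the column names are all hypothetical. It treats a column as blank when every value is empty and as single-valued when every row holds the same value.

```python
import csv
import io
from typing import Dict, Optional


def clean_csv(text: str, renames: Optional[Dict[str, str]] = None) -> str:
    """Drop blank and single-value columns from CSV text, then rename headers."""
    renames = renames or {}
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 3:
        # With fewer than two data rows, every column looks single-valued.
        return text
    header, body = rows[0], rows[1:]

    keep = []
    for i in range(len(header)):
        values = {row[i].strip() for row in body if i < len(row)}
        # One distinct value means the column is blank or constant: drop it.
        if len(values) > 1:
            keep.append(i)

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([renames.get(header[i], header[i]) for i in keep])
    for row in body:
        writer.writerow([row[i] if i < len(row) else "" for i in keep])
    return out.getvalue()


# Made-up sample: "notes" is blank and "country" holds a single value.
raw = "st,notes,country,pop\nTX,,US,100\nOH,,US,200\n"
print(clean_csv(raw, {"st": "state", "pop": "population"}))
```

In this sample the `notes` and `country` columns are dropped, and `st` and `pop` are renamed to `state` and `population`.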

The table below summarizes the differences between linking to a dataset and downloading and re-uploading its data files:

Linked vs Reimported Data

                                                                                          Linked   Reimported
Can add to a project                                                                      X        X
Can add to a dataset                                                                      -        X
Can extend data with data.world linked-data fact tables                                   -        X
Can edit data dictionary and dataset summary                                              -        X
Must recreate the dataset summary and the data dictionary for every file in the dataset   -        X
Do not have to use all of the files in a dataset                                          -        X
Can reuse data dictionary and other metadata                                              X        -
Automatically updated from original dataset                                               X        -
Must include all the files in a given dataset                                             X        -
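
To make the idea of extending reimported data with extra columns more concrete, here is a rough offline analogue in Python. It is not the data.world linked-data feature itself: the `CENSUS_REGION` lookup is a small hand-written stand-in for a fact table, and `add_region_column` and the sample column names are hypothetical.

```python
import csv
import io

# Hand-written stand-in for a fact table: state code -> US census region.
CENSUS_REGION = {"TX": "South", "OH": "Midwest", "CA": "West", "NY": "Northeast"}


def add_region_column(text: str, state_column: str = "state") -> str:
    """Append a census_region column to CSV text, looked up by state code."""
    reader = csv.DictReader(io.StringIO(text))
    fieldnames = list(reader.fieldnames or []) + ["census_region"]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Unknown state codes get an empty region instead of raising an error.
        row["census_region"] = CENSUS_REGION.get(row[state_column], "")
        writer.writerow(row)
    return out.getvalue()


print(add_region_column("state,population\nTX,100\nOH,200\n"))
```

Because the extra column changes the data, a file produced this way would have to be re-uploaded to your own dataset rather than linked.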