
Document your data

Once you have created a dataset or project and added your files to it, you can make it easier to find and more useful to others by documenting it. Documenting means creating the metadata for your dataset or project, which helps others trust your data and your work. Searches on data.world also match search strings against titles, descriptions, summaries, and tags, so the more completely you describe your data, the more likely it is to be found.

The starting point for describing your data is the dataset or project overview page. From here you can edit the description, create the summary, assign tags, set the licensing, and complete the data dictionary:

Screen_Shot_2018-09-24_at_5.35.11_PM.png

Description

Datasets, projects, the files in each, and the columns in any structured data files all have description fields. Descriptions are very short and serve as a quick reference for the item they describe. To edit the description for a dataset, select Edit next to the description, select Edit next to About this dataset, or navigate to the Settings tab:

Screen_Shot_2018-09-24_at_5.35.11_PM.png

Summary

The summary is one of two documents created with a dataset or project. It holds the information about the origin of the data, why you created the dataset, further documentation of your work, and so on. Use the Summary section to tell your data's story. For example:

  • Where did the data come from? Cite and link to your sources or include your details for a 'citation request'. Not only does this give credit where credit is due, but it helps other people evaluate the data's suitability for their needs.

  • If you think a particular piece of context will be useful to others, add it.

  • The best summaries cover the "who, what, where, when, why, and how" of the data.

  • What's the data telling you? What would others be interested to know about it? What have others found using this data?

  • If the data has associated data dictionaries or other documentation, upload it and then link to it from your Summary.

  • Summaries are created and edited in either the data.world Simple Editor or in Markdown.
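Because summaries can be written in Markdown, it can help to start from a reusable outline. The following is a minimal sketch that turns the prompts above into a Markdown skeleton. The function name `summary_skeleton`, the section headings, and the example title are illustrative assumptions, not a layout that data.world requires.

```python
# Hypothetical helper: builds a Markdown skeleton for a dataset summary.
# The headings below are one possible layout, not a data.world requirement.

SECTIONS = [
    ("Source", "Where did the data come from? Cite and link to your sources."),
    ("Who, what, where, when, why, and how", "Describe how and why the data was collected."),
    ("Findings", "What is the data telling you? What have others found?"),
    ("Documentation", "Link to uploaded data dictionaries or other documentation."),
]


def summary_skeleton(title: str) -> str:
    """Return a Markdown outline with one heading per summary section."""
    lines = [f"# {title}", ""]
    for heading, prompt in SECTIONS:
        lines.append(f"## {heading}")
        lines.append(f"_{prompt}_")  # italic prompt, to be replaced with real text
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example title is made up for illustration.
    print(summary_skeleton("City tree inventory"))
```

Paste the printed outline into the Summary editor in Markdown mode and replace each italic prompt with your own text.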

Organizing a dataset with file labels

When a dataset contains many files, determining the purpose of each file can be difficult without examining it more thoroughly. By adding file labels, you can see each file's category at a glance. The following file labels are available:

  • raw data

  • clean data

  • documentation

  • script

  • visualization

  • report

You can add file labels from a dataset overview page by clicking the three dots icon on the right side of the file preview and choosing Edit file metadata:

file-labels-01.png

That will bring up a new window where you can add file labels and edit other file metadata:

file-labels-02.png

To update those labels later, just follow the same steps and add or remove them as needed.
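If you plan labels for many files before entering them in the Edit file metadata window, a quick local check can catch labels that are not among the six listed above. This is a minimal sketch that runs entirely on your own machine. The function `invalid_labels` and the file names are hypothetical, and nothing here calls data.world.

```python
from __future__ import annotations

# The six file labels listed in this article.
AVAILABLE_LABELS = {
    "raw data", "clean data", "documentation",
    "script", "visualization", "report",
}


def invalid_labels(planned: dict[str, list[str]]) -> dict[str, list[str]]:
    """Return, per file, the planned labels that are not in AVAILABLE_LABELS."""
    problems = {}
    for filename, labels in planned.items():
        unknown = [
            label for label in labels
            if label.strip().lower() not in AVAILABLE_LABELS
        ]
        if unknown:
            problems[filename] = unknown
    return problems


if __name__ == "__main__":
    plan = {  # hypothetical file names
        "survey_2018.csv": ["raw data"],
        "clean_survey.py": ["script", "cleanup"],
    }
    print(invalid_labels(plan))  # {'clean_survey.py': ['cleanup']}
```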

Tagging

Tags are a powerful feature that you can use in a variety of ways to facilitate access to your data. For example, tags can be used to organize and group your dataset or project by topic, category, source, department, or team. They can be searched for explicitly with the tag search operator, and can also help to filter down more generic search results.

You can add or remove tags from a dataset or project's Overview page with either the Edit or Add tags links on the right side:

tagging-1.png

There is no limit to the number of tags you can use for a dataset, and there is an autofill feature on the tag field. If the dataset is owned by an organization, the tags displayed for autofill are chosen from all the tags used by the organization. If the dataset is not owned by an organization, the autofill suggestions are from a generic list of tags as well as from tags you have recently created.

Data dictionary

The data dictionary contains all the metadata (data about the data) for the files, tables and columns in a dataset. For all files it contains:

  • The names of all the files in the dataset

  • A place to add descriptions for each file

  • The labels for each file

For tabular files, it also contains:

  • The column names

  • The format of the data in each column

  • A place to add a description for each column
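To prepare column descriptions before uploading, you can draft the same three pieces of column metadata locally. The sketch below reads CSV text and lists each column's name, a rough guess at its format, and an empty description to fill in. The format names (`integer`, `decimal`, `string`) and the guessing rule are assumptions for illustration and may differ from the formats data.world assigns.

```python
import csv
import io


def guess_format(values):
    """Very rough guess: 'integer', 'decimal', or 'string' (assumed names)."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    for label, cast in (("integer", int), ("decimal", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def draft_dictionary(csv_text):
    """Return one entry per column: name, guessed format, empty description."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    entries = []
    for index, name in enumerate(header):
        # Skip short rows rather than failing on ragged data.
        column = [row[index] for row in body if index < len(row)]
        entries.append(
            {"name": name, "format": guess_format(column), "description": ""}
        )
    return entries


if __name__ == "__main__":
    sample = "id,price,city\n1,2.5,Austin\n2,3,Dallas\n"  # made-up sample data
    for entry in draft_dictionary(sample):
        print(entry)
```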

You can get to the data dictionary either from the Overview tab (right below the Summary) or from the Documents section in the left pane of the workspace:

Screen_Shot_2018-12-07_at_3.48.48_PM.png
Screen_Shot_2018-12-08_at_2.47.46_PM.png

Data dictionary entries are edited separately for each file by selecting the Edit link next to the filename in the data dictionary document. Every file, no matter what type, has a data dictionary entry that contains its file metadata:

Screen_Shot_2018-09-27_at_4.58.18_PM.png

Tabular files also have additional options in their file metadata: optional advanced settings and CSV settings:

Screen_Shot_2019-04-15_at_12.22.41_PM.png

The Authentication setting lets you specify password, token, or OAuth parameters if the source URL requires authentication. The Headers setting lets you specify options that modify the response from the URL, for example a file content type. The Post body setting lets you switch the request method from GET to POST if the source URL requires it.
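In plain HTTP terms, these three settings correspond to an authorization header, extra request headers, and a request body. The sketch below uses Python's standard `urllib.request` to show how they shape a request. It illustrates the HTTP concepts only and is not how data.world implements them. The function `build_request`, the bearer-token scheme, the JSON body, and the example URLs are all assumptions.

```python
import json
import urllib.request


def build_request(url, token=None, headers=None, post_body=None):
    """Build (but do not send) a request for a source URL."""
    all_headers = dict(headers or {})
    if token is not None:
        # Bearer tokens are one common scheme; a source may expect another.
        all_headers["Authorization"] = f"Bearer {token}"
    data = None
    if post_body is not None:
        # Assumes the source accepts a JSON body.
        data = json.dumps(post_body).encode("utf-8")
        all_headers.setdefault("Content-Type", "application/json")
    # urllib uses POST when data is supplied and GET otherwise.
    return urllib.request.Request(url, data=data, headers=all_headers)


if __name__ == "__main__":
    # example.com URLs are placeholders.
    get_req = build_request(
        "https://example.com/data.csv",
        token="abc",
        headers={"Accept": "text/csv"},
    )
    print(get_req.get_method())   # GET
    post_req = build_request("https://example.com/export", post_body={"year": 2018})
    print(post_req.get_method())  # POST
```

Supplying a body is what switches the method from GET to POST, which mirrors the description of the Post body setting.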

The CSV settings section controls how your comma-separated values (CSV) files are handled. To access it, select Show to the right of the section:

Screen_Shot_2019-04-15_at_12.37.52_PM.png
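If you are unsure which values a file needs, you can inspect it locally first. The sketch below uses `csv.Sniffer` from Python's standard library to guess a file's delimiter and quote character. It assumes that delimiter and quote character are the kinds of options you want to check, and it does not reflect the exact options in the CSV settings section. The function name `detect_csv_settings` and the sample data are hypothetical.

```python
import csv


def detect_csv_settings(sample):
    """Guess the delimiter and quote character from a text sample."""
    # Restrict the search to a few common delimiters.
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    return {
        "delimiter": dialect.delimiter,
        "quotechar": dialect.quotechar,
    }


if __name__ == "__main__":
    sample = "name;age;city\nAna;34;Lima\nBo;28;Oslo\n"  # made-up sample data
    print(detect_csv_settings(sample))
```

`csv.Sniffer` is heuristic and raises `csv.Error` when it cannot determine a delimiter, so treat its result as a starting point and confirm it against the file.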

Tabular files also have a tab for columnar metadata in their data dictionary where you can rename the columns, change their format, and add descriptions for them:

Screen_Shot_2018-09-27_at_5.04.59_PM.png

Changing column names and adding a description is a great way to avoid the ambiguity that comes from having multiple columns with the same name. It also renders obscure column names understandable.

Changes to column names, descriptions, and data types propagate throughout data.world to every project that references the dataset, and the changes remain even if the data is updated from an external source.

Setting a license type

Setting a license type for a dataset is important because it explicitly defines how others may use the data. Licensing is determined by two factors:

  • The licensing of the source documents in the dataset

  • The wishes of the dataset owner

The general rule of thumb is that the most restrictive license among the source materials is the least restrictive license that can be used for the dataset. However, still taking existing source licensing into account, the owner of the dataset can choose even more stringent licensing for others who wish to use the dataset. To set the license type:

  • Go to the dataset overview page

  • Click Edit next to About this dataset on the right side of the screen

licensing-01.png

  • Choose the appropriate license from the Public license dropdown menu and save

licensing-02.png
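The rule of thumb for choosing a license can be expressed as a small filter: find the strictest source license, then keep only the licenses that are at least that strict. The sketch below assumes you can rank your candidate licenses from least to most restrictive, which is a simplification, since real licenses do not always fall into a single order. The function `allowed_licenses` and the license names and ranks are hypothetical placeholders.

```python
def allowed_licenses(source_licenses, restrictiveness):
    """Return licenses at least as restrictive as the strictest source license.

    `restrictiveness` maps license name -> rank (higher = more restrictive).
    The ranks are supplied by the caller; this sketch does not define them.
    """
    floor = max(restrictiveness[name] for name in source_licenses)
    return sorted(
        (name for name, rank in restrictiveness.items() if rank >= floor),
        key=lambda name: restrictiveness[name],
    )


if __name__ == "__main__":
    # Hypothetical license names and ranks, for illustration only.
    ranks = {"open": 0, "attribution": 1, "non-commercial": 2}
    print(allowed_licenses(["open", "attribution"], ranks))
    # ['attribution', 'non-commercial']
```

The first item in the result is the least restrictive license the dataset can use. The later items are the more stringent options the owner may choose instead.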

When creating a dataset on data.world, use the following articles to help determine the license type you should use: