Aggregation and function support

Redshift aggregation support

| Aggregation | Support |
| --- | --- |
| approx_distinct | Emulated |
| approx_median | Emulated |
| approx_percentile | Emulated |
| arbitrary | Emulated |
| array_agg | Unavailable |
| avg | Natively Supported |
| bool_and | Natively Supported |
| bool_or | Natively Supported |
| checksum | Emulated |
| correlation | Emulated |
| count | Natively Supported |
| count(*) | Natively Supported |
| count_if | Natively Supported |
| covar_pop | Emulated |
| covar_samp | Emulated |
| group_concat | Emulated |
| kurtosis | Emulated |
| max | Natively Supported |
| max_by | Emulated |
| min | Natively Supported |
| min_by | Emulated |
| regr_avgx | Emulated |
| regr_avgy | Emulated |
| regr_count | Emulated |
| regr_intercept | Emulated |
| regr_r2 | Emulated |
| regr_slope | Emulated |
| regr_sxx | Emulated |
| regr_sxy | Emulated |
| regr_syy | Emulated |
| skewness | Emulated |
| std_pop | Natively Supported |
| std_samp | Natively Supported |
| stdev | Natively Supported |
| sum | Natively Supported |
| var_pop | Natively Supported |
| var_samp | Natively Supported |
| variance | Natively Supported |

Redshift function support

| Function | Support |
| --- | --- |
| abs | Natively Supported |
| acos | Natively Supported |
| array | Unavailable |
| array_append | Unavailable |
| array_concat | Unavailable |
| array_contains | Unavailable |
| array_join | Unavailable |
| array_length | Unavailable |
| array_prepend | Unavailable |
| asin | Natively Supported |
| at_time_zone | Natively Supported |
| atan | Natively Supported |
| atan2 | Natively Supported |
| attr_of | Emulated |
| ceiling | Natively Supported |
| char | Emulated |
| coalesce | Natively Supported |
| concat | Natively Supported |
| cos | Natively Supported |
| cosh | Emulated |
| current_user | Emulated |
| date_add | Natively Supported |
| date_diff | Natively Supported |
| date_format | Natively Supported |
| date_parse | Emulated |
| date_part | Natively Supported |
| date_sub | Natively Supported |
| date_trunc | Natively Supported |
| day | Natively Supported |
| degrees | Natively Supported |
| element_at | Unavailable |
| exp | Natively Supported |
| exp10 | Natively Supported |
| floor | Natively Supported |
| get_path | Emulated |
| greatest | Natively Supported |
| hours | Natively Supported |
| iri_of | Emulated |
| json_extract_scalar | Emulated |
| label_of | Emulated |
| least | Natively Supported |
| left | Natively Supported |
| length | Natively Supported |
| like | Natively Supported |
| log | Natively Supported |
| log10 | Natively Supported |
| lower | Natively Supported |
| lpad | Natively Supported |
| ltrim | Natively Supported |
| md5 | Natively Supported |
| mid | Natively Supported |
| minutes | Natively Supported |
| mod | Natively Supported |
| month | Natively Supported |
| now | Natively Supported |
| pi | Natively Supported |
| position | Natively Supported |
| pow | Natively Supported |
| radians | Natively Supported |
| rand | Natively Supported |
| random | Natively Supported |
| regex | Natively Supported |
| regexp_extract | Emulated |
| replace | Natively Supported |
| right | Natively Supported |
| round | Natively Supported |
| rpad | Natively Supported |
| rtrim | Natively Supported |
| seconds | Natively Supported |
| sha1 | Emulated |
| sha256 | Emulated |
| sha384 | Emulated |
| sha512 | Emulated |
| sign | Natively Supported |
| sin | Natively Supported |
| sinh | Emulated |
| sqrt | Natively Supported |
| string_split | Emulated |
| substring | Natively Supported |
| tan | Natively Supported |
| tanh | Emulated |
| trim | Natively Supported |
| upper | Natively Supported |
| url_extract_fragment | Emulated |
| url_extract_host | Emulated |
| url_extract_parameter | Emulated |
| url_extract_path | Emulated |
| url_extract_port | Emulated |
| url_extract_protocol | Emulated |
| url_extract_query | Emulated |
| year | Natively Supported |
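
The support tables above can be treated as a simple lookup when reviewing a query before running it against Redshift. The following is a minimal sketch of that idea, not part of data.world: the `SUPPORT` dictionary holds only a few rows copied from the tables, and the helper name `group_by_support` and the "Not listed" label are assumptions made for illustration.

```python
from collections import defaultdict

# Excerpt copied from the support tables above; extend it with the rows you need.
SUPPORT = {
    "avg": "Natively Supported",
    "count_if": "Natively Supported",
    "approx_median": "Emulated",
    "sha256": "Emulated",
    "array_agg": "Unavailable",
    "element_at": "Unavailable",
}


def group_by_support(function_names):
    """Group function names by their support level (hypothetical helper).

    Names missing from SUPPORT are grouped under "Not listed".
    """
    groups = defaultdict(list)
    for name in function_names:
        level = SUPPORT.get(name.lower(), "Not listed")
        groups[level].append(name)
    return dict(groups)


result = group_by_support(["avg", "SHA256", "element_at", "my_udf"])
print(result)
```

Anything that lands in the "Unavailable" group would need to be rewritten or computed elsewhere before the query can run against Redshift.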

Common license types for datasets

Common licenses in order of most open to most restrictive:

Public Domain Mark - Public Domain

Dedicate your dataset to the public domain: This isn’t technically a license since you are relinquishing all your rights in your dataset by choosing to dedicate your dataset to the public domain. To donate your work to the public domain, you can select “public domain” from the license menu when creating your dataset.

Open Data Commons Public Domain Dedication and License - PDDL

This license is one of the Open Data Commons licenses and is like a public domain dedication. It allows you, as a dataset owner, to use a license mechanism to surrender your rights in a dataset when you might not otherwise be able to dedicate your dataset to the public domain under applicable law.

Creative Commons Attribution 4.0 International - CC-BY

This license is one of the open Creative Commons licenses and allows users to share and adapt your dataset so long as they give credit to you.

Community Data License Agreement – CDLA-Permissive-2.0

This Community Data License Agreement is similar to permissive open source licenses such as the MIT license. It allows users to use, modify and adapt your dataset and the data within it, and to share it. The CDLA-Permissive-2.0 terms explicitly do not impose any obligations or restrictions on results obtained from users’ computational use of the data. The 2.0 version is significantly shorter and uses plain language to express the grant of permissions and requirements. The only obligation is to "make available the text of this agreement with the shared Data," including the disclaimer of warranties and liability.

Open Data Commons Attribution License - ODC-BY

This license is one of the Open Data Commons licenses and allows users to share and adapt your dataset so long as they give credit to you.

Creative Commons Attribution-ShareAlike 4.0 International - CC-BY-SA

This license is one of the open Creative Commons licenses and allows users to share and adapt your dataset so long as they give credit to you and distribute any additions, transformations or changes to your dataset under this license. We consider this license (a.k.a. a viral license) problematic since others may decide not to work with your CC-BY-SA licensed dataset if there is a risk that by doing so their work on your dataset will need to be shared under this license when they would rather use another license.

Community Data License Agreement – CDLA-Sharing-1.0

This license is one of the Community Data License Agreement licenses and was designed to embody the principles of "copyleft" in a data license. It allows users to use, modify and adapt your dataset and the data within it, and to share the dataset and data with their changes so long as they do so under the CDLA-Sharing and give credit to you. The CDLA-Sharing terms explicitly do not impose any obligations or restrictions on results obtained from users’ computational use of the data.

Open Data Commons Open Database License - ODC-ODbL

This license is one of the Open Data Commons licenses and allows users to share and adapt your dataset so long as they give credit to you and distribute any additions, transformations or changes to your dataset under this license. We consider this license (a.k.a. a viral license) problematic since others may decide not to work with your ODC-ODbL licensed dataset if there is a risk that by doing so their work on your dataset will need to be shared under this license when they would rather use another license.

Creative Commons Attribution-NonCommercial 4.0 International - CC BY-NC

This license is one of the more restrictive Creative Commons licenses. Users can share and adapt your dataset if they give credit to you and do not use your dataset for any commercial purposes.

Creative Commons Attribution-NoDerivatives 4.0 International - CC BY-ND

This license is one of the more restrictive Creative Commons licenses. Users can share your dataset if they give credit to you, but they cannot make any additions, transformations or changes to your dataset under this license.

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International - CC BY-NC-SA

This license is one of the most restrictive Creative Commons licenses. Users can share your dataset only if they (1) give credit to you, (2) do not use your dataset for any commercial purposes, and (3) distribute any additions, transformations or changes to your dataset under this license. We consider this license a viral license since users will need to share their work on your dataset under this same license and any users of the adapted dataset would likewise need to share their work on the adapted dataset under this license and so on for any other changes to those modified datasets.

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International - CC BY-NC-ND

This license is one of the most restrictive Creative Commons licenses. Users can share only your unmodified dataset if they give credit to you and do not share it for commercial purposes. Users cannot make any additions, transformations or changes to your dataset under this license.

Additional License Coverage Options

If a license is not listed in the data.world menu options, you may select Other and specify the details in the summary of your dataset.

No license specified

No one can use, share, distribute, re-post, add to, transform or change your dataset if you have not specified a license.

These descriptions are only summaries of these licenses. For the actual text of the licenses, which we strongly encourage you to read, click on the links provided.

Summary of common license types:

Public Domain

The work has been dedicated to the public domain by waiving all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

Attribution

You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

Share-alike

If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Non-commercial

You may not use the material for commercial purposes.

Database Only

License applies to the database only and not its contents or data.

No Derivatives

No Derivative Works. You may not alter, transform, or build upon this work.

All licenses that begin with CC-BY in the table above refer to version 4.0 of those licenses.

Please submit a ticket if you have additional licensing questions.

Data Inspections

When loading your file into data.world, the following warnings may be generated. These warnings will only be visible to the dataset owner and any contributors with write access to the dataset.

Warnings are informational only and may be ignored.

Data limits

The size of data files you can store on data.world is set by your account plan. To see your file limits, go to your profile > settings > billing. More information on free and paid accounts can be found here. Here's what we currently support:

Dataset Limits:

A dataset ingested by data.world may have a maximum size of 1GB and up to 250 individual files. Datasets from live connections have no size limit, nor do metadata management datasets created by metadata crawling.

Individual File Upload Limits:

The maximum size for an individual file is 1GB. If you have a file that is larger than that, try compressing the file to get it under the limit, but note that it would then only be available for download due to size constraints.

Inference & Preview Limits:

Non-tabular files that can be previewed display a file preview only if they are smaller than 40k. Images will be displayed beyond that limit if possible.

For xls / xlsx, the file must be less than 100MB uncompressed for us to support query and data preview functionality.

For other supported data files, we will provide data preview and query capabilities up to 1GB.

For deeper details we have tables with specific size limit and timeout information. Please contact us if your application requires a greater number of files or a larger maximum file size.
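
The limits above can be sketched as a local check to run before uploading. This is only an illustration under stated assumptions: the helper `check_dataset` is hypothetical and not part of any data.world tool, 1GB is taken to mean 1024^3 bytes (the text does not say which unit convention applies), the sizes passed in are assumed to be uncompressed, and datasets from live connections or metadata crawling are out of scope because they have no size limit.

```python
MB = 1024 ** 2  # assumption: binary units; the documentation does not specify
GB = 1024 ** 3

MAX_DATASET_BYTES = 1 * GB       # maximum size of an ingested dataset
MAX_FILES_PER_DATASET = 250      # maximum number of individual files
MAX_FILE_BYTES = 1 * GB          # maximum size of an individual file
MAX_EXCEL_QUERY_BYTES = 100 * MB # xls/xlsx must be smaller than this for query and preview


def check_dataset(files):
    """Return a list of problems for a dataset (hypothetical helper).

    files is a list of (name, size_in_bytes) tuples; an empty result means
    the dataset fits within the limits described above.
    """
    problems = []
    if len(files) > MAX_FILES_PER_DATASET:
        problems.append(f"too many files: {len(files)} > {MAX_FILES_PER_DATASET}")
    total = sum(size for _, size in files)
    if total > MAX_DATASET_BYTES:
        problems.append(f"dataset too large: {total} bytes > {MAX_DATASET_BYTES}")
    for name, size in files:
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: exceeds the 1GB per-file limit")
        elif name.lower().endswith((".xls", ".xlsx")) and size >= MAX_EXCEL_QUERY_BYTES:
            problems.append(f"{name}: too large for query and preview (must be under 100MB)")
    return problems


print(check_dataset([("sales.csv", 20 * MB), ("report.xlsx", 150 * MB)]))
```

The Excel check reports a warning rather than a hard failure in spirit: such a file can still be uploaded, but query and data preview would not be available for it.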

Definitions of common data.world terms

Name

Description

Summary

Administrator

The person in an organization who can manage organization members and access levels, and access all datasets and projects owned by the organization (even private ones).

API

Application Programming Interface

A set of routines, protocols, and tools for building software applications. Basically, an API specifies how software components should interact. Additionally, APIs are used when programming graphical user interface (GUI) components.

Article

Documentation on data.world is broken up into four different types. One of those types is the article, which gives instructions for a specific task or feature and is not hands-on.

Best practices

Best practices is a type of documentation that is instructional, not hands-on, and recommends a specific way of doing something.

Bookmarks

You can add a bookmark to any dataset or project that interests you, whether or not it is owned by you or your organization. Search is enabled in your bookmarks section to help you quickly find datasets or projects. If your data project is bookmarked, you can think of it as similar to a "like" on Facebook.

Business glossary

A list of terms defined as they are used in your specific business environment.

Catalog

A catalog is an organized list of information.

CC BY-NC

Creative Commons Attribution-NonCommercial 4.0 International

This license is one of the more restrictive Creative Commons licenses. Users can share and adapt your dataset if they give credit to you and do not use your dataset for any commercial purposes.

CC BY-NC-ND

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

This license is one of the most restrictive Creative Commons licenses. Users can share only your unmodified dataset if they give credit to you and do not share it for commercial purposes. Users cannot make any additions, transformations or changes to your dataset under this license.

CC BY-NC-SA

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International

This license is one of the most restrictive Creative Commons licenses. Users can share your dataset only if they (1) give credit to you, (2) do not use your dataset for any commercial purposes, and (3) distribute any additions, transformations or changes to your dataset under this license. We consider this license a viral license since users will need to share their work on your dataset under this same license and any users of the adapted dataset would likewise need to share their work on the adapted dataset under this license and so on for any other changes to those modified datasets.

CC BY-ND

Creative Commons Attribution-NoDerivatives 4.0 International

This license is one of the more restrictive Creative Commons licenses. Users can share your dataset if they give credit to you, but they cannot make any additions, transformations or changes to your dataset under this license.

CC-0

Creative Commons Public Domain Dedication

This license is one of the open Creative Commons licenses and is like a public domain dedication. It allows you, as a dataset owner, to use a license mechanism to surrender your rights in a dataset when you might not otherwise be able to dedicate your dataset to the public domain under applicable law.

CC-BY

Creative Commons Attribution 4.0 International

This license is one of the open Creative Commons licenses and allows users to share and adapt your dataset so long as they give credit to you.

CC-BY-SA

Creative Commons Attribution-ShareAlike 4.0 International

This license is one of the open Creative Commons licenses and allows users to share and adapt your dataset so long as they give credit to you and distribute any additions, transformations or changes to your dataset under this license. We consider this license (a.k.a. a viral license) problematic since others may decide not to work with your CC-BY-SA licensed dataset if there is a risk that by doing so their work on your dataset will need to be shared under this license when they would rather use another license.

CDLA-Permissive-2.0

Community Data License Agreement – Permissive, Version 2.0

This Community Data License Agreement is similar to permissive open source licenses such as the MIT license. It allows users to use, modify and adapt your dataset and the data within it, and to share it. The CDLA-Permissive-2.0 terms explicitly do not impose any obligations or restrictions on results obtained from users’ computational use of the data. The 2.0 version is significantly shorter and uses plain language to express the grant of permissions and requirements. The only obligation is to "make available the text of this agreement with the shared Data," including the disclaimer of warranties and liability.

CDLA-Sharing-1.0

Community Data License Agreement – Sharing, Version 1.0

This license is one of the Community Data License Agreement licenses and was designed to embody the principles of "copyleft" in a data license. It allows users to use, modify and adapt your dataset and the data within it, and to share the dataset and data with their changes so long as they do so under the CDLA-Sharing and give credit to you. The CDLA-Sharing terms explicitly do not impose any obligations or restrictions on results obtained from users’ computational use of the data.

Classroom

A classroom is a type of organization you can set up in data.world so you and your students can upload datasets, create projects, discuss, and share insights. A classroom includes unlimited private projects & datasets, 1GB per project/dataset, & up to 100 members, so it's a perfect way to collaborate with any group that needs to learn together.

Columns

Data in tabular format is arranged into rows and columns. Columns represent data of the same type across all the records.

Community

The data.world community includes every person who uses the platform whether enterprise, educational, or individual.

Content contributor

A Content Contributor is a person in an organization who can create and interact with the organization's projects and datasets.

Contributor

A Contributor is a person who is invited to access a dataset or project. Contributor permissions can be set to Discover only, View only, Edit (view and edit), or Manage (view, edit, and manage).

Created and Updated Date

Created and updated are two operators that can be used to find datasets, projects, insights, users and organizations based on the date they were added or last updated. Timestamps are set in UTC, not your local time, so depending on where you are, you might get results that are a day off from your local date.

Creator

The creator of a dataset or project is the individual who creates it. The creator can be different from the owner (see owner for more details). The distinction between owner and creator is important for organizations as the owner manages a resource with the same privileges as the creator, but owners can be changed (as personnel changes) while creator is a static entry.

Crowdsourced data

An organization can be configured so that an individual outside the organization can propose that the organization own a dataset created by the individual. Datasets created in this way are called crowdsourced data.

CSV

Comma-separated values (CSV) is a file format used to store tabular data as text. Commas are used to separate the data into columns of the same data type, and line breaks are used to separate it into records or rows.

Data

Data is just information, and it can take many forms from images to spreadsheets. Data in data.world can be in any file format.

Database

A structured set of data held in a computer, especially one that is accessible in various ways.

Data dictionary

The data dictionary contains all the metadata (data about the data) for the files, tables and columns in a dataset. For all files it contains:

The names of all the files in the dataset, a place to add descriptions for each file, and the labels for each file. For tabular files it has: The column names, the format of the data in each column, and a place to add a description for each column.

Data inspector

When data is ingested into data.world, the Data Inspector evaluates it to rapidly diagnose issues with it. The inspector does not examine data brought in through a live connection, only data uploaded to data.world.

Data sources

A data source is any place you can get data from, including databases, local files, cloud-based files, real-time sources like log files, SaaS data, URLs, and corporate networks.

Dataset

Datasets are where all data is stored and documented for later sharing and use in projects. A dataset is the basic repository for data files and associated metadata, documentation, scripts, and any other supporting resources that should be stored alongside the data.

Description fields

Datasets, projects, all the files in each, and all the columns in any structured data files have description fields associated with them. Descriptions are very short and serve as a quick reference for the item they describe.

FAQ

Frequently Asked Questions

A document format consisting of questions and answers.

Glossary

A glossary is an alphabetical list of terms or words found in or relating to a specific subject with explanations; a brief dictionary.

Graph database

A graph database is a database that uses graph structures for semantic queries with nodes, edges and properties to represent and store data.

Insights

Findings, conclusions, and interesting points for discussion about a project are stored as insights in the project.

Integration

An application or program that connects to data.world in order to transport, manipulate, sync, or share data and analyses of the data.

JSON

JavaScript Object Notation

JSON (pronounced jay-saun) is a language-independent, open standard file format, and data interchange format, that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and array data types (or any other serializable value).

Resources

Your resources are the datasets and projects owned by you or your organization(s).

License

data.world allows you to specify how you allow data you own to be used by others.

License type

By providing a license, you are setting expectations about how you want your data to be used. You can think of a license as the Terms of Use for your data.

Markup language

A markup language is a computer language that uses tags to define elements within a document. It is human-readable, meaning markup files contain standard words, rather than typical programming syntax. The two most common markup languages are HTML and XML.

Metadata

According to the National Information Standards Organization (NISO), metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information.

Metadata catalog

An organized list containing all the information about your data resources. For example, the source, the type, the location, the owner, the update and creation dates, descriptions of the resource, etc.

Metamap

A graph-based data repository containing the metadata about all public datasets stored in data.world.

ODC-BY

Open Data Commons Attribution License

This license is one of the Open Data Commons licenses and allows users to share and adapt your dataset so long as they give credit to you.

ODC-ODbL

Open Data Commons Open Database License

This license is one of the Open Data Commons licenses and allows users to share and adapt your dataset so long as they give credit to you and distribute any additions, transformations or changes to your dataset under this license. We consider this license (a.k.a. a viral license) problematic since others may decide not to work with your ODC-ODbL licensed dataset if there is a risk that by doing so their work on your dataset will need to be shared under this license when they would rather use another license.

OKTA

Cloud software that helps companies manage and secure user authentication into modern applications, and helps developers build identity controls into applications, websites, web services and devices. Provides secure identity management with Single Sign-On, Multi-factor Authentication and Lifecycle Management (Provisioning).

Organization

A group on data.world that you belong to which determines what data resources you can see and edit.

Owner

When a dataset or project is created the person creating it is the creator, but the owner can be designated as either the person who created it, one of the organizations in which the creator is a member, or an organization that accepts ownership proposals. The owner has all the same permissions for management and editing of the dataset or project that the creator has.

PDDL

Open Data Commons Public Domain Dedication and License

This license is one of the Open Data Commons licenses and is like a public domain dedication. It allows you, as a dataset owner, to use a license mechanism to surrender your rights in a dataset when you might not otherwise be able to dedicate your dataset to the public domain under applicable law.

Platform

The data.world application is also referred to as the platform.

Project

Projects are where all querying, analysis and discussion of data takes place in data.world. Data in different datasets can be used for many different projects, but each project contains all and only the data that is relevant for that project. The information in a project can come from datasets, files attached directly to the project, insights written by the project's team members about the data and the project, and discussions about the project.

Public API

The public API is used to create an integration or application with data.world. The API can also be used to get data out of data.world.

Public Domain

Public Domain License

The work has been dedicated to the public domain by waiving all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

Query

A statement written to retrieve information from a dataset on data.world. Queries can be written in SQL or SPARQL.

Quick start guide

A quick start guide is a short, hands-on type of documentation derived from tutorials and designed to quickly get users comfortable with basic use of the data.world platform.

RDF

Resource Description Framework

RDF represents information using semantic triples, which comprise a subject, predicate, and object. Turtle provides a way to group three URIs to make a triple, and provides ways to abbreviate such information, for example by factoring out common portions of URIs.

RDF triple store

An RDF triple store is similar to a graph database and stores information in semantic triples. It is accessed and manipulated using the SPARQL query language.

Reference

A type of documentation that includes tables, lists, glossaries, appendices, etc. It is informational, not instructional, in format and is not hands-on.

SAML

Security Assertion Markup Language

An open standard for exchanging authentication and authorization data between parties, in particular, between an identity provider and a service provider. SAML enables single sign-on (SSO).

Share-alike license

If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

SPARQL

SPARQL Protocol and RDF Query Language

Pronounced "sparkle", SPARQL is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in RDF format.

SQL

Structured Query Language

SQL is a language used to access and manipulate relational database management systems.

SSO

Single Sign-on

A property of access control of multiple related, yet independent, software systems. With this property, a user logs in with a single ID and password to gain access to any of several related systems.

Streams

Streams are a type of input (jsonl) that allows you to update and append records to a data file on data.world instead of having to re-upload the entire file when changes need to be made.

Summary

The summary is one of two documents created with a dataset or project. The summary is where all of the information about the origin of the data, why you created the dataset, further documentation of your work, etc. is found. Use the Summary section to tell your data's story.

Tag

Tags can be used to organize and group your dataset or project by topic, category, source, department, or team. They can be searched for explicitly with the tag search operator, and can also help to filter down more generic search results.

Team

A team is a group of people working on a project. A team could be an organization or a subset of an organization.

Title

The name of the dataset or project. Titles are accessible via search.

Triple

AKA Semantic triples

A triple is a set of three entities that arranges a statement about semantic data in the form of subject–predicate–object expressions. Each item in the triple is expressed as a Web URI.

TTL or Turtle

Terse RDF Triple Language

Terse RDF Triple Language (Turtle) is a syntax and file format for expressing data in the RDF data model. Turtle syntax is similar to that of SPARQL. Turtle provides a way to group three URIs to make a triple, and provides ways to abbreviate such information, for example by factoring out common portions of URIs.

Tutorial

One of our four types of documentation is a tutorial. Tutorials are instructional, in depth, and hands-on. A variation on the tutorial is a quick start which is a shorter, derivative version of a tutorial.

URI

Uniform Resource Identifier

A string of characters that unambiguously identifies a particular resource. To guarantee uniformity, all URIs follow a predefined set of syntax rules but also maintain extensibility through a separately defined hierarchical naming scheme (e.g. http://). The most common form of URI is the Uniform Resource Locator (URL), frequently referred to informally as a web address.

White paper

A high-level, but very technical document. It is informational, not instructional, in format and is not hands-on.
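
The CSV, JSON, and Streams entries in the glossary above describe related text formats: CSV separates columns with commas and records with line breaks, while a stream takes its input as JSON Lines (jsonl), one JSON object per line. The sketch below shows how the same records look in each form using only the Python standard library. The sample data, column names, and helper names are invented for illustration, and the code does not call any data.world API.

```python
import csv
import io
import json

# Hypothetical sample data; the column names are invented for illustration.
CSV_TEXT = "city,population\nAustin,964000\nDallas,1300000\n"


def csv_to_records(text):
    """Parse CSV text into a list of dicts, one per row, keyed by column name.

    The csv module does not infer types, so every value stays a string.
    """
    return list(csv.DictReader(io.StringIO(text)))


def records_to_jsonl(records):
    """Serialize each record as one JSON object per line (JSON Lines)."""
    return "".join(json.dumps(record) + "\n" for record in records)


records = csv_to_records(CSV_TEXT)
jsonl = records_to_jsonl(records)
print(jsonl, end="")
```

Because each line of the JSON Lines output is a complete record on its own, new records can be appended without rewriting the lines that came before, which matches the append-style updates that the Streams entry describes.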

FAQ

Can I change my user name?

We don't currently offer the ability to change your username within data.world. However, there are a couple of workarounds:

  1. Create a new account using a different email address. If you'd like the initial account removed once you've created your new account, just submit a request for us to do so. Once removed, you could then go into your account settings to update your email if desired.

  2. Submit a request for us to delete your account, which will free up your email address so you'll be able to create a new account with the preferred username.

Note that both of these options will remove all content and social activity (likes, follows, etc.) associated with the account being deleted. Please be sure to back up your work and be ready to recreate it under the new account.

Can I update a file in my dataset?

Yes! To update a file on data.world simply upload the updated version with the same name and we will overwrite the existing one with the new version.

Note that we also store previous versions of your files so you can always revert to them if you need to.

How do I delete my account?

To remove or cancel a data.world account, you'll currently need to submit a request for data.world to manually remove it.

We hope to allow members to manage this in the future, but until then are happy to help with your direct request and also appreciate any final feedback you might share with us as part of it.

Note that upon deletion, all content stored under your account will be removed and your username will be back up for grabs by new members.

How much will data.world cost me?

data.world is free for individuals and small teams to discover and use open data, as well as create and collaborate on their own Datasets and Data Projects up to a specific size and number.

In line with data.world's mission to build the most meaningful, collaborative, and abundant data resource in the world, there is no limit to the number of public Datasets or Projects created and we encourage all of our members to help in this mission by adding open datasets they're building or working with!

For members and teams who need additional limits and features beyond what our free tier provides, please see our pricing page for details on the available options.

What are the size limits for data.world?

The data.world team is hard at work extending the boundaries of the platform. Limits depend on your account plan (free or paid). A list of what we currently support can be found in the article on data size limits.

What file types can I upload?

There is no restriction on the file types that can be uploaded to or downloaded from data.world, and a dataset can consist of any combination of files added to it. There are some size limitations, and files are handled differently based on their extension.

What's the difference between Open and Private?

When creating a dataset or Data Project, you're given the option between open and private. Open datasets and projects will be visible, in their entirety, to anyone signed into data.world. They can be returned in search results, will be visible under your profile, and will be available for querying and downloads. No other members will be able to change the dataset or project unless you explicitly add them as a contributor with edit rights. If you are in an organization, you have additional options for determining who can access and use your data. For more information, see the article on setting dataset permissions.

If you can neither upload nor download data from data.world you might be behind a firewall that's blocking your access. If you think that might be the problem, try performing the same tasks on a different network, such as your home internet connection. You can find information about configuring your network firewall to accept data.world connections in Allowlist for data.world.

File upload status messages

Below is a list of status messages you might encounter when uploading data files to data.world. Please open a support ticket for additional assistance.

Error message

More details

No data could be extracted from this file **

This status will display if a file type is supported by data.world, yet cannot be previewed.

Check for syntax or formatting errors within your file.

Want to see data previews? Reupload this file with an extension.

Currently, data.world depends on file extensions to determine how best to prepare your data. If a file is uploaded with no extension, then you will see this status message.

If you believe this file’s data is actually a known format (say, .csv), then re-upload this file with the new extension added.

Excel files >100MB may only be downloaded.**

Due to how Excel files are structured, in some cases we are not able to fully preview the data inside the file. It is, however, still available for sharing and download.

This file type >100MB can only be downloaded.

The file is too large to properly ingest into data.world and is unavailable for queries or previews.

This file is shareable, though some advanced features may be unavailable due to its size.**

This status indicates that a file contains more cells of data than data.world was expecting. In some cases, you might be able to remove any unnecessary blank columns, rows or tabs.

Only the first 50 of 111 files were extracted.

When uploading archived or compressed files (zip, tar, etc.), ensure each contains 50 files or fewer. Any files over this limit will not be extracted.

2 files were too large to be extracted from this archive. **

If a file within an archive exceeds data.world's data limits, we will show this status.

Try splitting the file into multiple smaller files within our size limits, then reupload.

Sorry, we can't extract the contents of this archive. It may be corrupted.

The archive cannot be extracted for another reason: it may be an invalid archive or an unsupported file type.

No data could be extracted from this file.

The file is of a supported type, but has a structural problem that prevents it from being extracted.

This file is shareable, though some advanced features may be unavailable due to the size of this dataset. **

If a data file is uploaded to a dataset that results in the total dataset exceeding what data.world can process, this status will be displayed.

Check for and remove any unnecessary blank columns, rows or tabs from all tabular files within the dataset, or contact support for further assistance.

** Note that these errors are related to enhancing tabular and graph data to provide advanced functionality (data previews and queries). The file will still be uploaded to data.world and be available for download.
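Two of the statuses above (a missing file extension and an archive with more than 50 files) can be anticipated before uploading. The following is a minimal sketch of such a local check, not a data.world tool: the function name `preupload_warnings` is hypothetical, it only inspects zip archives, and it assumes directory entries do not count toward the 50-file limit.

```python
import zipfile
from pathlib import Path

# Limit taken from the status message above; the helper itself is hypothetical.
MAX_ARCHIVE_FILES = 50


def preupload_warnings(path):
    """Return a list of warnings for a file that is about to be uploaded."""
    path = Path(path)
    warnings = []
    if path.suffix == "":
        warnings.append("no file extension: previews will be unavailable")
    if path.suffix.lower() == ".zip" and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # Assumption: only regular files count, not directory entries.
            count = sum(1 for info in archive.infolist() if not info.is_dir())
        if count > MAX_ARCHIVE_FILES:
            warnings.append(
                f"only the first {MAX_ARCHIVE_FILES} of {count} files would be extracted"
            )
    return warnings
```

Calling `preupload_warnings("export.zip")` returns an empty list when neither condition applies; splitting a large archive into several smaller ones is one way to clear the second warning.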

Finding help

We offer a number of different help resources for data.world members, including a Slack channel, documentation portal, and a blog with great content. Here we've included details on each.

Slack

The data.world Slack community can be a wealth of knowledge, and even includes many of the data scientists and developers building the data.world platform. If you have questions, especially ones that go beyond bug reports and functionality requests, you should stop by and engage some of the expert users of the platform.

To request an invite, please visit our Slack sign up page and enter your email address.

Follow the instructions to be invited. When your account is active, you can visit the Slack sign-in page to jump in to participate in the conversation!

Blog: Distinct Values

Our blog, Distinct Values, is a collection of content related to data catalogs, cultures, and communities. We strive to provide thought leadership, interesting news about our platform, and exciting happenings related to data analysis and visualization.

Documentation portal

data.world has a robust documentation portal that can help with many of the tasks and questions that you may encounter on the platform. In addition to our docs portal we also have documentation on:

Still need help?

If you can’t find what you need, please contact our support team using one of the following methods:

A guide to icons

Here is a list of the icons and the extensions associated with them for popular file types on data.world:

  • Tabular: csv, xlsx, xls, json, jsonl, tsv, txt

  • Graph: ttl, rdf, nt, n3

  • Document: md, doc, docx, txt, rtf, pdf, ppt, gslides, gdoc

  • Image: jpg, jpeg, png, gif, svg, vg.json, vl.json

  • Archive: zip, gz, tar, tgz

  • Script: py, ipynb, r, rmd, sas, js, feather, css, html, htm, rproj, rdata

  • Query: sql and sparql queries (native to dw)

  • Geo: kml, shp, shx, cpg, prj, geojson, atx

  • Non-tabular data: sqlite, nested json

  • Generic: anything not listed above or when no file type has been given/inferred
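The mapping above can be expressed as a small lookup. The sketch below is illustrative only and is not how data.world itself assigns icons: the function name `icon_category` is hypothetical, an extension listed under two categories (txt) keeps the first category listed, queries are left out because they are native objects rather than uploaded files, and nested JSON cannot be told apart from tabular JSON by name alone.

```python
# Extension lists copied from the guide above. The tie-breaking rule for
# extensions listed twice (first category wins) is an assumption.
ICON_CATEGORIES = {
    "Tabular": ["csv", "xlsx", "xls", "json", "jsonl", "tsv", "txt"],
    "Graph": ["ttl", "rdf", "nt", "n3"],
    "Document": ["md", "doc", "docx", "txt", "rtf", "pdf", "ppt", "gslides", "gdoc"],
    "Image": ["jpg", "jpeg", "png", "gif", "svg", "vg.json", "vl.json"],
    "Archive": ["zip", "gz", "tar", "tgz"],
    "Script": ["py", "ipynb", "r", "rmd", "sas", "js", "feather", "css",
               "html", "htm", "rproj", "rdata"],
    "Geo": ["kml", "shp", "shx", "cpg", "prj", "geojson", "atx"],
    "Non-tabular data": ["sqlite"],
}

EXTENSION_TO_CATEGORY = {}
for category, extensions in ICON_CATEGORIES.items():
    for ext in extensions:
        # setdefault keeps the first category an extension appears under.
        EXTENSION_TO_CATEGORY.setdefault(ext, category)


def icon_category(filename):
    """Return the icon category for a file name, or "Generic" if unknown."""
    parts = filename.lower().split(".")
    # Try the longest suffix first so "chart.vg.json" matches "vg.json"
    # before falling back to "json".
    for start in range(1, len(parts)):
        candidate = ".".join(parts[start:])
        if candidate in EXTENSION_TO_CATEGORY:
            return EXTENSION_TO_CATEGORY[candidate]
    return "Generic"
```

For example, `icon_category("chart.vg.json")` yields `"Image"`, while a file with no extension falls through to `"Generic"`.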

The icon set covers the following resource and file types:

  • SPARQL query

  • SQL query

  • Project

  • Dataset

  • Tabular file

  • Image file

  • PDF

  • Insight

  • Document

  • Graph

  • Script

  • Dashboard

  • Tableau file

  • Non-tabular data

  • Archive file

  • Geo

  • Generic file

Finding your API tokens for data.world

When you need an API token for a third-party application or data.world's metadata catalog collector, you can get it from your profile settings. Click on your avatar and choose Settings, then select Advanced from the sidebar.

Both Read/Write and Admin tokens are provided. For the metadata catalog collector, you can use the Read/Write token if you have write permissions to your organization's ddw-catalogs dataset.
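Once you have a token, a client typically sends it as a bearer token in the Authorization header. The sketch below shows that pattern with Python's standard library only. The base URL, the `/user` path, and the `DW_AUTH_TOKEN` environment variable name are assumptions for illustration and should be checked against the API documentation; `build_request` and `fetch_status` are hypothetical helper names.

```python
import os
import urllib.request

# Assumed base URL for illustration; confirm it in the API documentation.
API_BASE = "https://api.data.world/v0"


def build_request(path, token):
    """Build (but do not send) an authenticated GET request."""
    return urllib.request.Request(
        f"{API_BASE}{path}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
        method="GET",
    )


def fetch_status(path):
    """Send the request and return the HTTP status code."""
    # Read the token from the environment so it never lands in source control.
    token = os.environ["DW_AUTH_TOKEN"]
    with urllib.request.urlopen(build_request(path, token), timeout=60) as resp:
        return resp.status
```

Keeping the token in an environment variable rather than in a script makes it easier to rotate and harder to leak by accident.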

Licensing and data you found

I've found an interesting dataset and want to put it on data.world. Can I do that?

You'll need to check the licensing terms on that dataset to see if you are authorized by the owner to distribute, re-post, re-publish or share it. If those terms allow you to do these things, you'll also need to review and comply with the conditions under which you can do so. We've put together a list of common licenses for datasets with links to the license terms here.

If the dataset is available to the public on the Internet, why do I need to check and comply with the terms?

Even if datasets are publicly available, their owners can continue to have rights in those datasets. Those rights extend to how the data is organized, displayed, described, visualized, etc. and can include the effort in compiling the data. These intellectual property rights need to be respected. To do so, make sure that you read and comply with the license terms on the dataset.

What happens if I don't comply with a dataset's license or terms?

If you don't comply with the license and terms of use on a dataset, you could be found to be in breach of contract and/or violation of copyright law. For example, if you are found by a court to have violated US copyright law, you would have to pay damages set by law without the owner of the copyright having to prove he or she suffered financially from your actions.

You could also be in violation of our terms of use by not having the right to post a dataset to data.world, including if you don't specify the appropriate license on a dataset, and you and/or the dataset could be removed from our platform.

Where can I find a dataset's licensing terms and conditions?

Sometimes finding the license terms on a dataset can be difficult. You can look for them:

  • On the main webpage

  • On the page where the summary or description of the dataset is located

  • On the download page of the dataset

  • In the terms of use or terms of service located in the footer of the webpage

  • Under "legal" in the footer of the webpage

But I can't find those license terms. Now what?

If, after searching the site where you found the dataset, you can't locate any terms or licenses that cover it, you can reach out to the owner to see if he or she will give you permission to use the dataset or put a license on the dataset on the site. A dataset that does not have any license terms means the owner retains all rights in the dataset and does not authorize anyone else to use, copy, distribute, share, combine it with other data, or make any changes to it or derivative works from it.

What about fair use?

Fair use is a tricky area. If you use copyrighted materials in a certain way that complies with the fair use doctrine, you might not be infringing on the copyright. However, courts look at the specific circumstances of the usage, so even if your usage is similar to how others have used copyrighted materials, there is no guarantee that a court will find that you have not violated someone's copyright, since your circumstances may be different.

The US Copyright Office has summarized Section 107 of the US Copyright Act.

Section 107 provides the framework for determining whether something is a fair use and identifies certain types of uses—such as criticism, comment, news reporting, teaching, scholarship, and research—as examples of activities that may qualify as fair use. Section 107 calls for consideration of the following four factors in evaluating a question of fair use:

  • Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes: Courts look at how the party claiming fair use is using the copyrighted work, and are more likely to find that nonprofit educational and noncommercial uses are fair. This does not mean, however, that all nonprofit education and noncommercial uses are fair and all commercial uses are not fair; instead, courts will balance the purpose and character of the use against the other factors below. Additionally, "transformative" uses are more likely to be considered fair. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.

  • Nature of the copyrighted work: This factor analyzes the degree to which the work that was used relates to copyright's purpose of encouraging creative expression. Thus, using a more creative or imaginative work (such as a novel, movie, or song) is less likely to support a claim of a fair use than using a factual work (such as a technical article or news item). In addition, use of an unpublished work is less likely to be considered fair.

  • Amount and substantiality of the portion used in relation to the copyrighted work as a whole: Under this factor, courts look at both the quantity and quality of the copyrighted material that was used. If the use includes a large portion of the copyrighted work, fair use is less likely to be found; if the use employs only a small amount of copyrighted material, fair use is more likely. That said, some courts have found use of an entire work to be fair under certain circumstances. And in other contexts, using even a small amount of a copyrighted work was determined not to be fair because the selection was an important part—or the "heart"—of the work.

  • Effect of the use upon the potential market for or value of the copyrighted work: Here, courts review whether, and to what extent, the unlicensed use harms the existing or future market for the copyright owner's original work. In assessing this factor, courts consider whether the use is hurting the current market for the original work (for example, by displacing sales of the original) and/or whether the use could cause substantial harm if it were to become widespread.

In addition to the above, other factors may also be considered by a court in weighing a fair use question, depending upon the circumstances. Courts evaluate fair use claims on a case-by-case basis, and the outcome of any given case depends on a fact-specific inquiry. This means that there is no formula to ensure that a predetermined percentage or amount of a work—or specific number of words, lines, pages, copies—may be used without permission.

Licensing and data you own

Why license your dataset?

If your dataset does not have any license terms, it means you do not authorize anyone else to use, copy, distribute, share, combine it with other data, or make any changes to it or make derivative works from it. This absence of a license greatly reduces the reuse potential and usefulness of your dataset.

We encourage you to pick as open a license as you feel comfortable with, to maximize the benefits of your dataset. We believe the more open a license is, the more others will use your dataset. For more information on the details of licenses, see our list of common license types for datasets.

Common license considerations

Choose an established and current license

By choosing an established license like one from our list of common license types, you are choosing a license that is widely adopted. Such licenses were drafted by organizations dedicated to making those licenses functional in many situations as well as making them interoperable, clear and understandable. You'll need to read the actual licenses by clicking on the links we've provided to make sure you've picked the appropriate one for your dataset and how you would like others to interact with your dataset.

Consider how you want others to use your dataset

The more open a license you choose, the more others can use, share and distribute your dataset to get to insights faster. Your dataset could be important to solving a pressing issue. We encourage you to maximize your dataset's potential by choosing an open license.

Consider the results of a data project

When a project involves a number of datasets, each with different licenses, the licenses may conflict and greatly restrict or even prohibit the resulting work. By choosing the most open license, you amplify your dataset's usefulness. Another tip is to review the licenses of the other datasets that may be involved in a project or used in your industry to determine what type of license would allow your dataset to be used alongside those datasets. Usually, two datasets that both carry CC-BY licenses can be combined under those license terms. However, you will still need to pay attention to the different versions of those licenses to make sure they work with one another. In addition, licenses that look similar, such as CC-BY and ODC-ODbL, can still conflict, so datasets under them cannot necessarily be combined.

Our recommendation

We like the current versions of the open Creative Commons licenses, since these licenses are widely adopted, are applicable to databases, and facilitate collaboration. We believe these licenses are becoming more widely accepted for datasets and databases. In addition, Creative Commons has created a tool to help you choose the appropriate license for your dataset.

For instructions on how to set the license type for a dataset, see Setting a license type.

To help determine the license to select, see Common license types for datasets.

Found a dataset you'd like to share on data.world? Check out Licensing and data you found.

Notifications

To help you stay on top of what's happening with your data and in your organization, data.world provides a variety of notifications in different formats to various users. To make the notification process more transparent, we have the following tables which lay out the relationships between user and organization permissions, activity in the platform, and notification formats.

Query editor shortcuts

Query editor shortcuts for both SQL and SPARQL are available on data.world. Below is a list of the supported commands for Mac and Windows:

  • command + option + L (ctrl + alt + L on Windows): Automatically reformat your query to make it more readable.

  • command + shift + enter (ctrl + shift + enter on Windows): Automatically reformat and run your query.

  • command + enter (ctrl + enter on Windows): Run your query.

  • command + S (ctrl + S on Windows): Save your query.

Size limit and timeout specifications

Size limits

| Account type | Individual/Team Free | Individual/Team Professional | Enterprise |
|---|---|---|---|
| Dataset ingested to data.world | 100 MB | 1 GB | 1 GB |
| Metadata management datasets | n/a | n/a | no limit |
| Project | 100 MB of project-specific files, no limit on linked datasets | 1 GB of project-specific files, no limit on linked datasets | 1 GB |
| Derived dataset | 100 MB | 1 GB | 1 GB |
| Virtual dataset (hosted on a remote server) | Size not limited by data.world | Size not limited by data.world | Size not limited by data.world |
| Size of a file in a dataset | 100 MB | 1 GB | 1 GB |
| Number of files in a dataset | 250 | 250 | 250 |
| Number of columns in a table | limited by file size | limited by file size | limited by file size |
| Number of columns previewed in a table | 50 | 50 | 50 |
| Number of columns previewed in query results | 500 | 500 | 500 |
| Number of rows in a table | limited by file size | limited by file size | limited by file size |
| Number of rows previewed in query results | 10,000 | 10,000 | 10,000 |
| Rate limiting: number of burst streams API calls | 5 in the first second or after a 5 second idle period, then 1 per second | 5 in the first second or after a 5 second idle period, then 1 per second | 5 in the first second or after a 5 second idle period, then 1 per second |
| Size of a record streamed | 1 MiB | 1 MiB | 1 MiB |
| Size of a request streamed | 100 MB | 1 GB | 1 GB |
| Number of JSON objects in a stream | 100 MB divided by the average record size | 1 GB divided by the average record size | 1 GB divided by the average record size |
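The streams rate limit listed above (a burst of 5 calls, then 1 per second, with the burst available again after 5 idle seconds) behaves like a token bucket with a capacity of 5 that refills at 1 token per second. The sketch below is a client-side throttle built on that reading. It is an approximation for pacing your own calls, not a description of how data.world enforces the limit, and the class name `TokenBucket` is chosen here for illustration.

```python
import time


class TokenBucket:
    """Client-side throttle: a burst of `capacity` calls, then `rate` per second."""

    def __init__(self, capacity=5, rate=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock  # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        """Return True and consume a token if a call is allowed right now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A caller would check `try_acquire()` before each streams request and sleep briefly when it returns False. With the defaults, a fully drained bucket is full again after 5 seconds without calls, which matches the idle period in the limit.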

Timeouts

| Account type | Individual/Team Free | Individual/Team Professional | Enterprise |
|---|---|---|---|
| Query timeout before first byte is transmitted | 1 minute | 1 minute | 1 minute (upgrade to 5 minutes available upon request) |
| Query timeout before last byte is transmitted | 60 minutes | 60 minutes | 60 minutes |
| Data upload timeout | None: as long as packets continue to be passed, the connection will stay open | None: as long as packets continue to be passed, the connection will stay open | None: as long as packets continue to be passed, the connection will stay open |