Base platform quickstart

At data.world we offer organizations and teams private, secure environments in which they can collaborate. With data.world, you and your team can put data and insights in the hands of those who need them while keeping valuable work from getting lost in inboxes or ad hoc conversations. As a result, your team's analysis and work will be more reproducible, reusable, and available for further collaboration.

data.world supports the largest open data community in the world, where we’re excited to connect members with a vast collection of scientific research, government, and demographic data, as well as other members who are interested in similar data work so they can join forces to solve real problems faster.

Whether you’re working in an open or closed setting, data on data.world can be brought together in one of two ways:

  1. Datasets help make your data accessible, reusable, and understandable. They contain the data and any additional metadata, scripts, or files that are related to that data. Datasets are the building blocks of data projects.

  2. Projects help organize your data, documentation, and output in one place when working on a specific question, project, or analysis to create reproducible work outside of the black box.

In this guide, we’ll walk through the basics of data.world, covering features and functionality to take you through your first data project from start to finish. Before getting started, make sure you’ve set up your account and joined any organizations you’ve been invited to. Next, we’ll walk through finding and adding data.

You can find thousands of free, open data resources on data.world to work with, but quite often you’re going to bring in your own data to use.

When adding data to data.world, you’ll typically want to create a dataset. A dataset is simply a repository of data including data files and associated metadata, documentation, scripts, and any other supporting assets that should be stored alongside the data. Datasets can be created manually, which is what we’ll walk through here, and we also have automated options for working with larger datastores (contact us for more details).

To create a dataset, click on the + New dropdown on the right side of the header bar and select New dataset:

360011707554-mceclip0.png

After you've opened the new dataset you will:

  • Add a title (up to 60 characters)

  • Select an owner (if prompted) - if your dataset is for an organization, we recommend creating it under the team account to keep your organization's work within a single library.

  • Choose the visibility of your dataset:

    • Private - only accessible to you

    • Organization - allows your team to view the dataset

    • Open - available to the data.world community

If you’re unsure which permissions to choose, we recommend starting private and adding contributors and increasing visibility as you go.

360011707654-mceclip1.png

Next, either drag and drop one or more files into the add data box, select the Add data button for additional source options, or save your dataset and add data at a later time.

There are multiple ways to add data and connect your data sources:

  • Upload from your computer, or select cloud storage services (Box, Dropbox, Google Drive)

  • Pull directly from URL or API

  • Connect your data via an integration (check out our super connectors if you don't see your datastore listed)

To upload files from your computer, either drag and drop the file(s) from your hard drive to the Add data window or select the Add data button for more options:

360011756153-mceclip2.png

So that data owners can fully document their data, data.world supports all file types. Use a script to clean the raw data? Upload it so others can see and build off your work. See the article on file types for more information about how data.world handles different file formats, and when you're ready, select the Create dataset button at the bottom of the page to continue.

When you upload data files (csv, xlsx, json, etc.) to data.world, behind the scenes we're actually converting any readable data to a graph database, so that, regardless of format, it becomes instantly previewable in a tabular format as well as queryable within data.world. More on that later, but due to how we process these files, we can also infer data types, high-level stats about the data, and any potential issues in the data.

After data.world processes your data file, the data inspector will automatically attempt to identify and display any potential issues, including blank cells, duplicate rows, and numeric values or string lengths that fall unusually far from the rest of their field, as measured by the standard deviation.
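To make these three kinds of checks concrete, here is a minimal sketch in Python. It is not data.world's implementation: the function name `inspect_rows`, the `z_threshold` parameter, and the sample data are all assumptions made for illustration, and the outlier rule (distance from the column mean, measured in standard deviations) is one common way to read "far outside the standard deviation".

```python
import csv
import io
import statistics


def inspect_rows(rows, z_threshold=3.0):
    """Report blank cells, duplicate rows, and numeric outliers (hypothetical helper)."""
    issues = {"blank_cells": [], "duplicate_rows": [], "outliers": []}

    # Blank cells: any missing, empty, or whitespace-only value.
    for i, row in enumerate(rows):
        for col, value in row.items():
            if value is None or value.strip() == "":
                issues["blank_cells"].append((i, col))

    # Duplicate rows: a row whose full set of values has already been seen.
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(row.values())
        if key in seen:
            issues["duplicate_rows"].append(i)
        seen.add(key)

    # Outliers: numeric values more than z_threshold standard deviations from the mean.
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        numeric = []
        for i, row in enumerate(rows):
            try:
                numeric.append((i, float(row[col])))
            except (TypeError, ValueError):
                continue  # non-numeric or blank cell
        if len(numeric) < 3:
            continue
        values = [v for _, v in numeric]
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        if stdev == 0:
            continue
        for i, v in numeric:
            if abs(v - mean) / stdev > z_threshold:
                issues["outliers"].append((i, col, v))
    return issues


# Made-up sample file with one blank cell, one repeated row, and one extreme value.
sample_csv = """id,city,amount
1,Austin,10
2,Dallas,11
3,Houston,9
4,,10
5,Austin,12
5,Austin,12
6,Dallas,10
7,El Paso,500
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
# A lower threshold is used here because the sample is tiny.
issues = inspect_rows(rows, z_threshold=2.0)
```

On this sample the sketch flags the empty `city` in the fourth row, the repeated row for id 5, and the `amount` of 500. With very small files a strict threshold can never trigger, which is why the example lowers it.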

The number of potential issues is flagged by the orange exclamation mark at the top of the data preview:

360011756233-mceclip3.png

Select the link to open the data inspector and then review the suggestions. If you feel the highlighted issues are not relevant, select Dismiss to prevent them from showing again:

360011756253-mceclip4.png

Any changes to the data will still need to be made in your local copy and re-uploaded, where we'll again process the file and report on any potential issues identified. As long as you upload a file with the exact same name, the previous version will be overwritten (although all versions are maintained and accessible!).

This is also the time for you to verify your data to ensure that the columns in your tables were assigned the correct formats on upload. The format of each column is shown to the left of its name in the table. If you click on the icon you will see a pop-up with more information on the field:

360011707954-mceclip5.png

If you wish to change any column formats and also jump into describing your data, select Edit in the top right of the pop-up, which will take you to the data dictionary.
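Why verify column formats at all? Automatic inference works from the values alone, so it can pick a format that is technically valid but not what you intended. The following Python sketch shows a simple inference heuristic; it is an assumption for illustration only (the function `infer_column_type` and its four type names are hypothetical), not a description of how data.world assigns formats.

```python
def infer_column_type(values):
    """Guess a column type from its string values (illustrative heuristic only)."""
    non_blank = [v.strip() for v in values if v and v.strip()]
    if not non_blank:
        return "string"

    def all_convert(convert):
        # True when every non-blank value can be parsed by `convert`.
        try:
            for v in non_blank:
                convert(v)
            return True
        except ValueError:
            return False

    if all(v.lower() in {"true", "false"} for v in non_blank):
        return "boolean"
    if all_convert(int):
        return "integer"
    if all_convert(float):
        return "decimal"
    return "string"


# Postal codes look numeric, so a value-based heuristic calls them integers.
zip_type = infer_column_type(["02134", "10001", "73301"])
price_type = infer_column_type(["1.5", "2", "3.25"])
city_type = infer_column_type(["Austin", "Dallas"])
```

Here `zip_type` comes out as `"integer"`, even though treating `02134` as a number would drop its leading zero. A column like this is a good candidate for changing to a text format in the data dictionary.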

Once you are confident that your data is accurate you are ready to document it. The more complete your documentation, the more useful and understandable your data will be when it comes time for you or others to use it.

The basic information about the entire dataset can be accessed by selecting the Settings tab or by clicking Edit from your dataset main page:

360011756313-mceclip6.png

Within the settings, you can:

  • Add a short, meaningful description to summarize your data

  • Add tags (individually or comma-separated)

  • Select the license type for your dataset (see the article on license types for more details)

  • Edit the visibility level

  • Manage auto-sync functions: if files are connected through a virtual connection or via a URL, choose to automatically refresh hourly, daily, or weekly

  • Select how to import compressed files

  • Delete the dataset

In addition to the metadata under the settings tab there are two more ways to document your dataset: the summary and the data dictionary. Both are accessed from the information box at the top of the overview page:

360011756333-mceclip7.png

Use the Summary section to tell your data's story:

  • Where did the data come from? Cite and link to your sources or include details for a 'citation request'. Not only does this give "credit where credit is due", it helps other members evaluate the data's suitability for their needs.

  • If you think a particular piece of context will be useful to other members in understanding the data, add it.

  • The best summaries cover the “who, what, where, when, why, and how” of the data.

  • Make it visually friendly with Markdown styling. It's easy to learn and goes a long way.

Screen_Shot_2018-09-17_at_9.55.33_AM.png

Back on the dataset main page, click View next to the data dictionary to further document the files and columns within a dataset. You can add descriptions and labels to the individual files, change column types, and add column descriptions:

360011756353-mceclip9.png

Select + Add a description under a filename to add a description and labels to the file:

360011708074-mceclip11.png

Within the above file view, select Column details to add detailed descriptions for each of the columns, particularly if your column names are obscure. You can also specify the data type:

360011708054-mceclip10.png

After uploading data to data.world, if your data contains a common, general type of information that data.world recognizes, we will suggest related information that you can use to enhance your data and research. We call this data matching, and you'll see it in the platform for things like postal codes, state abbreviations, and medical codes (see this query for a full list). It can also be set up to identify data specific to your organization, such as product or customer IDs, so you can more quickly pull supplemental data into your dataset.

If you see a green triangle on any columns after adding data to the platform, you know we've found a potential match:

360011871693-mceclip0.png

Clicking the corner will present a menu where you can review the matches data.world has identified and confirm whether the data in your original column matches the type data.world has identified.

360007578714-mceclip6.png

If you confirm a match by selecting Add matched column, data.world will display any related data that could also be pulled in due to your confirmed match:

360007578754-mceclip8.png

Note that adding matched or related fields will not change your underlying dataset and you can un-match them at any time. Find more details here.
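Conceptually, data matching resembles checking a column against a reference table and then looking up related values. The Python sketch below illustrates that idea under stated assumptions: the reference table `STATE_NAMES`, the helpers `match_rate` and `add_matched_column`, and the sample rows are all hypothetical and are not part of data.world.

```python
# Tiny illustrative reference table; a real one would cover every state.
STATE_NAMES = {"TX": "Texas", "CA": "California", "NY": "New York"}


def match_rate(values, reference):
    """Fraction of non-blank values that appear in the reference table."""
    non_blank = [v for v in values if v]
    if not non_blank:
        return 0.0
    return sum(v in reference for v in non_blank) / len(non_blank)


def add_matched_column(rows, source_col, reference, new_col):
    """Return new rows with a related column added; the input rows are not modified."""
    return [{**row, new_col: reference.get(row[source_col])} for row in rows]


rows = [
    {"store": "A", "state": "TX"},
    {"store": "B", "state": "CA"},
    {"store": "C", "state": "ZZ"},  # not a recognized abbreviation
    {"store": "D", "state": "NY"},
]
rate = match_rate([r["state"] for r in rows], STATE_NAMES)
enriched = add_matched_column(rows, "state", STATE_NAMES, "state_name")
```

Three of the four values are recognized, so `rate` is 0.75, and `enriched` carries a `state_name` column while `rows` is left exactly as it was. This mirrors the point above that matched fields sit alongside your data without changing it.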

Once you're ready to invite additional people or teams to your dataset to contribute to or view it, select Contributors from the dataset page and then click the Invite additional contributors link:

360011757434-mceclip12.png

You can search for and invite other data.world users by name or username or you can invite external users by entering their email address and sending them an invitation to join the platform. For organizations, you can quickly add your entire team by adding your organization as a contributor:

360011803613-mceclip13.png

There are three types of permissions you can assign to contributors: View, Edit, and Manage. Each provides a different level of control over the dataset:

360011803653-mceclip14.png

Now that you've added your data to the platform and allowed your team access to it, let's move on to working with the data on data.world!

Once your data is added to the platform, or you've utilized the robust search capabilities to find and discover open or team-specific datasets you have access to, you'll probably want to do something with them! This is where projects come into play.

While you can do many of the same things in a dataset, the real place to work with data on data.world is in a project. Projects allow you to connect to data from various datasets to combine and analyze them, document your analysis, post insights, share with others, and collaborate through discussions.

To kick off your project, select the + Add link on the top right of your screen and choose New project:

Screen_Shot_2018-08-23_at_12.39.25_PM.png

From there you will be taken to a basic Create a new project page where you'll configure:

  • Owner (if prompted): if your project is for an organization, we recommend creating it under the team account to keep your organization's work within a single library.

  • Project name

  • Project objective: projects are best when they start with a clear question or goal.

  • Project permissions:

    • Private - only accessible to you.

    • Organization - allows your team to view the project.

    • Open - available to the data.world community.

    If you’re unsure which permissions to choose, we recommend starting private and adding contributors and publishing out as you're ready.

Screen_Shot_2018-09-17_at_10.15.54_AM.png

Once your project has been created, add or link data to your project via:

  1. Finding and linking in existing data.world datasets

  2. Creating your own datasets and linking them into the project

  3. Adding data directly to the project (limit to data that is unique to that project and wouldn't make sense to reuse elsewhere)

To connect or add data, just click the Add data button from your project page or workspace and choose data.world dataset to browse for existing datasets, or New file to add a local project file:

360011757934-mceclip17.png

When you pull an existing dataset into your project, you're really linking to its original location. This means that all updates to the original data will be populated to your project automatically. It also means that you can't modify the underlying data or metadata within the project interface; however, you can use it in queries to create the desired output.

When adding new files, you can add data from your computer or cloud storage, URLs, and integrations, just as we covered in adding files to datasets.

Projects have metadata the same as datasets do, so document them in the same way as a dataset. For projects, the summary can be used to document your project and/or findings, keep track of your questions, to-dos, and further document your sources. Here are a couple of examples to demonstrate how your summary can be utilized: Exploring THOR and How is the federal government fighting the opioid epidemic?

Once you have curated your project files and resources, you can work with them in the project workspace. To get to the workspace, click the Launch workspace button on the upper right corner of your project page.

In the workspace you can manage your project files and data sources, modify the summary or data dictionary, view or download your project files, preview select file formats in-line, and write SQL or SPARQL queries against supported data files. Once you have output, either directly from a file or the results of a query, you can also connect them to external tools like Tableau, R, or Python through an integration.

360011758254-mceclip18.png

All data files are normalized so they're immediately queryable and joinable, whether they're similar formats or not. This lets you jump right into analyzing and querying to perform calculations, produce summaries, and manipulate data across many different formats and locations.

data.world supports SQL and SPARQL query languages. SPARQL is the query language for graph data, which is how all of the data is stored behind the scenes on data.world. SQL is by far the most widely known query language, so we’ll use that in these examples. Check out our SQL and SPARQL tutorials to learn more on each.
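As a rough illustration of what "queryable and joinable regardless of format" means, the Python sketch below loads a CSV source and a JSON source into tables and joins them with SQL. It uses an in-memory SQLite database purely as a local stand-in; the table names, columns, and sample data are assumptions, and data.world's own SQL dialect and storage differ from SQLite.

```python
import csv
import io
import json
import sqlite3

# Two sources in different formats (made-up sample data).
stores_csv = "store_id,city\n1,Austin\n2,Dallas\n3,Houston\n"
sales_json = (
    '[{"store_id": 1, "amount": 120.0},'
    ' {"store_id": 1, "amount": 80.0},'
    ' {"store_id": 3, "amount": 50.0}]'
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (store_id INTEGER, city TEXT)")
conn.execute("CREATE TABLE sales (store_id INTEGER, amount REAL)")

# Normalize both formats into plain rows before loading them.
conn.executemany(
    "INSERT INTO stores VALUES (?, ?)",
    [(int(r["store_id"]), r["city"]) for r in csv.DictReader(io.StringIO(stores_csv))],
)
conn.executemany(
    "INSERT INTO sales VALUES (:store_id, :amount)",
    json.loads(sales_json),
)

# Once both are tables, a single SQL query can join and summarize them.
query = """
    SELECT stores.city, SUM(sales.amount) AS total
    FROM stores
    JOIN sales ON sales.store_id = stores.store_id
    GROUP BY stores.city
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
```

The result lists Austin with a total of 200.0 and Houston with 50.0; Dallas has no sales rows, so the inner join leaves it out. On data.world the loading step is handled for you, so only the query portion is something you would write.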

To start writing a query, select New query from the menu in the left column or, when viewing a particular file, click the Query link at the top of the file display:

360011759254-mceclip19.png

From there you'll be taken to a new tab in the workspace with a pre-populated sample query. Clicking the Run query button on the right will return the results of your query:

360011805893-mceclip20.png

The sample query serves as a good starting point for you to use to build your own SQL queries. You can modify it to return a subset of the initial data or expand on it to start joining your other datasets and tables.
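The Python sketch below shows the kind of progression described above: start from a query that returns everything, then narrow it to the columns and rows you care about. The exact text of the pre-populated sample query is not shown in this guide, so `SELECT *` is an assumption, as are the `inspections` table and its data. SQLite is used only so the example runs locally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inspections (id INTEGER, city TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?)",
    [(1, "Austin", 92), (2, "Dallas", 71), (3, "Austin", 64), (4, "Houston", 88)],
)

# Assumed starting point: return every row and column from one table.
sample_query = "SELECT * FROM inspections"

# Narrowed version: keep only low scores, lowest first.
subset_query = """
    SELECT id, city, score
    FROM inspections
    WHERE score < 75
    ORDER BY score
"""

all_rows = conn.execute(sample_query).fetchall()
subset = conn.execute(subset_query).fetchall()
```

The first query returns all four rows, while the narrowed one returns only the two inspections scoring below 75.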

To reveal a list of the tables and columns in your project, select the left arrow button to the right of the Run query button. You can use this schema browser to quickly find, copy and paste column and table names into your query to avoid typos, as well as find quick stats on each field by clicking the 'i' next to it:

360011759754-mceclip21.png

When ready, name your query and save it for later use:

360011759854-mceclip22.png

When you save a query you are prompted to choose who can see the query, either anyone with access to the project or only you. Saved queries show up on the lower left of the workspace window and can be duplicated, edited, or removed by clicking on the three dots to the right of their names:

360011806593-mceclip23.png

All the files you open in the workspace stay open in tabs until you close them so you can easily navigate between them.

Once you have the desired results from your query, you can use the Download button to download them in a CSV or XLSX file, or use the Open in app option to connect them to one of our many integrations such as Tableau, Power BI, Google Data Studio, or your Python or R environment. You can also use the copy URL or embed code option to send a direct download link to someone (data.world login not required) or even embed the query results in a discussion or use as the URL source to power another file within your project or dataset.

360011761114-mceclip24.png

Note that all files, queries, datasets, and projects are their own API endpoints, so if you don't see an integration with your desired tool, you could still work with it via our REST API. Send us a message as well, as we're always working with partners to expand data.world integrations!
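If you would rather pull results into a script than use an integration, a direct download link can be read like any other CSV served over HTTP. The Python sketch below is a generic example using only the standard library; the function name `fetch_csv_rows` and the placeholder URL are assumptions, and it does not use or describe the data.world REST API itself.

```python
import csv
import io
import urllib.request


def fetch_csv_rows(url):
    """Download a CSV from a URL and return its rows as a list of dicts."""
    with urllib.request.urlopen(url) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical placeholder: replace with the link from the copy URL option.
# rows = fetch_csv_rows("https://example.com/your-query-results.csv")
```

Every value comes back as a string, so convert numeric columns before doing arithmetic on them. Links to private resources may require authentication, which this sketch does not handle.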

When you're ready to visualize data from a table or a query, pull it into one of the many supported integrations, or utilize data.world's Chart Builder integration.

Chart Builder is an open-source integration that uses Vega-Lite and provides a lightweight and easy way to create quick visualizations of your data. Embed your charts back into your project as files or insights, and easily share with other users.

Select Chart at the top of the data table or query results, or select Open in App and select Chart Builder:

360011809193-mceclip25.png

This will open a new browser tab/window showing the chart builder interface. Once you have set your chart configuration, a preview will be displayed in the main window:

360011762934-mceclip26.png

Your chart can now be downloaded as a graphic or JSON for embedding into other applications, or you can also embed it back into your project or dataset as a file or insight using the Save as… option:

360011811433-mceclip27.png
360011764294-mceclip29.png
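For readers curious what a chart looks like as JSON, here is a minimal sketch of a Vega-Lite bar chart specification built in Python. It shows the general shape of a Vega-Lite spec (data, mark, encoding); the helper `bar_chart_spec`, the field names, and the sample rows are assumptions, and the JSON that Chart Builder actually exports may contain more or different properties.

```python
import json


def bar_chart_spec(rows, x_field, y_field, title):
    """Build a minimal Vega-Lite bar chart specification as a dict (illustrative)."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "title": title,
        "data": {"values": rows},  # data embedded inline in the spec
        "mark": "bar",
        "encoding": {
            "x": {"field": x_field, "type": "nominal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }


rows = [{"city": "Austin", "total": 200}, {"city": "Houston", "total": 50}]
spec = bar_chart_spec(rows, "city", "total", "Sales by city")
spec_json = json.dumps(spec, indent=2)
```

The resulting `spec_json` string can be pasted into any tool that renders Vega-Lite, which is what makes a JSON export useful for embedding charts in other applications.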

Similar to adding contributors to a dataset, you can add people to your project. Go to the People tab from your project page to add and manage individual and team access to your project:

360011764354-mceclip30.png

Once others have access to your project, you can utilize discussions to engage team members, ask questions, and discuss elements such as charts, queries, or insights. All discussions support markdown, so you're able to embed images, charts, and even other datasets and queries.

Discussions can be had at the project and topic level under the Discussions tab:

360011764414-mceclip31.png

Or on specific insights posted to a project:

360011764434-mceclip32.png

Tag people in your posts to ensure they get notified, and embed visualizations or other embeddable content using the embed URL options (see some examples). For example, the following post embeds an interactive Tableau visualization, and also tags the user @hhaveliw for credit:

360011764494-mceclip33.png

When posted, it would render as:

360011764614-mceclip34.png

All data.world datasets and projects are logged and versioned so you can see what activity has happened across your team, as well as get back to previous revisions. Access past activity through the Activity tab on your dataset or project:

mceclip35.png

Select Versions to download previous versions:

360011811853-mceclip36.png

There are many more resources where you can find data.world information and support. Here are some useful links, and please reach out if you're unable to find what you're looking for or have feedback for our team!