
Dataset best practices

When a dataset will support multiple analysis projects, keep its data, metadata, and documentation together in a dedicated dataset. Then create a project to manage the analysis work, and link to the dataset instead of duplicating it. For more details, refer to our article on datasets vs. projects.

By following these best practices, you can ensure that your datasets are efficient, accessible, and valuable to both your team and the broader data community.

Data upload recommendations

  • File format: For tabular data, use the CSV file format rather than Excel (xlsx). CSV supports larger file sizes and more efficient querying.

  • Data file structure: Remove title rows, footers, and extraneous notes so that the file contains a single row of column headers followed by the data. Include anything you remove from the data file in the dataset summary, or upload it as a separate notes file. This practice ensures maximum compatibility with data.world's import and analysis tools.
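The clean-up described above can be done before upload. The following is a minimal sketch in Python using only the standard library. It assumes you already know which row holds the column headers and how many trailing footer rows there are. The function names and sample data are hypothetical and are not part of any data.world tool.

```python
import csv
import io


def split_table_and_notes(text, header_row, footer_rows=0):
    """Separate the tabular block from the notes around it.

    header_row  -- zero-based index of the row holding the column headers (assumed known)
    footer_rows -- number of trailing non-data rows to strip (assumed known)
    """
    rows = list(csv.reader(io.StringIO(text)))
    end = len(rows) - footer_rows
    table = rows[header_row:end]             # header row plus data rows
    removed = rows[:header_row] + rows[end:]  # title rows and footers
    # Keep the removed content as plain text lines, skipping empty rows.
    note_lines = [" ".join(cell for cell in row if cell.strip()) for row in removed]
    note_lines = [line for line in note_lines if line]
    return table, note_lines


def to_csv(table):
    """Serialize rows back to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()


# Hypothetical export with a title row above the table and a footer below it.
raw = (
    "Quarterly sales report\n"
    "\n"
    "region,units\n"
    "north,120\n"
    "south,95\n"
    "Source: internal export\n"
)
table, notes = split_table_and_notes(raw, header_row=2, footer_rows=1)
clean_csv = to_csv(table)
```

Here `clean_csv` holds only the header row and the data rows, ready to save as a CSV file. `notes` holds the removed lines, which can be pasted into the dataset summary or saved as a separate notes file.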

Data documentation

  • Tag and annotate: Clearly tag and document your data to enhance understanding and usability for others.

  • Data verification: Use the data inspector to confirm that your data imported successfully and to assess its quality.

Dataset size and organization

  • Size limitations: Ensure your dataset complies with data.world's size limitations. You can upload a zip file and extract it post-upload if the application allows. For support on larger datasets, contact us, and we will strive to accommodate your needs.

  • File organization: Files within a dataset are arranged alphabetically. If a specific order is necessary, prepend numerical prefixes (e.g., 01_.xls, 02_.pdf) to the file names or use the dataset's summary to guide others through the data and analysis.
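If you have many files to order, the numerical prefixes can be generated rather than typed by hand. The following is a minimal sketch in Python. It assumes a plain character-by-character alphabetical sort, and the file names are hypothetical. Zero-padding matters under that assumption, because an unpadded "10_" would sort ahead of "2_".

```python
def add_order_prefixes(file_names):
    """Return the names prefixed with zero-padded positions, in the given order."""
    # Use at least two digits, and more if the list is long enough to need them.
    width = max(2, len(str(len(file_names))))
    return [f"{i:0{width}d}_{name}" for i, name in enumerate(file_names, start=1)]


# Hypothetical file names, listed in the order readers should see them.
desired_order = [
    "readme.pdf",
    "raw_survey.csv",
    "cleaned_survey.csv",
    "analysis_notes.pdf",
]
prefixed = add_order_prefixes(desired_order)

# An alphabetical sort of the prefixed names preserves the intended order.
assert sorted(prefixed) == prefixed
```

Rename the files to the prefixed names before uploading. The dataset summary can still describe what each file contains.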

Collaboration and duplication prevention

  • Avoid duplication: Before uploading, check if the dataset already exists. If it does, consider collaborating or linking directly to that dataset instead of creating a duplicate. While maintaining your own copy is fine, sharing through collaboration or direct linking helps consolidate the dataset's narrative in a single location.