Key terms and roles

Data virtualization

Data Virtualization is the first key component. It involves the ability to query data either in place or stored within data.world. This capability allows users to combine multiple data sources seamlessly. Common data stores that the AI context engine interacts with include Snowflake, BigQuery, and SQL Server. For the purpose of training and testing, users can directly upload CSV files into data.world to serve as data sources. Within this system, datasets represent the data tables stored in data.world, while projects serve as containers for queries, configuration files, and temporary files.
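As a rough illustration of combining an uploaded CSV file with an existing table in a single query, the sketch below uses Python's built-in `sqlite3` and `csv` modules as a stand-in. It does not use data.world's API or any connected warehouse. The table names, columns, and values are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content, standing in for a file uploaded for testing.
ORDERS_CSV = "order_id,customer_id,total\n100,1,25.0\n101,1,10.0\n102,2,40.0\n"

conn = sqlite3.connect(":memory:")

# Stand-in for a table that already lives in a connected database.
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])

# Load the CSV rows into a second table so both sources can be queried together.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")
reader = csv.DictReader(io.StringIO(ORDERS_CSV))
conn.executemany(
    "INSERT INTO orders VALUES (:order_id, :customer_id, :total)",
    list(reader),
)

# One query that combines the two sources.
totals = conn.execute(
    """
    SELECT c.first_name, SUM(o.total)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.first_name
    ORDER BY c.customer_id
    """
).fetchall()

for first_name, total in totals:
    print(first_name, total)
```

In a real deployment the join would run across the connected data stores rather than an in-memory database; the point here is only that both sources are addressed in one query.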

Business glossary

The second essential component is the Glossary, which provides business definitions for the terms used within datasets. Typically, collectors or crawlers are run to gather metadata about tables and dashboards from various projects, and this collected metadata forms the basis of the Glossary. Definitions, such as what constitutes a customer or an order, are textual and often name the steward, the person responsible for maintaining them. Although the Glossary is optional, it significantly enhances the context provided to AI applications, making it a powerful addition.

R2RML mapping

R2RML (RDB to RDF Mapping Language) is a standard for expressing how to map relational database content to RDF (Resource Description Framework) data. It allows for the creation of RDF views over existing relational data, enabling this data to be queried with SPARQL, the RDF query language. R2RML mappings define how tables, columns, and SQL queries in a relational database correspond to RDF triples, facilitating the integration of relational data into the semantic web.
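To make the idea of mapping rows to triples concrete, here is a minimal sketch in plain Python. It is not an R2RML processor, and real R2RML mappings are written in RDF (usually Turtle) using terms such as `rr:template`, `rr:class`, and `rr:predicateObjectMap`. The dictionary below only imitates that shape, and the table, columns, and `example.com` URIs are hypothetical.

```python
# Illustrative only: a hand-rolled imitation of what an R2RML triples map expresses.
# The table contents, column names, and URIs are hypothetical.

ROWS = [  # stand-in for rows of a relational "customer" table
    {"customer_id": 1, "first_name": "Ada"},
    {"customer_id": 2, "first_name": "Grace"},
]

MAPPING = {
    # Plays the role of rr:template in a subject map.
    "subject_template": "http://example.com/customer/{customer_id}",
    # Plays the role of rr:class.
    "class": "http://example.com/ontology/Customer",
    # Plays the role of rr:predicateObjectMap entries with rr:column objects.
    "predicate_object_maps": [
        {"predicate": "http://example.com/ontology/firstName", "column": "first_name"},
    ],
}

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def rows_to_triples(rows, mapping):
    """Apply the mapping to each row and return (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = mapping["subject_template"].format(**row)
        triples.append((subject, RDF_TYPE, mapping["class"]))
        for pom in mapping["predicate_object_maps"]:
            value = row[pom["column"]]
            if value is not None:  # a NULL column value yields no triple
                triples.append((subject, pom["predicate"], value))
    return triples


triples = rows_to_triples(ROWS, MAPPING)
for triple in triples:
    print(triple)
```

Each row yields one type triple plus one triple per mapped column, which is the kind of RDF view a SPARQL query would then run against.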

Ontology

An ontology is a set of concepts and categories in a subject area or domain that shows their properties and the relations between them. For example, an ontology about books might include the following properties:

  • Books have authors

  • Books have publishers

  • Books are published on a date

  • Books are followed by sequels (other books)

Some of these properties are relationships that connect two of our classes. For example, the property “books have authors” is a relationship that connects our book class and our author class. Other properties, such as “books are published on a date,” are attributes, describing only one class, instead of connecting two classes together.
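The distinction between relationships and attributes can be sketched in a few lines of Python. This is an illustration only, not an ontology language such as OWL. The `Property` type, the property names, and the `Publisher` class are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Property:
    """A hypothetical, simplified ontology property."""
    name: str
    domain: str  # the class the property describes
    range: str   # another class (relationship) or a datatype (attribute)


# Classes in the example book ontology (Publisher is assumed from "Books have publishers").
CLASSES = {"Book", "Author", "Publisher"}

PROPERTIES = [
    Property("hasAuthor", "Book", "Author"),        # Books have authors
    Property("hasPublisher", "Book", "Publisher"),  # Books have publishers
    Property("publishedOn", "Book", "date"),        # Books are published on a date
    Property("followedBy", "Book", "Book"),         # Books are followed by sequels
]


def is_relationship(prop: Property) -> bool:
    """A property is a relationship when its range is itself a class."""
    return prop.range in CLASSES


relationships = [p.name for p in PROPERTIES if is_relationship(p)]
attributes = [p.name for p in PROPERTIES if not is_relationship(p)]

print("relationships:", relationships)
print("attributes:", attributes)
```

Here `followedBy` counts as a relationship even though both ends are the book class, because it still connects one class instance to another rather than to a plain value.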

Semantic model

The Semantic Model is represented by three key files: the Ontology file, the Mappings file, and the Index file.

  • The Ontology file includes entities known as Concepts (e.g., customer or order), characteristics of these entities called Attributes (e.g., customer ID, first name, last name), and the Relationships that connect these entities (e.g., an order was made by a customer). These elements are collectively known as CARS.

  • The Mappings file links the ontology to specific data sources, such as indicating that customer data is located in the customer table.

    Mappings within the system can be either direct or complex. Direct mappings are straightforward one-to-one correspondences between concepts and data, while complex mappings may involve additional clauses or conditions to specify relationships more intricately.

  • The Index file is a system file that helps vectorize this information, storing it in a specialized knowledge graph vector. It is crucial for system operation.

    These three files live within a Project in data.world. That project is associated with your particular AI context engine application.
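The difference between direct and complex mappings described above can be sketched with Python's built-in `sqlite3` module. This is not the format of the Mappings file. The concept names, the `customer` table, and the `status` column are hypothetical, and each mapping is shown simply as the SQL that would supply a concept's data.

```python
import sqlite3

# Hypothetical source table standing in for a connected data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, first_name TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ada", "active"), (2, "Grace", "inactive")],
)

MAPPINGS = {
    # Direct mapping: the concept corresponds one-to-one to a table.
    "Customer": "SELECT customer_id, first_name FROM customer",
    # Complex mapping: an extra condition narrows which rows belong to the concept.
    "ActiveCustomer": (
        "SELECT customer_id, first_name FROM customer WHERE status = 'active'"
    ),
}


def rows_for(concept: str):
    """Return the rows that the mapping for `concept` selects, ordered by ID."""
    return conn.execute(MAPPINGS[concept] + " ORDER BY customer_id").fetchall()


print(rows_for("Customer"))
print(rows_for("ActiveCustomer"))
```

A direct mapping returns every row of the table, while the complex mapping returns only the rows that satisfy its additional clause.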

To learn more about the differences between ontologies and knowledge graphs, see this article.

Developer Chat UI

The Developer Chat UI, an essential part of this component, provides features such as thoughts, visualization of queries, and ontology usage, making it indispensable for developers. Access is enabled per user; contact support to have it turned on for each individual who needs it.

Key roles and personas

  • Business / Subject Matter Expert (SME):

    • Role: Understand and articulate the core business problem and the key questions that need addressing.

    • Responsibilities:

      • Interpret existing BI dashboards/reports and convey what the reports indicate (for example, “the year-over-year revenue report compares monthly data from a year ago”).

      • Explain the usage and purpose of these reports and the business concepts they represent (for example, “Return Customer” refers to “a customer who has placed an order after having placed a previous order within the past year”).

      • Provide insights into the business expectations surrounding these queries.

    • Important note: SMEs are not required to have in-depth knowledge of the data itself, but should understand how it is applied and its business utility.

  • Data Expert:

    • Role: Handle the technical aspects of data management and modeling.

    • Responsibilities:

      • Data modeling: Identify and model the available data, and, if necessary, create new data views.

      • Data plumbing: Know how to set up the database, including credentials, views, partitions, and whatever else is relevant for the particular database technology you are using.

      • Answer pipeline-related questions (for example, “Where does this data originate?”) and be familiar with the data schema (for example, “We don’t have a separate table for address; we repeat that information everywhere it shows up. It is inconvenient, but that is how it has always been done”).

  • Knowledge Engineer:

    • Role: Act as the intermediary between Data Experts and Business/Subject Matter Experts.

    • Responsibilities:

      • Collaborate with Business/Subject Matter Experts to comprehend business queries and define the necessary ontology and mappings.

      • Work with Data Experts to identify the data that corresponds to the required ontology.

      • Possess strong communication skills to effectively bridge the terminology and understanding between the other two personas.

  • Security team: During preparation tasks, ensure access to a member of your Security team. They should be on standby to provide necessary access to data sources, databases, and other required resources.