Documentation

Welcome to the Semantic Data Dictionary wiki!

This wiki contains documentation on the Semantic Data Dictionary (SDD). This approach provides a way to annotate data such that entities in a dataset and their relationships can be accurately represented by encoding mappings to a background set of ontologies.

SDD Workflow

The Semantic Data Dictionary is a specification formalizing how to assign a semantic representation of data by annotating dataset variables and their values using concepts from best practice vocabularies and ontologies. It is a collection of individual documents that each play a role in creating a concise and consistent knowledge representation, including the Dictionary Mapping, Codebook, Timeline, and Code Mapping specifications, and the Infosheet, which is used to link these Semantic Data Dictionary elements together.

We implement the SDD as a collection of tabular data sheets which can be written in Excel or Comma Separated Value (CSV) files. In order to organize the collection of sheets in the SDD, we use the Infosheet, which contains information about the Semantic Data Dictionary data model being described, as well as the location of the other SDD tables.

Infosheet

The Infosheet is essentially the configuration document of the SDD structure. It is used to organize the SDD tables and contains information about the Semantic Data Dictionary, such as its name, identifier, or a link to its documentation, in addition to the locations of the other SDD tables. The SDD tables are usually a collection of CSV files that hold most of the information about the dataset's entities and their relationships.

| Infosheet Row | Related Property | Description | Example |
| --- | --- | --- | --- |
| Code Mapping | | Reference to Code Mapping table location | http://… |
| Codebook | | Reference to Codebook table location | http://… |
| Dictionary Mapping | | Reference to Dictionary Mapping table location | http://… |
| Imports | owl:imports | Ontologies that the SDD references | http://semanticscience.org/ontology/sio-subset-labels.owl |
| Timeline | | Reference to Timeline table location | http://… |

The Infosheet should follow the Distribution Level Dataset Description based on the HCLS standards and the Data on the Web Best Practices.
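As a rough illustration of how the Infosheet ties the other tables together, the following sketch reads an Infosheet from CSV text and collects the table locations. The two-column layout with the headers `Attribute` and `Value`, the function name `read_infosheet`, and the file locations are assumptions made for this example; they are not taken from the sdd2rdf script.

```python
import csv
import io

# Hypothetical Infosheet content. The "Attribute"/"Value" headers and the
# relative file locations are assumptions made for this sketch.
INFOSHEET_CSV = """Attribute,Value
Dictionary Mapping,tables/example-dm.csv
Codebook,tables/example-cb.csv
Timeline,tables/example-tl.csv
Code Mapping,tables/example-code-mappings.csv
Imports,http://semanticscience.org/ontology/sio-subset-labels.owl
"""


def read_infosheet(text):
    """Return a dict mapping each Infosheet row name to its value."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["Attribute"].strip(): row["Value"].strip() for row in reader}


infosheet = read_infosheet(INFOSHEET_CSV)

# Keep only the rows that point at other SDD tables.
TABLE_ROWS = ("Dictionary Mapping", "Codebook", "Timeline", "Code Mapping")
table_locations = {name: infosheet[name] for name in TABLE_ROWS if name in infosheet}

print(table_locations["Dictionary Mapping"])
```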

Dictionary Mapping

The bulk of the annotation is done using the Dictionary Mapping (DM) table, which is used to annotate the columns of a given dataset. The DM table contains entries describing concepts explicit in the original dataset, as well as implicit entries. The explicit entries contain mappings to the underlying attribute that is described by a particular dataset column, as well as provenance information such as how that variable was generated or derived. Implicit entries are used to describe entities that are implicit within the dataset, usually related to one or more of the explicit entries in the dataset, such as the entity being measured, or the time at which a measurement was taken. These entities are readily recognized by human data users even though there is no column in the dataset that refers to them directly, but we must make them explicit for machines. These implicit entities can then be described with type, role, relation, and other information in the same manner as the explicit columns in the dataset. The SDD DM Specification is shown below.

| DM Column | Related Property | Description |
| --- | --- | --- |
| Attribute | rdf:type | Class of attribute entry |
| attributeOf | sio:isAttributeOf | Entity having the attribute |
| Column | | Entry column header in dataset |
| Comment | rdfs:comment | Comment for the entry |
| Definition | skos:definition | Entry text definition |
| Entity | rdf:type | Class of entity entry |
| Format | | Specifies the structure of the Unit value |
| inRelationTo | sio:inRelationTo | Entity that the role is linked to |
| Label | rdfs:label | Label for the entry |
| Property | | Custom datatype property specification |
| Relation | | Custom relation that replaces inRelationTo |
| Role | sio:hasRole | Type of the role of the entry |
| Time | sio:existsAt | Time point of measurement |
| Unit | sio:hasUnit | Unit of Measure for entry |
| wasDerivedFrom | prov:wasDerivedFrom | Entity from which the entry was derived |
| wasGeneratedBy | prov:wasGeneratedBy | Activity from which the entry was produced |

The names of the explicit and implicit entries are stored in the Dictionary Mapping column called “Column”; for explicit entries, this is the column name in the dataset. Annotation properties including comments, labels, or definitions can be provided to describe an explicit or implicit (virtual) entry in further detail for the human reader. If an entry describes a characteristic, the Attribute column should be populated with an appropriate class, and when appropriate the attributeOf column should be used to reference the entity to which the attribute belongs. Usually the attributeOf column contains the implicit entity of which the explicit entry is a characteristic.

For instance, as in the example Dictionary Mapping further below, a dataset may contain a column called age. Although a human reader can easily understand that the age refers to the mother, the dataset often doesn’t state this explicitly; if we aim to create a full semantic representation of the data, this relation must be defined for the computer. Since mother is not a column in the dataset, it is considered an implicit entry, denoted by the ‘??’ in front of the word mother.
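The ‘??’ convention makes it easy to tell explicit and implicit entries apart programmatically. The sketch below is only an illustration of that convention; the helper names `is_implicit` and `split_entries` are hypothetical and not part of sdd2rdf.

```python
IMPLICIT_PREFIX = "??"


def is_implicit(entry_name):
    """An entry is implicit when its name starts with '??'."""
    return entry_name.startswith(IMPLICIT_PREFIX)


def split_entries(names):
    """Separate entry names into (explicit, implicit) lists, keeping order."""
    explicit = [name for name in names if not is_implicit(name)]
    implicit = [name for name in names if is_implicit(name)]
    return explicit, implicit


explicit, implicit = split_entries(["age", "race", "??mother", "??child"])
print(explicit)
print(implicit)
```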

If an entry describes an object, an applicable class should be included in the Entity column, along with the role that differentiates it as a member of that object class. Using the same example, if we look at the row for the implicit entry mother (??mother), mother is not a characteristic of a human, but rather a type of human, the entity. Thus we add the class sio:Human to the Entity column, and the role that defines its type in relation to the entity to the Role column (chear:Mother).

In general, for each row in the Dictionary Mapping, either the Entity or the Attribute column should be populated with an appropriate class, but we must be careful where and how we classify the entry. For instance, a column called Race would be the attribute of a human, but a column called Caucasian would be related to the entity human. To determine the correct columns to fill, one may find it useful to explore the hierarchy of the relevant term.
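The rule that every row should carry either an Entity or an Attribute class lends itself to a simple consistency check. The following is a minimal sketch, assuming the Dictionary Mapping has been loaded as a list of dicts keyed by column header; the function name `rows_missing_class` and the `notes` row are hypothetical and only serve the example.

```python
def rows_missing_class(rows):
    """Return the names of DM rows that have neither an Entity nor an Attribute."""
    missing = []
    for row in rows:
        attribute = (row.get("Attribute") or "").strip()
        entity = (row.get("Entity") or "").strip()
        if not attribute and not entity:
            missing.append(row["Column"])
    return missing


dm_rows = [
    {"Column": "age", "Attribute": "sio:Age", "Entity": ""},
    {"Column": "??mother", "Attribute": "", "Entity": "sio:Human"},
    # Hypothetical incomplete row, included to show what the check reports.
    {"Column": "notes", "Attribute": "", "Entity": ""},
]

print(rows_missing_class(dm_rows))
```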

In the case that an entry references another item, the object should be stored in the inRelationTo column. By default, if both the Role and Relation columns are empty, the knowledge graph created will connect the main entry to the value in the inRelationTo column using the SIO property sio:inRelationTo. Both the Relation and Role columns can override this property and denote a custom relationship between the values. One common example of this is the SIO property isPartOf. In the case that both columns are filled, the Role reference for that entry in the knowledge graph will be sio:hasRole. It is important to note that the Role column can be used independently of the inRelationTo column.

The units of a given variable and the format of the data in the cell can be specified in the Unit and Format columns, respectively. For instance, if one of the entries is the age of an object, the Unit column should specify whether the age is stored in years, days, months, etc.

Time instances (events) or time intervals associated with an entry should be referenced in the Time column. Time intervals are not solitary occurrences but rather things that span a period of time. In our example, one such interval is ??visit1, which represents visits anytime during the first trimester of pregnancy. Entries in the Time column that are time intervals, not time instances, should also be noted in the Timeline.

Provenance information pertaining to how the variable was derived or generated can be included in the wasDerivedFrom and wasGeneratedBy columns, respectively. An example Dictionary Mapping from the CHEAR project is provided below.

Column Attribute attributeOf Entity Unit Time Role inRelationTo wasDerivedFrom
id sio:Identifier ??child         ??study  
race sio:Race ??mother            
age sio:Age ??mother   sio:Year ??visit1      
edu chear:EducationLevel ??mother     ??visit2      
insur chear:InsuranceType ??mother     ??visit3      
urineam_3 sio:Quality ??sample3     ??visit3      
t1bmi chear:BMI ??mother   kg/m2 ??visit1     t1weight, ??height
t1weight chear:Weight ??mother   kg ??visit1      
smoke chear:SmokingStatus ??mother     ??pregnancy      
birthwt chear:Weight ??child   g ??birth      
??height sio:Height ??mother            
??mother     sio:Human     chear:Mother ??child  
??child     sio:Human     chear:Child ??mother  
??birth     sio:Birthing       ??child  
??pregnancy     chear:PregnancyPeriod       ??birth  
??conception     chear:Conception       ??child  
??sample3     U         ??mother

Codebook

The Codebook table has a similar role to the Dictionary Mapping table; while the Dictionary Mapping serves to encode the meanings of the column headers in the dataset, the Codebook contains all the possible values for each column and their associated labels. For instance, if you had a dataset that contained information about people, one of the possible columns may be gender, which would be the entry in the Dictionary Mapping, while the possible values for that column, male and female, would be entries in the Codebook.

| Codebook Column | Related Property | Description |
| --- | --- | --- |
| Class | rdf:type | Class the Code refers to |
| Code | sio:hasValue | Value of the dataset entry |
| Column | | Entry column header in dataset |
| Comment | rdfs:comment | Comment for the codebook entry |
| Definition | skos:definition | Definition for the codebook entry |
| Label | rdfs:label | Label for the codebook entry |
| Resource | rdf:type | Resource URI the Code refers to |

For variables with discrete values, when appropriate, we augment each possible value with mappings to corresponding concepts, as shown in the table below.

Column Code Label Class
race 0 white chear:White
race 1 black chear:BlackOrAfricanAmerican
race 2 other chear:OtherRace
edu 0 high school degree or less chear:HighSchoolOrLess
edu 1 technical college or some college chear:SomeCollege
edu 2 college graduate chear:CollegeGraduate
edu 3 above chear:AdvancedDegree
smoke 0 no smoking in pregnancy chear:NonSmoker
smoke 1 some smoking in pregnancy chear:Smoker
insur 0 private/hmo/self-pay chear:NoPublicInsurance
insur 1 public chear:PublicInsurance
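To make the role of the Codebook concrete, the sketch below builds a lookup from (column, code) pairs to their label and class, using a few rows from the example table above, and applies it to a raw cell value. The names `CODEBOOK_ROWS` and `annotate_value` are hypothetical; this is one possible way to use the table, not the sdd2rdf implementation.

```python
# A subset of the example Codebook rows: (Column, Code, Label, Class).
CODEBOOK_ROWS = [
    ("race", "0", "white", "chear:White"),
    ("race", "1", "black", "chear:BlackOrAfricanAmerican"),
    ("race", "2", "other", "chear:OtherRace"),
    ("smoke", "0", "no smoking in pregnancy", "chear:NonSmoker"),
    ("smoke", "1", "some smoking in pregnancy", "chear:Smoker"),
]

# Index the rows by (column, code) so each cell value resolves in one lookup.
CODEBOOK = {(column, code): (label, cls) for column, code, label, cls in CODEBOOK_ROWS}


def annotate_value(column, value):
    """Return (label, class) for a coded cell value, or None if it is not coded."""
    return CODEBOOK.get((column, str(value)))


print(annotate_value("race", 1))
print(annotate_value("age", 30))
```

Columns that hold continuous values, such as age, simply have no Codebook entries, so the lookup returns `None` for them in this sketch.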

Timeline

Customized time intervals can be specified in the Timeline sheet, which can be used to annotate the corresponding class and unit related to a given entry, as well as the start and end times of an event, and a connection to concepts that the entry may be related to. When using the Timeline, ensure that the time entry in the table is a time interval rather than a time instance. For example, a birthday would not be in the Timeline; rather, it should be viewed as a characteristic of a subject. On the other hand, in the CHEAR study, the data tracks child development in terms of observations taken at specific times relative to the birth or conception of the child. Comparing measurements across subjects for a particular time such as “the second trimester of pregnancy” requires that we have a concept to describe this time interval, even though it will not necessarily fall during the same calendar week for any two subjects.

The Timeline Specification is shown below.

| Timeline Column | Related Property | Description |
| --- | --- | --- |
| End | sio:hasEndTime | End time associated with the timeline entry |
| inRelationTo | sio:inRelationTo | What the timeline entry is in relation to |
| Label | rdfs:label | Label for the timeline entry |
| Name | | Reference to the virtual timeline entry |
| Start | sio:hasStartTime | Start time associated with the timeline entry |
| Type | rdf:type | Class of the timeline entry |
| Unit | sio:hasUnit | Unit of time |

An example Timeline table is shown below.

Name Label Type Start End Unit inRelationTo
??visit1 Visit 1 chear:Visit 4.71 19.1 sio:Week ??conception
??visit2 Visit 2 chear:Visit 14.9 32.1 sio:Week ??conception
??visit3 Visit 3 chear:Visit 22.9 38.3 sio:Week ??conception
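The example Timeline rows can be modeled as simple interval records. The sketch below, which is an illustration rather than the sdd2rdf data model, stores the three visits and reports which intervals contain a given number of weeks since conception. The class name `TimelineEntry`, the function `intervals_containing`, and the choice to treat both interval ends as inclusive are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEntry:
    name: str
    label: str
    type: str
    start: float
    end: float
    unit: str
    in_relation_to: str


TIMELINE = [
    TimelineEntry("??visit1", "Visit 1", "chear:Visit", 4.71, 19.1, "sio:Week", "??conception"),
    TimelineEntry("??visit2", "Visit 2", "chear:Visit", 14.9, 32.1, "sio:Week", "??conception"),
    TimelineEntry("??visit3", "Visit 3", "chear:Visit", 22.9, 38.3, "sio:Week", "??conception"),
]


def intervals_containing(weeks, timeline=TIMELINE):
    """Return the names of all timeline entries whose interval contains `weeks`.

    Both ends are treated as inclusive, which is an assumption of this sketch.
    """
    return [entry.name for entry in timeline if entry.start <= weeks <= entry.end]


print(intervals_containing(10))
print(intervals_containing(16))
```

Note that the example intervals overlap, so a single time point can fall into more than one visit.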

Code Mappings

The Code Mappings table contains mappings of abbreviated terms or units to their corresponding ontology concepts.

This aids the annotator by allowing the use of shorthand notations instead of having to repeatedly search for the URI of the ontology class. An example set of code mappings is shown below.

code uri label
Pb chebi:25016 Lead
S uberon:0001977 Serum
cm obo:UO_0000015 centimeter
kg obo:UO_0000009 kilogram
kg/m2 obo:UO_0000086 kilogram per square meter
mgL obo:UO_0000273 milligrams per liter

The set of code mappings used in the CHEAR project is useful for a variety of domains, and can be found on GitHub.
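As a small illustration of how such shorthand might be expanded, the sketch below looks up a code in the example mappings above and returns the corresponding ontology term. Returning unknown values unchanged is an assumption made for this example, and `expand_code` is a hypothetical helper name.

```python
# The example code mappings from the table above: code -> uri.
CODE_MAPPINGS = {
    "Pb": "chebi:25016",
    "S": "uberon:0001977",
    "cm": "obo:UO_0000015",
    "kg": "obo:UO_0000009",
    "kg/m2": "obo:UO_0000086",
    "mgL": "obo:UO_0000273",
}


def expand_code(value, mappings=CODE_MAPPINGS):
    """Replace a shorthand code with its mapped term; leave other values as-is."""
    return mappings.get(value, value)


print(expand_code("kg/m2"))
print(expand_code("sio:Year"))
```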

Configuration

The config.ini file is the configuration file used by the sdd2rdf script. Note that file locations written in this config file can be absolute paths or URLs, as well as paths relative to the directory containing the sdd2rdf.py script. An example configuration file is shown below.

You may notice that the file addresses are stored in both the infosheet and the configuration file. In the scenario that the locations are different in the two files, the script will use the infosheet.

[Prefixes]
# Specify a file with the prefixes for existing ontologies used in your translation
prefixes = Cheese/config/prefixes.txt
# Specify the base uri to be associated with all triples minted by the script
base_uri = cheese-kb

[Source Files]
# Specify the location of the Dictionary Mapping file
dictionary = Cheese/input/DM/cheeseDM.csv
# Specify the location of the Codebook file
codebook = Cheese/input/CB/cheeseCB.csv
# Specify the location of the Timeline file
timeline = Cheese/input/TL/cheeseTL.csv
# Specify the location of the Code Mapping file
code_mappings = Cheese/config/code_mappings.csv
# Specify the location of the Data file
data_file = Cheese/input/Data/cheese.csv
# Specify the location of the Properties customization file
properties = Cheese/config/cheeseProperties.csv
# Specify the location of the Infosheet file
infosheet = Cheese/config/cheeseInfosheet.csv

[Output Files]
# Specify the location where the output RDF will be written to
out_file = Cheese/output/trig/cheese-kg.trig
# Specify the location where the output SPARQL query will be written to
query_file = Cheese/output/sparql/cheeseQ
# Specify the location where the output SWRL model will be written to
swrl_file = Cheese/output/swrl/cheeseSWRL
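A file in this format can be read with Python's standard `configparser` module. The sketch below parses an abbreviated copy of the example configuration and resolves a location the way the text describes: URLs and absolute paths are kept as they are, and relative paths are joined to the script directory. The helper `resolve_location` and the directory `/opt/sdd2rdf` are hypothetical, and this is not necessarily how sdd2rdf itself handles paths.

```python
import configparser
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Abbreviated copy of the example configuration above.
CONFIG_TEXT = """
[Prefixes]
prefixes = Cheese/config/prefixes.txt
base_uri = cheese-kb

[Source Files]
dictionary = Cheese/input/DM/cheeseDM.csv
data_file = Cheese/input/Data/cheese.csv

[Output Files]
out_file = Cheese/output/trig/cheese-kg.trig
"""


def resolve_location(value, script_dir):
    """Keep URLs and absolute paths; join relative paths to the script directory."""
    if urlparse(value).scheme in ("http", "https"):
        return value
    path = PurePosixPath(value)
    if path.is_absolute():
        return str(path)
    return str(PurePosixPath(script_dir) / path)


config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

# "/opt/sdd2rdf" stands in for the directory that holds sdd2rdf.py.
dictionary_path = resolve_location(config["Source Files"]["dictionary"], "/opt/sdd2rdf")
print(dictionary_path)
```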

Prefixes

The prefixes.csv file is used to specify the namespace URIs for the prefixes used throughout the annotated SDD tables. An example prefix file is shown below.

prefix url
np http://www.nanopub.org/nschema#
owl http://www.w3.org/2002/07/owl#
rdf http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs http://www.w3.org/2000/01/rdf-schema#
prov http://www.w3.org/ns/prov#
xsd http://www.w3.org/2001/XMLSchema#
uo http://purl.obolibrary.org/obo/UO_
sio http://semanticscience.org/resource/
stato http://purl.obolibrary.org/obo/STATO_
example-kb http://example.com/kb/example#

Note that the prefix that you include in the configuration file as the base URI should also be included in the prefixes file.
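To show what the prefix table is used for, the following sketch expands a prefixed name such as `sio:Human` into a full URI using a few of the example prefixes above. The function `expand_curie` is a hypothetical helper, and leaving names with an unknown prefix unchanged is an assumption of this example.

```python
# A few of the example prefixes from the table above: prefix -> namespace URI.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "prov": "http://www.w3.org/ns/prov#",
    "uo": "http://purl.obolibrary.org/obo/UO_",
    "sio": "http://semanticscience.org/resource/",
}


def expand_curie(name, prefixes=PREFIXES):
    """Expand 'prefix:local' to a full URI; return the input if the prefix is unknown."""
    prefix, separator, local = name.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local
    return name


print(expand_curie("sio:Human"))
print(expand_curie("uo:0000009"))
```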

Property Customization

Customization of properties used in generating KG

The Semantic Data Dictionary approach creates a linked representation of the class or collection of datasets it describes.

The default model that sdd2rdf creates is based on the Semanticscience Integrated Ontology (SIO), which can be used to describe a wide variety of objects using a fixed set of terms.

The default model that we adopt further incorporates annotation properties from RDFS and SKOS, and provenance predicates from PROV-O.

The default set of properties is shown below.

Column Property
Attribute rdf:type
attributeOf sio:isAttributeOf
Comment rdfs:comment
Definition skos:definition
Entity rdf:type
inRelationTo sio:inRelationTo
Label rdfs:label
Role sio:hasRole
Time sio:existsAt
Unit sio:hasUnit
Value sio:hasValue
wasDerivedFrom prov:wasDerivedFrom
wasGeneratedBy prov:wasGeneratedBy

By specifying the associated properties with certain columns of the Dictionary Mapping Table, the properties used in generating the knowledge graph can be customized.

This means that it is possible to use an alternate knowledge representation model, and thus makes this approach ontology agnostic.

Nevertheless, we urge the user to practice caution (for example, don’t replace an object property with a datatype property) when customizing the properties used to ensure that the resulting graph is semantically consistent.
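Conceptually, customization amounts to overriding entries in the default column-to-property table. The sketch below illustrates that idea with a plain dictionary merge; the function `customize_properties`, the check for unknown column names, and the property `ex:measuredAt` are assumptions for this example and do not describe the format of the properties file read by sdd2rdf.

```python
# Default column-to-property assignments from the table above.
DEFAULT_PROPERTIES = {
    "Attribute": "rdf:type",
    "attributeOf": "sio:isAttributeOf",
    "Comment": "rdfs:comment",
    "Definition": "skos:definition",
    "Entity": "rdf:type",
    "inRelationTo": "sio:inRelationTo",
    "Label": "rdfs:label",
    "Role": "sio:hasRole",
    "Time": "sio:existsAt",
    "Unit": "sio:hasUnit",
    "Value": "sio:hasValue",
    "wasDerivedFrom": "prov:wasDerivedFrom",
    "wasGeneratedBy": "prov:wasGeneratedBy",
}


def customize_properties(overrides, defaults=DEFAULT_PROPERTIES):
    """Return a new mapping in which `overrides` replace the default properties."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise ValueError("Unknown column(s): " + ", ".join(unknown))
    return {**defaults, **overrides}


# "ex:measuredAt" is a made-up property used only to show the override.
properties = customize_properties({"Time": "ex:measuredAt"})
print(properties["Time"], properties["Unit"])
```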

Templating

Templating in the Dictionary Mapping (DM) table adopts the scheme used by RML and R2RML.

Essentially, in the Template column of the DM, it is possible to specify the format for the generated URI by specifying a template string and encompassing valid column name(s) within curly brackets: string-{col_name}.

The value in the curly brackets resolves to the value for that column in the current row.
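The substitution described above can be sketched with a regular expression that replaces each `{col_name}` placeholder with the value from the current row. The function name `resolve_template` and the sample row are hypothetical; a placeholder naming a column that is missing from the row raises a `KeyError` in this sketch.

```python
import re

# Matches a column name wrapped in curly brackets, e.g. "{id}".
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def resolve_template(template, row):
    """Replace every {col_name} in `template` with that column's value in `row`."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), template)


# Hypothetical data row keyed by column name.
row = {"id": "0042", "visit": "1"}
print(resolve_template("subject-{id}", row))
print(resolve_template("subject-{id}-visit-{visit}", row))
```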