Documentation

Welcome to the Semantic Data Dictionary wiki!

This wiki contains documentation on the Semantic Data Dictionary (SDD). This approach provides a way to annotate data such that entities in a dataset and their relationships can be accurately represented by encoding mappings to a background set of ontologies.

SDD Workflow

The Semantic Data Dictionary is a specification formalizing how to assign a semantic representation of data by annotating dataset variables and their values using concepts from best practice vocabularies and ontologies. It is a collection of individual documents that each play a role in creating a concise and consistent knowledge representation, including the Dictionary Mapping, Codebook, Timeline, and Code Mapping specifications, and the Infosheet, which is used to link these Semantic Data Dictionary elements together.

We implement the SDD as a collection of tabular data sheets which can be written in Excel or Comma Separated Value (CSV) files. In order to organize the collection of sheets in the SDD, we use the Infosheet, which contains information about the Semantic Data Dictionary data model being described, as well as the location of the other SDD tables.

Infosheet

The Infosheet is essentially the configuration document of the SDD structure. It is used to organize the SDD tables and contains information about the Semantic Data Dictionary, such as the name, identifier, or link to the documentation, in addition to the location of the other SDD tables. The SDD tables are usually a collection of CSV files that hold most of the information about the entities in the dataset and their relationships.

Infosheet Row	Related Property	Description	Example
Code Mapping		Reference to Code Mapping table location	http://…
Codebook		Reference to Codebook table location	http://…
Dictionary Mapping		Reference to Dictionary Mapping table location	http://…
Imports	owl:imports	Ontologies that the SDD references	http://semanticscience.org/ontology/sio-subset-labels.owl
Timeline		Reference to Timeline table location	http://…

The Infosheet should follow the Distribution Level Dataset Description based on the HCLS standards and the Data on the Web Best Practices.

Dictionary Mapping

The bulk of the annotation is done using the Dictionary Mapping (DM) table, which is used to annotate the columns of a given dataset. The DM table contains entries describing concepts explicit in the original dataset, as well as implicit entries. The explicit entries contain mappings to the underlying attribute that is described by a particular dataset column, as well as provenance information such as how that variable was generated or derived. Implicit entries are used to describe entities that are implicit within the dataset, usually related to one or more of the explicit entries in the dataset, such as the entity being measured, or the time at which a measurement was taken. These entities are readily recognized by human data users even though there is no column in the dataset that refers to them directly, but we must make them explicit for machines. These implicit entities can then be described with type, role, relation, and other information in the same manner as the explicit columns in the dataset. The SDD DM Specification is shown below.

DM Column	Related Property	Description
Attribute	rdf:type	Class of attribute entry
attributeOf	sio:isAttributeOf	Entity having the attribute
Column		Entry column header in dataset
Comment	rdfs:comment	Comment for the entry
Definition	skos:definition	Entry text definition
Entity	rdf:type	Class of entity entry
Format		Specifies the structure of the Unit value
inRelationTo	sio:inRelationTo	Entity that the role is linked to
Label	rdfs:label	Label for the entry
Property		Custom datatype property specification
Relation		Custom relation that replaces inRelationTo
Role	sio:hasRole	Type of the role of the entry
Time	sio:existsAt	Time point of measurement
Unit	sio:hasUnit	Unit of Measure for entry
wasDerivedFrom	prov:wasDerivedFrom	Entity from which the entry was derived
wasGeneratedBy	prov:wasGeneratedBy	Activity from which the entry was produced

The names of the explicit and implicit entries are stored in the Dictionary Mapping column called “Column”; for explicit entries, these are the column names in the dataset. Annotation properties including comments, labels, or definitions can be provided to describe an explicit or implicit entry in further detail for the human reader. If an entry describes a characteristic, the Attribute column should be populated with an appropriate class, and when appropriate the attributeOf column should be used to reference the entity to which the attribute belongs. Usually the attributeOf column contains the implicit entity for which the explicit entry is a characteristic.

For instance, using the example Dictionary Mapping provided at the end of this section, a dataset may contain a column called age. Although a human reader can easily understand that the age refers to the mother, the dataset often doesn’t state this explicitly; if we aim to create a full semantic representation of the data, this relation must be defined for the computer. However, since mother is not a column in the dataset, it is considered an implicit entry, denoted by the ‘??’ in front of the word mother.
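The ‘??’ convention is easy to check mechanically. The following is a minimal sketch, assuming entry names are plain strings taken from the Column field; the helper names `is_implicit` and `split_entries` are hypothetical and are not part of sdd2rdf.

```python
IMPLICIT_PREFIX = "??"  # marker used in the SDD for implicit entries


def is_implicit(entry_name: str) -> bool:
    """Return True when the entry name carries the '??' implicit-entry marker."""
    return entry_name.startswith(IMPLICIT_PREFIX)


def split_entries(entry_names):
    """Partition entry names into (explicit, implicit) lists, preserving order."""
    explicit = [name for name in entry_names if not is_implicit(name)]
    implicit = [name for name in entry_names if is_implicit(name)]
    return explicit, implicit


# Hypothetical helper usage on a few names from the example Dictionary Mapping.
explicit, implicit = split_entries(["age", "race", "??mother", "??child"])
print(explicit)
print(implicit)
```

Only the explicit names are expected to match column headers in the data file; the implicit names exist solely within the SDD tables.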

If an entry describes an object, an applicable class should be included in the Entity column, along with the role that differentiates it as a member of that object class. Using the same example, if we look at the row for the implicit entry mother (??mother), mother is not a characteristic of a human, but rather a type of human, the entity. Thus we add the class sio:Human to the Entity column, and the role that defines its type in relation to the entity (chear:Mother) to the Role column.

In general, for each row in the Dictionary Mapping, either the Entity or Attribute column should be populated with an appropriate class, but we must be careful where and how we classify the entry. For instance, if we had a column called Race it would be the attribute of a human, but a column called Caucasian would be related to the entity human. To decide which columns to fill, it can be useful to explore the class hierarchy of the relevant term.

If an entry references another item, that item should be stored in the inRelationTo column. By default, if both the Role and Relation columns are empty, the generated knowledge graph connects the main entry to the value in the inRelationTo column using the SIO property sio:inRelationTo. Both the Relation and Role columns can override this property to denote a custom relationship between the entry and that value. One common example of this is the SIO property isPartOf. If both columns are filled, the Role reference for that entry in the knowledge graph will be sio:hasRole. Note that the Role column can be used independently of the inRelationTo column.

The units of a given variable and the format of the data in the cell can be specified in the Unit and Format columns, respectively. For instance, if one of the entries is the age of an object, the Unit column should specify whether the age is stored in years, days, months, etc.

Time instances (events) or time intervals associated with an entry should be referenced in the Time column. Time intervals are not solitary occurrences but rather things that span a period of time. In our example, one such interval is ??visit1, which represents visits at any time during the first trimester of pregnancy. Entries in the Time column that are time intervals, not time instances, should also be noted in the Timeline.

Provenance information pertaining to how the variable was derived or generated can be included in the wasDerivedFrom and wasGeneratedBy columns, respectively. An example Dictionary Mapping from the CHEAR project is provided below.

Column	Attribute	attributeOf	Entity	Unit	Time	Role	inRelationTo	wasDerivedFrom
id	sio:Identifier	??child					??study
race	sio:Race	??mother
age	sio:Age	??mother		sio:Year	??visit1
edu	chear:EducationLevel	??mother			??visit2
insur	chear:InsuranceType	??mother			??visit3
urineam_3	sio:Quality	??sample3			??visit3
t1bmi	chear:BMI	??mother		kg/m2	??visit1			t1weight, ??height
t1weight	chear:Weight	??mother		kg	??visit1
smoke	chear:SmokingStatus	??mother			??pregnancy
birthwt	chear:Weight	??child		g	??birth
??height	sio:Height	??mother
??mother			sio:Human			chear:Mother	??child
??child			sio:Human			chear:Child	??mother
??birth			sio:Birthing				??child
??pregnancy			chear:PregnancyPeriod				??birth
??conception			chear:Conception				??child
??sample3			U					??mother
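The rule that each row should populate either the Attribute or the Entity column can be checked before running a conversion. The sketch below assumes the Dictionary Mapping is available as tab-separated text with the headers shown above; `rows_missing_class` is a hypothetical helper, and the final `notes` row is invented purely to show a failing case.

```python
import csv
import io

# Excerpt of the example table above (tab-separated). The last row, "notes",
# is a hypothetical, deliberately incomplete entry added for illustration.
DM_TSV = (
    "Column\tAttribute\tattributeOf\tEntity\n"
    "age\tsio:Age\t??mother\t\n"
    "??mother\t\t\tsio:Human\n"
    "notes\t\t\t\n"
)


def rows_missing_class(tsv_text):
    """Return the Column names of rows where both Attribute and Entity are empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row["Column"]
        for row in reader
        if not (row["Attribute"] or "").strip()
        and not (row["Entity"] or "").strip()
    ]


print(rows_missing_class(DM_TSV))
```

A check like this only confirms that a class is present; whether the class belongs in Attribute or Entity still needs the judgment described earlier in this section.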

Codebook

The Codebook table has a similar role to the Dictionary Mapping table: while the Dictionary Mapping encodes the meanings of the column headers in the dataset, the Codebook contains all the possible values for each column and their associated labels. For instance, if you had a dataset containing information about people, one of the columns might be gender, which would be the entry in the Dictionary Mapping, while the possible values for that column, male and female, would be entries in the Codebook.

Codebook Column	Related Property	Description
Class	rdf:type	Class the Code refers to
Code	sio:hasValue	Value of the dataset entry
Column		Entry column header in dataset
Comment	rdfs:comment	Comment for the codebook entry
Definition	skos:definition	Definition for the codebook entry
Label	rdfs:label	Label for the codebook entry
Resource	rdf:type	Resource URI the Code refers to

For variables with discrete values, when appropriate, we augment each possible value with mappings to corresponding concepts, as shown in the table below.

Column	Code	Label	Class
race	0	white	chear:White
race	1	black	chear:BlackOrAfricanAmerican
race	2	other	chear:OtherRace
edu	0	high school degree or less	chear:HighSchoolOrLess
edu	1	technical college or some college	chear:SomeCollege
edu	2	college graduate	chear:CollegeGraduate
edu	3	above	chear:AdvancedDegree
smoke	0	no smoking in pregnancy	chear:NonSmoker
smoke	1	some smoking in pregnancy	chear:Smoker
insur	0	private/hmo/self-pay	chear:NoPublicInsurance
insur	1	public	chear:PublicInsurance
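Conceptually, the Codebook acts as a lookup keyed on the pair (Column, Code). The following is a minimal sketch of that lookup using a few rows from the example table; the function names `build_codebook` and `annotate` are hypothetical, and treating every code as a string is an assumption made here for simplicity.

```python
# A few rows from the example Codebook above: (Column, Code, Label, Class).
CODEBOOK_ROWS = [
    ("race", "0", "white", "chear:White"),
    ("race", "1", "black", "chear:BlackOrAfricanAmerican"),
    ("smoke", "0", "no smoking in pregnancy", "chear:NonSmoker"),
    ("smoke", "1", "some smoking in pregnancy", "chear:Smoker"),
]


def build_codebook(rows):
    """Index codebook rows by (column, code) for direct lookup."""
    index = {}
    for column, code, label, cls in rows:
        index[(column, code)] = {"label": label, "class": cls}
    return index


def annotate(codebook, column, raw_value):
    """Look up a raw cell value; return None when the code is not listed."""
    return codebook.get((column, str(raw_value).strip()))


codebook = build_codebook(CODEBOOK_ROWS)
print(annotate(codebook, "smoke", 1))
print(annotate(codebook, "race", "7"))
```

Keying on the column as well as the code matters because the same code (for example 0) means something different in each column.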

Timeline

Customized time intervals can be specified in the Timeline sheet, which can be used to annotate the corresponding class and unit related to a given entry, as well as the start and end times of an event and a connection to concepts that the entry may be related to. When using the Timeline, ensure that each time entry in the table is a time interval rather than a time instance. For example, a birthday would not be in the Timeline; rather, it should be viewed as a characteristic of a subject. On the other hand, in the CHEAR study, the data tracks child development in terms of observations taken at specific times relative to the birth or conception of the child. Comparing measurements across subjects for a particular time such as “the second trimester of pregnancy” requires that we have a concept to describe this time interval, even though it will not necessarily fall during the same calendar week for any two subjects.

The Timeline Specification is shown below.

Timeline Column	Related Property	Description
End	sio:hasEndTime	End time associated with the timeline entry
inRelationTo	sio:inRelationTo	What the timeline entry is in relation to
Label	rdfs:label	Label for the timeline entry
Name		Reference to the virtual timeline entry
Start	sio:hasStartTime	Start time associated with the timeline entry
Type	rdf:type	Class of the timeline entry
Unit	sio:hasUnit	Unit of time

An example Timeline table is shown below.

Name	Label	Type	Start	End	Unit	inRelationTo
??visit1	Visit 1	chear:Visit	4.71	19.1	sio:Week	??conception
??visit2	Visit 2	chear:Visit	14.9	32.1	sio:Week	??conception
??visit3	Visit 3	chear:Visit	22.9	38.3	sio:Week	??conception
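To illustrate how such intervals behave, the sketch below loads the three example rows and asks which intervals contain a given number of weeks since conception. The `TimelineEntry` class and `intervals_containing` function are hypothetical names, and treating both Start and End as inclusive bounds is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEntry:
    name: str
    start: float
    end: float
    unit: str
    in_relation_to: str


# Rows copied from the example Timeline table above.
TIMELINE = [
    TimelineEntry("??visit1", 4.71, 19.1, "sio:Week", "??conception"),
    TimelineEntry("??visit2", 14.9, 32.1, "sio:Week", "??conception"),
    TimelineEntry("??visit3", 22.9, 38.3, "sio:Week", "??conception"),
]


def intervals_containing(timeline, value):
    """Return names of entries whose [start, end] range contains value.

    Inclusive bounds are an assumption made for this sketch.
    """
    return [entry.name for entry in timeline if entry.start <= value <= entry.end]


print(intervals_containing(TIMELINE, 16.0))
print(intervals_containing(TIMELINE, 2.0))
```

Because the example intervals overlap, a single point in time can fall inside more than one visit window, so the Timeline alone does not identify which visit a measurement belongs to; that link comes from the Time column of the Dictionary Mapping.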

Code Mappings

The Code Mappings table contains mappings of abbreviated terms or units to their corresponding ontology concepts.

This helps the annotator by allowing shorthand notations instead of repeatedly searching for the URI of an ontology class. An example set of code mappings is shown below.

code	uri	label
Pb	chebi:25016	Lead
S	uberon:0001977	Serum
cm	obo:UO_0000015	centimeter
kg	obo:UO_0000009	kilogram
kg/m2	obo:UO_0000086	kilogram per square meter
mgL	obo:UO_0000273	milligrams per liter

The set of code mappings used in the CHEAR project is useful for a variety of domains, and can be found on GitHub.
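The effect of a code mapping can be pictured as a simple substitution applied to cell values in the other SDD tables. The sketch below uses the example rows above; `resolve_code` is a hypothetical helper, and passing unmapped values through unchanged is an assumption about how shorthand and full terms might coexist.

```python
# code -> uri, copied from the example Code Mappings table above.
CODE_MAPPINGS = {
    "Pb": "chebi:25016",
    "S": "uberon:0001977",
    "cm": "obo:UO_0000015",
    "kg": "obo:UO_0000009",
    "kg/m2": "obo:UO_0000086",
    "mgL": "obo:UO_0000273",
}


def resolve_code(value, mappings=CODE_MAPPINGS):
    """Return the mapped term for a shorthand code, or the value unchanged."""
    return mappings.get(value, value)


print(resolve_code("kg/m2"))
print(resolve_code("sio:Year"))
```

With this behavior, a Unit cell may contain either a shorthand such as kg or a prefixed term such as sio:Year, as in the example Dictionary Mapping.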

Configuration

The config.ini file is the configuration file used by the sdd2rdf script. File locations written in this config file can be absolute paths or URLs, or paths relative to the location of the sdd2rdf.py script.

You may notice that the file addresses are stored in both the Infosheet and the configuration file. If the locations differ between the two files, the script will use those given in the Infosheet.

An example configuration file is shown below.

[Prefixes]
# Specify a file with the prefixes for existing ontologies used in your translation
prefixes = Cheese/config/prefixes.txt
# Specify the base uri to be associated with all triples minted by the script
base_uri = cheese-kb

[Source Files]
# Specify the location of the Dictionary Mapping file
dictionary = Cheese/input/DM/cheeseDM.csv
# Specify the location of the Codebook file
codebook = Cheese/input/CB/cheeseCB.csv
# Specify the location of the Timeline file
timeline = Cheese/input/TL/cheeseTL.csv
# Specify the location of the Code Mapping file
code_mappings = Cheese/config/code_mappings.csv
# Specify the location of the Data file
data_file = Cheese/input/Data/cheese.csv
# Specify the location of the Properties customization file
properties = Cheese/config/cheeseProperties.csv
# Specify the location of the Infosheet file
infosheet = Cheese/config/cheeseInfosheet.csv

[Output Files]
# Specify the location where the output RDF will be written to
out_file = Cheese/output/trig/cheese-kg.trig
# Specify the location where the output SPARQL query will be written to
query_file = Cheese/output/sparql/cheeseQ
# Specify the location where the output SWRL model will be written to
swrl_file = Cheese/output/swrl/cheeseSWRL
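Because the file uses standard INI syntax, it can be read with Python's built-in `configparser` module. The following is a minimal sketch using a shortened copy of the example above; it shows one way to read such a file and is not a description of how sdd2rdf itself loads it. The `load_config` helper is a hypothetical name.

```python
import configparser

# Shortened copy of the example configuration above.
CONFIG_TEXT = """
[Prefixes]
# Specify a file with the prefixes for existing ontologies used in your translation
prefixes = Cheese/config/prefixes.txt
base_uri = cheese-kb

[Source Files]
dictionary = Cheese/input/DM/cheeseDM.csv
infosheet = Cheese/config/cheeseInfosheet.csv

[Output Files]
out_file = Cheese/output/trig/cheese-kg.trig
"""


def load_config(text):
    """Parse INI text into a ConfigParser (section names may contain spaces)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


config = load_config(CONFIG_TEXT)
print(config["Prefixes"]["base_uri"])
print(config.get("Source Files", "dictionary"))
# Optional entries can be read with a fallback instead of raising an error.
print(config.get("Source Files", "timeline", fallback=None))
```

Reading optional keys with a fallback is convenient here, since not every dataset needs a Timeline or Code Mapping file.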

Prefixes

The prefixes.csv file is used to specify the namespace URIs for the prefixes used throughout the annotated SDD tables. An example prefix file is shown below.

prefix	url
np	http://www.nanopub.org/nschema#
owl	http://www.w3.org/2002/07/owl#
rdf	http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs	http://www.w3.org/2000/01/rdf-schema#
prov	http://www.w3.org/ns/prov#
xsd	http://www.w3.org/2001/XMLSchema#
uo	http://purl.obolibrary.org/obo/UO_
sio	http://semanticscience.org/resource/
stato	http://purl.obolibrary.org/obo/STATO_
example-kb	http://example.com/kb/example#

Note that the prefix that you include in the configuration file as the base URI should also be included in the prefixes file.
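The sketch below shows how a prefixed term can be expanded to a full URI using such a prefix table, and how the note about the base URI could be checked. The helper names `expand` and `base_uri_is_declared` are hypothetical, and the sketch assumes every term has the form prefix:local (a full URL passed to `expand` would be rejected as an unknown prefix).

```python
# A subset of the example prefix table above.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "prov": "http://www.w3.org/ns/prov#",
    "sio": "http://semanticscience.org/resource/",
    "example-kb": "http://example.com/kb/example#",
}


def expand(term, prefixes=PREFIXES):
    """Expand 'prefix:local' to a full URI; raise KeyError for unknown prefixes."""
    prefix, sep, local = term.partition(":")
    if not sep:
        raise ValueError(f"Not a prefixed term: {term!r}")
    return prefixes[prefix] + local


def base_uri_is_declared(base_uri, prefixes=PREFIXES):
    """Check that the base URI prefix named in the config has a namespace."""
    return base_uri in prefixes


print(expand("sio:Human"))
print(base_uri_is_declared("example-kb"))
print(base_uri_is_declared("cheese-kb"))
```

In this sketch the last check fails because cheese-kb, the base URI from the example configuration, does not appear in the example prefix table, which is exactly the situation the note above warns about.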

Property Customization

Customization of properties used in generating KG

The Semantic Data Dictionary approach creates a linked representation of the class or collection of datasets it describes.

The default model that sdd2rdf creates is based on the Semanticscience Integrated Ontology (SIO), which can be used to describe a wide variety of objects using a fixed set of terms.

The default model that we adopt further incorporates annotation properties from RDFS and SKOS, and provenance predicates from PROV-O.

The default set of properties is shown below.

Column	Property
Attribute	rdf:type
attributeOf	sio:isAttributeOf
Comment	rdfs:comment
Definition	skos:definition
Entity	rdf:type
inRelationTo	sio:inRelationTo
Label	rdfs:label
Role	sio:hasRole
Time	sio:existsAt
Unit	sio:hasUnit
Value	sio:hasValue
wasDerivedFrom	prov:wasDerivedFrom
wasGeneratedBy	prov:wasGeneratedBy

By specifying the properties associated with certain columns of the Dictionary Mapping table, the properties used in generating the knowledge graph can be customized.

This means that it is possible to use an alternate knowledge representation model, and thus makes this approach ontology agnostic.

Nevertheless, we urge the user to exercise caution when customizing the properties (for example, don’t replace an object property with a datatype property) to ensure that the resulting graph is semantically consistent.
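Customization can be thought of as overriding entries in the default column-to-property table. The sketch below is only an illustration of that idea: the `customize` function is hypothetical, the override `ex:observedAt` is an invented property, and the real format of the Properties customization file is not shown in this document.

```python
# Default column -> property table, copied from the list above.
DEFAULT_PROPERTIES = {
    "Attribute": "rdf:type",
    "attributeOf": "sio:isAttributeOf",
    "Comment": "rdfs:comment",
    "Definition": "skos:definition",
    "Entity": "rdf:type",
    "inRelationTo": "sio:inRelationTo",
    "Label": "rdfs:label",
    "Role": "sio:hasRole",
    "Time": "sio:existsAt",
    "Unit": "sio:hasUnit",
    "Value": "sio:hasValue",
    "wasDerivedFrom": "prov:wasDerivedFrom",
    "wasGeneratedBy": "prov:wasGeneratedBy",
}


def customize(overrides, defaults=DEFAULT_PROPERTIES):
    """Return a new mapping with overrides applied; reject unknown columns."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown column(s): {sorted(unknown)}")
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# "ex:observedAt" is a hypothetical property used only for illustration.
properties = customize({"Time": "ex:observedAt"})
print(properties["Time"])
print(properties["Unit"])
```

Rejecting unknown column names catches typos early, but it does not check the kind of property supplied, so the caution above about object versus datatype properties still applies.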

Templating

Templating in the Dictionary Mapping (DM) table adopts the scheme used by RML and R2RML.

Essentially, in the Template column of the DM, it is possible to specify the format of the generated URI by providing a template string with valid column name(s) enclosed in curly brackets: string-{col_name}.

The value in the curly brackets resolves to the value for that column in the current row.
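The substitution described above can be sketched with a regular expression. This is a minimal illustration of the string-{col_name} idea, not the sdd2rdf implementation; the `resolve_template` function, the template `mother-{id}`, and the sample row values are hypothetical, and the sketch does not handle escaped braces.

```python
import re

# Matches {col_name}, capturing the text between the curly brackets.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def resolve_template(template, row):
    """Replace each {col_name} with that column's value in the current row."""

    def substitute(match):
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Template references unknown column: {column}")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical data row, keyed by dataset column name.
row = {"id": "1042", "age": "31"}
print(resolve_template("mother-{id}", row))
```

A regular expression is used instead of `str.format` so that column names containing characters that are not valid in format fields are still treated as plain lookups.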