Welcome to the Semantic Data Dictionary wiki!
This wiki contains documentation on the Semantic Data Dictionary (SDD). This approach provides a way to annotate data such that entities in a dataset and their relationships can be accurately represented by encoding mappings to a background set of ontologies.
The Semantic Data Dictionary is a specification formalizing how to assign a semantic representation of data by annotating dataset variables and their values using concepts from best practice vocabularies and ontologies. It is a collection of individual documents that each play a role in creating a concise and consistent knowledge representation, including the Dictionary Mapping, Codebook, Timeline, and Code Mapping specifications, and the Infosheet, which is used to link these Semantic Data Dictionary elements together.
We implement the SDD as a collection of tabular data sheets, which can be written as Excel or comma-separated value (CSV) files. To organize the collection of sheets in the SDD, we use the Infosheet, which contains information about the Semantic Data Dictionary data model being described, as well as the location of the other SDD tables.
Infosheet
The Infosheet is essentially the configuration document of the SDD structure. It is used to organize the SDD tables and contains information about the Semantic Data Dictionary, such as its name, identifier, or a link to its documentation, in addition to the location of the other SDD tables. The SDD tables are usually a collection of CSV files that hold the majority of the information on the dataset's entities and their relationships.
Infosheet Row | Related Property | Description | Example |
---|---|---|---|
Code Mapping | | Reference to Code Mapping table location | http://… |
Codebook | | Reference to Codebook table location | http://… |
Dictionary Mapping | | Reference to Dictionary Mapping table location | http://… |
Imports | owl:imports | Ontologies that the SDD references | http://semanticscience.org/ontology/sio-subset-labels.owl |
Timeline | | Reference to Timeline table location | http://… |
The Infosheet should follow the Distribution Level Dataset Description based on the HCLS standards and the Data on the Web Best Practices.
Dictionary Mapping
The bulk of the annotation is done using the Dictionary Mapping (DM) table, which is used to annotate the columns of a given dataset. The DM table contains entries describing concepts explicit in the original dataset, as well as implicit entries. The explicit entries contain mappings to the underlying attribute that is described by a particular dataset column, as well as provenance information such as how that variable was generated or derived. Implicit entries are used to describe entities that are implicit within the dataset, usually related to one or more of the explicit entries in the dataset, such as the entity being measured, or the time at which a measurement was taken. These entities are readily recognized by human data users even though there is no column in the dataset that refers to them directly, but we must make them explicit for machines. These implicit entities can then be described with type, role, relation, and other information in the same manner as the explicit columns in the dataset. The SDD DM Specification is shown below.
DM Column | Related Property | Description |
---|---|---|
Attribute | rdf:type | Class of attribute entry |
attributeOf | sio:isAttributeOf | Entity having the attribute |
Column | | Entry column header in dataset |
Comment | rdfs:comment | Comment for the entry |
Definition | skos:definition | Entry text definition |
Entity | rdf:type | Class of entity entry |
Format | | Specifies the structure of the Unit value |
inRelationTo | sio:inRelationTo | Entity that the role is linked to |
Label | rdfs:label | Label for the entry |
Property | | Custom datatype property specification |
Relation | | Custom relation that replaces inRelationTo |
Role | sio:hasRole | Type of the role of the entry |
Time | sio:existsAt | Time point of measurement |
Unit | sio:hasUnit | Unit of Measure for entry |
wasDerivedFrom | prov:wasDerivedFrom | Entity from which the entry was derived |
wasGeneratedBy | prov:wasGeneratedBy | Activity from which the entry was produced |
The names of the explicit and implicit entries are stored in the “Column” column of the Dictionary Mapping table, which refers to the column names in the dataset. Annotation properties including comments, labels, or definitions can be provided to describe an explicit or implicit entry in further detail for the human reader. If an entry describes a characteristic, the Attribute column should be populated with an appropriate class, and when appropriate the attributeOf column should be used to reference the entity to which the attribute belongs. Usually the attributeOf column contains the implicit entity for which the explicit entry is a characteristic.
For instance, using the example below, a dataset may contain a column called age. Although it may be easy for a human to read and understand that the age refers to the mother, the dataset often doesn’t explicitly include this; if we aim to create a full semantic representation of the data, this relation must be defined for the computer. However, since mother is not a column in the dataset, it is considered to be an implicit entry, denoted by the ‘??’ in front of the word mother.
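The ‘??’ convention can be checked mechanically. The following is a minimal sketch, not taken from sdd2rdf; the helper names `is_implicit` and `split_entries` are hypothetical and only illustrate how the values of the Column field could be partitioned.

```python
def is_implicit(entry_name: str) -> bool:
    """Return True for implicit entries, which are written with a leading '??'."""
    return entry_name.strip().startswith("??")


def split_entries(column_names):
    """Partition Dictionary Mapping 'Column' values into explicit and implicit lists."""
    explicit = [name for name in column_names if not is_implicit(name)]
    implicit = [name for name in column_names if is_implicit(name)]
    return explicit, implicit


explicit, implicit = split_entries(["age", "race", "??mother", "??child"])
print(explicit)  # ['age', 'race']
print(implicit)  # ['??mother', '??child']
```

Explicit names are expected to match column headers in the dataset, while implicit names exist only inside the SDD tables.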
If an entry describes an object, an applicable class should be included in the Entity column, along with the role that differentiates it as a member of that object class. Using the same example, if we look at the row for the implicit entry for mother (??mother), mother is not a characteristic of a human, but rather a type of human, the entity. Thus we add the class sio:Human to the Entity column, and the role that defines its type in relation to the entity in the Role column (chear:Mother).
In general, for each row in the Dictionary Mapping, either the Entity or Attribute column should be populated with an appropriate class, but we must be careful where and how we classify the entry. For instance, if we had a column called Race it would be the attribute of a human, but a column called Caucasian would be related to the entity human. To understand which columns to fill, one may find it useful to explore the hierarchy of the relevant term.
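The rule that every row needs an Entity or an Attribute class lends itself to a simple check. The sketch below assumes the Dictionary Mapping is available as CSV with the column names from the specification; the function `rows_missing_class` and the `notes` row are hypothetical and included only to show a failing case.

```python
import csv
import io

# A small Dictionary Mapping excerpt; the 'notes' row is a made-up invalid entry.
DM_CSV = """Column,Attribute,attributeOf,Entity,Role
age,sio:Age,??mother,,
??mother,,,sio:Human,chear:Mother
notes,,,,
"""


def rows_missing_class(dm_rows):
    """Return the 'Column' names of rows where both Attribute and Entity are empty."""
    missing = []
    for row in dm_rows:
        has_attribute = bool((row.get("Attribute") or "").strip())
        has_entity = bool((row.get("Entity") or "").strip())
        if not (has_attribute or has_entity):
            missing.append(row["Column"])
    return missing


dm_rows = list(csv.DictReader(io.StringIO(DM_CSV)))
print(rows_missing_class(dm_rows))  # ['notes']
```

A check like this only confirms that a class is present; whether the class belongs in Entity or Attribute still requires looking at the term's hierarchy.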
In the case that an entry references another item, the object should be stored in the inRelationTo column. By default, if both the Role and Relation columns are empty, the generated knowledge graph connects the main entry to the value in the inRelationTo column using the SIO property sio:inRelationTo. Both the Relation and Role columns can override this property and denote a custom relationship between the entry and that value. One common example of this is the SIO property isPartOf. If both columns are filled, the Role reference for that entry in the knowledge graph will be sio:hasRole. Note that the Role column can be used independently of the inRelationTo column.
The units of a given variable and the format of the data in the cell can be specified in the Unit and Format columns, respectively. For instance, if one of the entries is the age of an object, the Unit column should specify whether the age is stored in years, days, months, etc.
Time instances (events) or time intervals associated with an entry should be referenced in the Time column. Time intervals are not solitary occurrences but rather things that span a period of time. In our example, one such interval is ??visit1, which represents visits at any time during the first trimester of pregnancy. Entries in the Time column that are time intervals, not time instances, should also be noted in the Timeline.
Provenance information pertaining to how the variable was derived or generated can be included in the wasDerivedFrom and wasGeneratedBy columns, respectively. An example Dictionary Mapping from the CHEAR project is provided below.
Column | Attribute | attributeOf | Entity | Unit | Time | Role | inRelationTo | wasDerivedFrom |
---|---|---|---|---|---|---|---|---|
id | sio:Identifier | ??child | | | | | ??study | |
race | sio:Race | ??mother | | | | | | |
age | sio:Age | ??mother | | sio:Year | ??visit1 | | | |
edu | chear:EducationLevel | ??mother | | | ??visit2 | | | |
insur | chear:InsuranceType | ??mother | | | ??visit3 | | | |
urineam_3 | sio:Quality | ??sample3 | | | ??visit3 | | | |
t1bmi | chear:BMI | ??mother | | kg/m2 | ??visit1 | | | t1weight, ??height |
t1weight | chear:Weight | ??mother | | kg | ??visit1 | | | |
smoke | chear:SmokingStatus | ??mother | | | ??pregnancy | | | |
birthwt | chear:Weight | ??child | | g | ??birth | | | |
??height | sio:Height | ??mother | | | | | | |
??mother | | | sio:Human | | | chear:Mother | ??child | |
??child | | | sio:Human | | | chear:Child | ??mother | |
??birth | | | sio:Birthing | | | | ??child | |
??pregnancy | | | chear:PregnancyPeriod | | | | ??birth | |
??conception | | | chear:Conception | | | | ??child | |
??sample3 | | | U | | | | ??mother | |
Codebook
The Codebook table has a similar role to the Dictionary Mapping table; while the Dictionary Mapping serves to encode the meanings of the column headers in the dataset, the Codebook contains all the possible values for each column and their associated labels. For instance, if you had a dataset that contained information about people, one of the possible columns may be gender, which would be the entry in the Dictionary Mapping, while the possible values for that column, male and female, would be entries in the Codebook.
Codebook Column | Related Property | Description |
---|---|---|
Class | rdf:type | Class the Code refers to |
Code | sio:hasValue | Value of the dataset entry |
Column | | Entry column header in dataset |
Comment | rdfs:comment | Comment for the codebook entry |
Definition | skos:definition | Definition for the codebook entry |
Label | rdfs:label | Label for the codebook entry |
Resource | rdf:type | Resource URI the Code refers to |
For variables with discrete values, when appropriate, we augment each possible value with mappings to corresponding concepts, as shown in the table below.
Column | Code | Label | Class |
---|---|---|---|
race | 0 | white | chear:White |
race | 1 | black | chear:BlackOrAfricanAmerican |
race | 2 | other | chear:OtherRace |
edu | 0 | high school degree or less | chear:HighSchoolOrLess |
edu | 1 | technical college or some college | chear:SomeCollege |
edu | 2 | college graduate | chear:CollegeGraduate |
edu | 3 | above | chear:AdvancedDegree |
smoke | 0 | no smoking in pregnancy | chear:NonSmoker |
smoke | 1 | some smoking in pregnancy | chear:Smoker |
insur | 0 | private/hmo/self-pay | chear:NoPublicInsurance |
insur | 1 | public | chear:PublicInsurance |
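To illustrate how such a Codebook could be applied to raw cell values, here is a minimal sketch. It assumes the Codebook is stored as CSV with the headers shown above; `build_codebook_index` and `resolve_code` are hypothetical helper names, not part of sdd2rdf.

```python
import csv
import io

# Excerpt of the example Codebook above, written as CSV.
CODEBOOK_CSV = """Column,Code,Label,Class
race,0,white,chear:White
race,1,black,chear:BlackOrAfricanAmerican
smoke,0,no smoking in pregnancy,chear:NonSmoker
smoke,1,some smoking in pregnancy,chear:Smoker
"""


def build_codebook_index(rows):
    """Index Codebook rows by (column name, code) so cell values can be resolved."""
    return {(row["Column"], row["Code"]): row for row in rows}


def resolve_code(index, column, raw_value):
    """Return the Class mapped to a raw cell value, or None when the value is not coded."""
    entry = index.get((column, str(raw_value).strip()))
    return entry["Class"] if entry else None


index = build_codebook_index(csv.DictReader(io.StringIO(CODEBOOK_CSV)))
print(resolve_code(index, "race", 1))     # chear:BlackOrAfricanAmerican
print(resolve_code(index, "smoke", "0"))  # chear:NonSmoker
print(resolve_code(index, "age", 31))     # None
```

Keying on the pair of column and code matters because the same code (for example 0) means something different in each column.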
Timeline
Customized time intervals can be specified in the Timeline sheet, which can be used to annotate the corresponding class and unit related to a given entry, as well as the start and end times of an event, and a connection to concepts that the entry may be related to. When using the Timeline, ensure that the time entry in the table is a time interval rather than a time instance. For example, a birthday would not be in the Timeline; rather, it should be viewed as a characteristic of a subject. On the other hand, in the CHEAR study, the data tracks child development in terms of observations taken at specific times relative to the birth or conception of the child. Comparing measurements across subjects for a particular time such as “the second trimester of pregnancy” requires that we have a concept to describe this time interval, even though it will not necessarily fall during the same calendar week for any two subjects.
The Timeline Specification is shown below.
Timeline Column | Related Property | Description |
---|---|---|
End | sio:hasEndTime | End time associated with the timeline entry |
inRelationTo | sio:inRelationTo | What the timeline entry is in relation to |
Label | rdfs:label | Label for the timeline entry |
Name | | Reference to the virtual timeline entry |
Start | sio:hasStartTime | Start time associated with the timeline entry |
Type | rdf:type | Class of the timeline entry |
Unit | sio:hasUnit | Unit of time |
An example Timeline table is shown below.
Name | Label | Type | Start | End | Unit | inRelationTo |
---|---|---|---|---|---|---|
??visit1 | Visit 1 | chear:Visit | 4.71 | 19.1 | sio:Week | ??conception |
??visit2 | Visit 2 | chear:Visit | 14.9 | 32.1 | sio:Week | ??conception |
??visit3 | Visit 3 | chear:Visit | 22.9 | 38.3 | sio:Week | ??conception |
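The rows above can be read as intervals measured in weeks relative to ??conception. The sketch below models them with a small dataclass and looks up which intervals contain a given offset; the class `TimelineEntry` and the function `intervals_containing` are hypothetical names, and treating both bounds as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEntry:
    name: str
    label: str
    entry_type: str
    start: float
    end: float
    unit: str
    in_relation_to: str

    def contains(self, offset: float) -> bool:
        """True when the offset lies within [start, end] (inclusive bounds assumed)."""
        return self.start <= offset <= self.end


# The three rows of the example Timeline table.
TIMELINE = [
    TimelineEntry("??visit1", "Visit 1", "chear:Visit", 4.71, 19.1, "sio:Week", "??conception"),
    TimelineEntry("??visit2", "Visit 2", "chear:Visit", 14.9, 32.1, "sio:Week", "??conception"),
    TimelineEntry("??visit3", "Visit 3", "chear:Visit", 22.9, 38.3, "sio:Week", "??conception"),
]


def intervals_containing(offset_weeks, timeline=TIMELINE):
    """Return the names of all timeline entries whose interval contains the offset."""
    return [entry.name for entry in timeline if entry.contains(offset_weeks)]


print(intervals_containing(16.0))  # ['??visit1', '??visit2']
print(intervals_containing(35.0))  # ['??visit3']
```

Because the example intervals overlap, a single offset can fall inside more than one visit window, which is why the lookup returns a list.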
Code Mappings
The Code Mappings table contains mappings of abbreviated terms or units to their corresponding ontology concepts.
This aids the annotator by allowing the use of shorthand notations instead of having to repeatedly search for the URI of the ontology class. An example set of code mappings is shown below.
code | uri | label |
---|---|---|
Pb | chebi:25016 | Lead |
S | uberon:0001977 | Serum |
cm | obo:UO_0000015 | centimeter |
kg | obo:UO_0000009 | kilogram |
kg/m2 | obo:UO_0000086 | kilogram per square meter |
mgL | obo:UO_0000273 | milligrams per liter |
The set of code mappings used in the CHEAR project is useful for a variety of domains, and can be found on GitHub.
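As a rough illustration of how shorthand codes could be expanded, the sketch below loads the example mappings into a dictionary. The function `expand_code` is a hypothetical helper, and passing unmapped values through unchanged is an assumption about the desired behavior rather than a description of sdd2rdf.

```python
# The example code mappings above: shorthand code -> ontology term.
CODE_MAPPINGS = {
    "Pb": "chebi:25016",
    "S": "uberon:0001977",
    "cm": "obo:UO_0000015",
    "kg": "obo:UO_0000009",
    "kg/m2": "obo:UO_0000086",
    "mgL": "obo:UO_0000273",
}


def expand_code(value, mappings=CODE_MAPPINGS):
    """Replace a shorthand code with its mapped term; leave anything else untouched."""
    return mappings.get(value.strip(), value)


print(expand_code("kg"))        # obo:UO_0000009
print(expand_code("kg/m2"))     # obo:UO_0000086
print(expand_code("sio:Year"))  # sio:Year
```

With such a lookup, a Unit cell in the Dictionary Mapping can hold `kg` while the generated graph still refers to the full ontology term.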
Configuration
The config.ini file is the configuration file used by the sdd2rdf script. File locations written in this config file can be absolute paths or URLs, as well as paths relative to the location of the sdd2rdf.py script.
You may notice that the file addresses are stored in both the Infosheet and the configuration file. If the locations differ between the two files, the script uses the Infosheet.
An example configuration file is shown below.
    [Prefixes]
    # Specify a file with the prefixes for existing ontologies used in your translation
    prefixes = Cheese/config/prefixes.txt
    # Specify the base uri to be associated with all triples minted by the script
    base_uri = cheese-kb

    [Source Files]
    # Specify the location of the Dictionary Mapping file
    dictionary = Cheese/input/DM/cheeseDM.csv
    # Specify the location of the Codebook file
    codebook = Cheese/input/CB/cheeseCB.csv
    # Specify the location of the Timeline file
    timeline = Cheese/input/TL/cheeseTL.csv
    # Specify the location of the Code Mapping file
    code_mappings = Cheese/config/code_mappings.csv
    # Specify the location of the Data file
    data_file = Cheese/input/Data/cheese.csv
    # Specify the location of the Properties customization file
    properties = Cheese/config/cheeseProperties.csv
    # Specify the location of the Infosheet file
    infosheet = Cheese/config/cheeseInfosheet.csv

    [Output Files]
    # Specify the location where the output RDF will be written to
    out_file = Cheese/output/trig/cheese-kg.trig
    # Specify the location where the output SPARQL query will be written to
    query_file = Cheese/output/sparql/cheeseQ
    # Specify the location where the output SWRL model will be written to
    swrl_file = Cheese/output/swrl/cheeseSWRL
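A file in this format can be read with Python's standard `configparser` module. The sketch below parses an excerpt of the example configuration and applies the rule that the Infosheet wins when the two disagree. It is not the sdd2rdf implementation; `load_config`, `resolve_location`, the `infosheet_locations` dictionary, and the example.org URL are all hypothetical.

```python
import configparser

# Excerpt of the example configuration above.
CONFIG_TEXT = """
[Prefixes]
prefixes = Cheese/config/prefixes.txt
base_uri = cheese-kb

[Source Files]
# Specify the location of the Dictionary Mapping file
dictionary = Cheese/input/DM/cheeseDM.csv
codebook = Cheese/input/CB/cheeseCB.csv
infosheet = Cheese/config/cheeseInfosheet.csv

[Output Files]
out_file = Cheese/output/trig/cheese-kg.trig
"""


def load_config(text):
    """Parse INI-style configuration text into a ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


def resolve_location(key, config, infosheet_locations):
    """Prefer the location given by the Infosheet; fall back to the config file."""
    return infosheet_locations.get(key) or config["Source Files"].get(key)


config = load_config(CONFIG_TEXT)
# Hypothetical locations read from an Infosheet.
infosheet_locations = {"dictionary": "http://example.org/sdd/cheeseDM.csv"}

print(config["Prefixes"]["base_uri"])                               # cheese-kb
print(resolve_location("dictionary", config, infosheet_locations))  # http://example.org/sdd/cheeseDM.csv
print(resolve_location("codebook", config, infosheet_locations))    # Cheese/input/CB/cheeseCB.csv
```

Lines beginning with `#` are treated as comments by `configparser`, so the explanatory comments in the configuration file do not need to be removed before parsing.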
Prefixes
The prefixes.csv file is used to specify the namespace URIs for the prefixes used throughout the annotated SDD tables. An example prefix file is shown below.
prefix | url |
---|---|
np | http://www.nanopub.org/nschema# |
owl | http://www.w3.org/2002/07/owl# |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# |
rdfs | http://www.w3.org/2000/01/rdf-schema# |
prov | http://www.w3.org/ns/prov# |
xsd | http://www.w3.org/2001/XMLSchema# |
uo | http://purl.obolibrary.org/obo/UO_ |
sio | http://semanticscience.org/resource/ |
stato | http://purl.obolibrary.org/obo/STATO_ |
example-kb | http://example.com/kb/example# |
Note that the prefix that you include in the configuration file as the base URI should also be included in the prefixes file.
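To show what the prefix table is used for, the sketch below expands prefixed terms into full IRIs and checks that a base URI prefix is registered. It uses a subset of the example table; `expand_curie` and `base_uri_is_registered` are hypothetical helper names and not part of sdd2rdf.

```python
# Subset of the example prefix table above.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "prov": "http://www.w3.org/ns/prov#",
    "sio": "http://semanticscience.org/resource/",
    "example-kb": "http://example.com/kb/example#",
}


def expand_curie(term, prefixes=PREFIXES):
    """Expand a 'prefix:local' term into a full IRI using the prefix table."""
    prefix, separator, local_name = term.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"No namespace registered for term {term!r}")
    return prefixes[prefix] + local_name


def base_uri_is_registered(base_uri, prefixes=PREFIXES):
    """Check that the base URI prefix named in the config also appears in the prefix table."""
    return base_uri in prefixes


print(expand_curie("sio:hasUnit"))           # http://semanticscience.org/resource/hasUnit
print(expand_curie("prov:wasDerivedFrom"))   # http://www.w3.org/ns/prov#wasDerivedFrom
print(base_uri_is_registered("example-kb"))  # True
```

Raising an error for an unregistered prefix makes a missing row in the prefixes file visible early, instead of producing malformed IRIs in the output.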
Property Customization
Customization of properties used in generating KG
The Semantic Data Dictionary approach creates a linked representation of the class or collection of datasets it describes.
The default model that sdd2rdf creates is based on the Semanticscience Integrated Ontology (SIO), which can be used to describe a wide variety of objects using a fixed set of terms.
The default model that we adopt further incorporates annotation properties from RDFS and SKOS, and provenance predicates from PROV-O.
The default set of properties is shown below.
Column | Property |
---|---|
Attribute | rdf:type |
attributeOf | sio:isAttributeOf |
Comment | rdfs:comment |
Definition | skos:definition |
Entity | rdf:type |
inRelationTo | sio:inRelationTo |
Label | rdfs:label |
Role | sio:hasRole |
Time | sio:existsAt |
Unit | sio:hasUnit |
Value | sio:hasValue |
wasDerivedFrom | prov:wasDerivedFrom |
wasGeneratedBy | prov:wasGeneratedBy |
By specifying the properties associated with certain columns of the Dictionary Mapping table, the properties used in generating the knowledge graph can be customized.
This means that it is possible to use an alternate knowledge representation model, and thus makes this approach ontology agnostic.
Nevertheless, we urge the user to practice caution (for example, don’t replace an object property with a datatype property) when customizing the properties used to ensure that the resulting graph is semantically consistent.
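One way to think about customization is as a set of overrides layered on top of the default table. The sketch below assumes the overrides have already been read into a dictionary; the layout of the properties file itself is not described here, so no parsing is shown. The function `customize_properties` and the property `ex:describes` are hypothetical, and rejecting unknown column names is a design choice of this sketch.

```python
# Default column-to-property assignments from the table above.
DEFAULT_PROPERTIES = {
    "Attribute": "rdf:type",
    "attributeOf": "sio:isAttributeOf",
    "Comment": "rdfs:comment",
    "Definition": "skos:definition",
    "Entity": "rdf:type",
    "inRelationTo": "sio:inRelationTo",
    "Label": "rdfs:label",
    "Role": "sio:hasRole",
    "Time": "sio:existsAt",
    "Unit": "sio:hasUnit",
    "Value": "sio:hasValue",
    "wasDerivedFrom": "prov:wasDerivedFrom",
    "wasGeneratedBy": "prov:wasGeneratedBy",
}


def customize_properties(overrides, defaults=DEFAULT_PROPERTIES):
    """Return a new mapping in which the overrides replace the default properties."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise ValueError(f"Unknown Dictionary Mapping columns: {sorted(unknown)}")
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# 'ex:describes' is a made-up object property used only for illustration.
properties = customize_properties({"attributeOf": "ex:describes"})
print(properties["attributeOf"])  # ex:describes
print(properties["Unit"])         # sio:hasUnit
```

A merge like this does not check whether the replacement is an object property or a datatype property, so the caution above still applies.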
Templating
Templating in the Dictionary Mapping (DM) table adopts the scheme used by RML and R2RML.
Essentially, in the Template column of the DM, it is possible to specify the format for the generated URI by specifying a template string and encompassing valid column name(s) within curly brackets: string-{col_name}.
The value in the curly brackets resolves to the value for that column in the current row.
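The substitution can be sketched in a few lines. This is an illustration of the idea rather than the sdd2rdf implementation; `render_template` is a hypothetical function, the row values are made up, and no URI escaping is applied to the substituted values.

```python
import re

# Matches a column name enclosed in curly brackets, e.g. {col_name}.
PLACEHOLDER = re.compile(r"\{([^{}]+)\}")


def render_template(template, row):
    """Replace each {column} placeholder with that column's value in the current row."""
    def substitute(match):
        column = match.group(1)
        if column not in row:
            raise KeyError(f"Template refers to unknown column {column!r}")
        return str(row[column])

    return PLACEHOLDER.sub(substitute, template)


# A made-up data row keyed by column name.
row = {"id": "1001", "visit": "3"}
print(render_template("subject-{id}", row))         # subject-1001
print(render_template("sample-{id}-{visit}", row))  # sample-1001-3
```

Failing on an unknown column name, rather than leaving the placeholder in place, helps catch typos in the Template column before any URIs are minted.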