Data from scientific studies are published in datasets, typically accompanied by data dictionaries and codebooks that support data understanding. The data acquisition methods may also be described in additional documentation to support reproducibility. To conduct rigorous analysis, data users need to leverage this documentation to interpret the data correctly. This process can be burdensome for new data users and is error-prone even for seasoned ones. A formal computational model of the knowledge that was used to create the study can facilitate better understanding, and thus better use, of the study data. Knowledge graphs can be used effectively to capture this study knowledge.

This tutorial aims to introduce participants to the basics of knowledge graph construction using data, data dictionaries, and codebooks from scientific studies. It will use data from the Centers for Disease Control and Prevention's (CDC) National Health and Nutrition Examination Survey (NHANES) as a testbed and introduce standardized terminology, novel and established techniques, and resources such as scientific/biomedical ontologies, semantic data dictionaries, and knowledge graph frameworks in both lecture and practical sessions. By the end of the tutorial, participants will have created a small knowledge graph that can be queried to retrieve study knowledge and data.
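To make the idea concrete, here is a minimal sketch of how a data dictionary and a codebook might be combined with study data to produce a small graph of subject-predicate-object triples. It is an illustration only, not the method or tooling taught in the tutorial: the `http://example.org/scikg/` namespace, the `build_graph` and `describe` helpers, and all data values are assumptions made up for this example. The column names are modeled on NHANES demographic variables, but their descriptions and codes should be checked against the official codebook.

```python
# Hypothetical namespace used only for this illustration.
EX = "http://example.org/scikg/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Toy data dictionary: column name -> human-readable description (assumed).
data_dictionary = {
    "RIDAGEYR": "Age in years at screening",
    "RIAGENDR": "Gender",
}

# Toy codebook: column name -> {coded value: meaning} (assumed).
codebook = {"RIAGENDR": {"1": "Male", "2": "Female"}}

# Made-up rows; SEQN plays the role of the participant identifier.
rows = [
    {"SEQN": "1001", "RIDAGEYR": "34", "RIAGENDR": "2"},
    {"SEQN": "1002", "RIDAGEYR": "61", "RIAGENDR": "1"},
]


def build_graph(data_dictionary, codebook, rows, id_column="SEQN"):
    """Return a set of (subject, predicate, object) triples."""
    graph = set()
    # Describe each variable once, using its data dictionary entry.
    for column, description in data_dictionary.items():
        graph.add((EX + column, RDF_TYPE, EX + "Variable"))
        graph.add((EX + column, RDFS_LABEL, description))
    # Attach each value to its participant, decoding coded values
    # through the codebook when an entry exists.
    for row in rows:
        participant = EX + "participant/" + row[id_column]
        graph.add((participant, RDF_TYPE, EX + "Participant"))
        for column in data_dictionary:
            raw = row[column]
            value = codebook.get(column, {}).get(raw, raw)
            graph.add((participant, EX + column, value))
    return graph


def describe(graph, subject):
    """Return {predicate: object} for one subject (one object per predicate here)."""
    return {p: o for s, p, o in graph if s == subject}


graph = build_graph(data_dictionary, codebook, rows)
participant = describe(graph, EX + "participant/1001")
print(participant[EX + "RIAGENDR"], participant[EX + "RIDAGEYR"])
```

The point of the sketch is that the decoded value and the variable description live in the same graph as the data, so a user can retrieve both without consulting separate documentation. A real knowledge graph would additionally link variables to ontology terms and use an RDF library and triple store rather than a Python set.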

Tutorial overview

The SciKG tutorial will be divided into four sections. It will start with an overview of how scientific study data is usually acquired, organized, and published, and of the current challenges (and opportunities for the Semantic Web) in using this data. Next, it will introduce methods for scientific data annotation and terminology reuse. It will then give an overview of state-of-the-art scientific and biomedical ontologies and provide real-world examples of their successful adoption. Finally, the tutorial will introduce knowledge graph frameworks and demonstrate how they can be used to bootstrap and manage scientific knowledge graphs (KGs).

Section 1: Studies, Data, and Documentation

Section 2: Scientific and Biomedical Ontologies

Section 3: Semantic Data Dictionaries

Section 4: Knowledge Graph Frameworks

Program

Time (EEST/UTC+3) | Event
9:00 - 10:30 | Part 1: Studies, Data, and Documentation
10:30 - 11:00 | Break
11:00 - 12:30 | Part 2: Scientific and Biomedical Ontologies
12:30 - 14:00 | Lunch
14:00 - 15:30 | Part 3: Semantic Data Dictionaries
15:30 - 16:00 | Break
16:00 - 18:00 | Part 4: Knowledge Graph Frameworks

Material

Slides
Intro
Part 1: Studies, Data, and Documentation
Part 2: Scientific and Biomedical Ontologies
HAScO Ontology
Part 3: Semantic Data Dictionaries
Part 4: Knowledge Graph Frameworks (no slides, hands-on content)

Published proceedings

Proceedings Paper
ESWC 2023 Workshops and Tutorials Joint Proceedings PDF

Organizers

Dr. Henrique Santos, Rensselaer Polytechnic Institute

Dr. Santos is the Director of Semantic Applications Research in the Tetherless World Constellation at Rensselaer Polytechnic Institute. His research focuses on knowledge representation, domain-specific reasoning, and explainable artificial intelligence. Dr. Santos has taught undergraduate courses in Artificial Intelligence and Algorithms and has presented several works at renowned conferences, including the International Semantic Web Conference and the Extended Semantic Web Conference.

Dr. Paulo Pinheiro, Parcela Semântica

Dr. Pinheiro is a seasoned data scientist and software engineer managing projects at the frontier between artificial intelligence and databases. He has twenty years of hands-on experience developing data and knowledge management software, including in-depth knowledge of data standards, data policies, and information assurance. Dr. Pinheiro has extensive teaching experience, including past professorship appointments.

Dr. Jamie P. McCusker, Rensselaer Polytechnic Institute

Dr. McCusker works on Biomedical Semantics. Her current interests are data and provenance interoperability in life sciences. She has worked as a software developer for 11 years in bioinformatics, high-performance computing, data mining, natural language processing, and supply chain auditing. Dr. McCusker has taught numerous courses and tutorials on knowledge graphs, ontology engineering, data science, and semantic science.

Dr. James Masters, Icahn School of Medicine at Mount Sinai

Dr. Masters is the Director of Research Data Services at the Icahn School of Medicine at Mount Sinai and manages the Harmonized Data Repository of the Human Health Analysis Resource Data Center. The Harmonized Data Repository is a knowledge graph built using the HADatAc platform, biomedical ontologies, and the methodologies to be discussed in this tutorial. Before joining the Icahn School of Medicine, Dr. Masters developed and ran several training courses on Semantic Web applications for users and other stakeholders in the financial services industry.

Sabbir M. Rashid, Rensselaer Polytechnic Institute

Rashid is a Ph.D. candidate at Rensselaer Polytechnic Institute whose research covers data annotation and harmonization, ontology engineering, knowledge representation, and various forms of reasoning. His graduate studies have involved the semantic annotation and transformation of data using Semantic Data Dictionaries, applied to deductive and abductive inference techniques over linked health data, for example in the context of chronic diseases such as diabetes.

Prof. Deborah L. McGuinness, Rensselaer Polytechnic Institute

Prof. McGuinness is the Tetherless World Senior Constellation Chair and Professor of Computer and Cognitive Science. She is widely known for her leading role in the development of the W3C Recommended Web Ontology Language (OWL), for her work on earlier description logic languages and environments, and for her work on interdisciplinary semantic data resources and on provenance languages and environments such as InferenceWeb, PML, and PROV.

Acknowledgements

This tutorial is partially funded by the following projects: