Data from scientific studies are published in datasets, typically accompanied by data dictionaries and codebooks to support data understanding. The data acquisition methods may also be described in additional documentation to support reproducibility. To conduct rigorous analysis, data users must leverage this documentation to interpret the data correctly. While this process can be burdensome for new data users, it is also prone to errors even for seasoned users. A computational formal model of the knowledge that was used to create the study can facilitate better understanding and thus improved usage of the study data. Knowledge graphs can be used effectively to capture this study’s knowledge.
This tutorial introduces participants to the basics of knowledge graph construction using data, data dictionaries, and codebooks from scientific studies. It will use the Center for Disease Control and Prevention’s (CDC) National Health and Nutrition Examination Surveys (NHANES) data as a testbed and introduce standardized terminology, novel and established techniques, and resources such as scientific/biomedical ontologies, semantic data dictionaries, and knowledge graph frameworks in both lecture and practical sessions. By the end of the tutorial, participants will have created a small knowledge graph that can be accessed to retrieve study knowledge and data.
Contents
Tutorial overview
The SciKG tutorial will be divided into four sections. It will start with an overview of how scientific study data is usually acquired, organized and published, and the current challenges (and opportunities for semantic web) involving the use of this data. Next, we introduce methods for scientific data annotation and terminology reuse. Following, we will give an overview of the state-of-the-art scientific and biomedical ontologies and provide real-world examples of their successful adoption. Finally, the tutorial will introduce knowledge graph frameworks and demonstrate how they can be used to bootstrap and manage scientific KGs.
Section 1: Studies, Data, and Documentation
- Scientific studies and their data acquisition activities
- Scientific data organization
- Scientific data publishing
- Documentation
- Data dictionaries
- Codebooks
- Methods
- National Health and Nutrition Examination Surveys (NHANES)
- Semantics of NHANES data
Section 2: Scientific and Biomedical Ontologies
- The role of standardized terminology in science
- Semanticscience Integrated Ontology (SIO)
- Human-Aware Science Ontology (HAScO)
- Disease Ontology (DOID)
- Chemical Elements of Biological Interest (ChEBI)
Section 3: Semantic Data Dictionaries
- Introduction to Semantic Data Dictionaries (SDDs)
- Structure
- Examples
- Practical section with NHANES datasets
Section 4: Knowledge Graph Frameworks
- The Human-Aware Data Acquisition Framework (HADatAc)
- Whyis
- Using SDDs to bootstrap Knowledge graphs
- Using SPARQL to retrieve study data
Program
Time (EEST/UTC+3) | Event |
---|---|
9:00 - 10:30 | Part 1: Studies, Data, and Documentation |
10:30 - 11:00 | Break |
11:00 - 12:30 | Part 2: Scientific and Biomedical Ontologies |
12:30 - 14:00 | Lunch |
14:00 - 15:30 | Part 3: Semantic Data Dictionaries |
15:30 - 16:00 | Break |
16:00 - 18:00 | Part 4: Knowledge Graph Frameworks |
Material
Slides |
---|
Intro |
Part 1: Studies, Data, and Documentation |
Part 2: Scientific and Biomedical Ontologies HAScO Ontology |
Part 3: Semantic Data Dictionaries |
Part 4: Knowledge Graph Frameworks (no slides, hands-on content) |
Published proceedings
Proceedings | Paper |
---|---|
ESWC 2023 Workshops and Tutorials Joint Proceedings |
Organizers
Dr. Henrique Santos, Rensselaer Polytechnic Institute
Dr. Santos is the Director of Semantic Applications Research in the Tetherless World Constellation at Rensselaer Polytechnic Institute. His research focuses on knowledge representation, domain-specific reasoning, and explainable artificial intelligence. Dr. Santos has lectured undergraduate courses in Artificial Intelligence and Algorithms and has presented several works at renowned conferences, including the International Semantic Web Conference, and the Extended Semantic Web Conference.
Dr. Paulo Pinheiro, Parcela Semântica
Dr. Pinheiro is a seasoned data scientist and software engineer managing projects at the frontier between artificial intelligence and databases. He has twenty years of hands-on experience with data and knowledge management software development, including in-depth knowledge of data standards, data policies, and information assurance. Dr. Pinheiro has extensive teaching experience including past professorship appointments.
Dr. Jamie P. McCusker, Rensselaer Polytechnic Institute
Dr. McCusker works on Biomedical Semantics. Her current interests are data and provenance interoperability in life sciences. She has worked as a software developer for 11 years in bioinformatics, high-performance computing, data mining, natural language processing, and supply chain auditing. Dr. McCusker has taught numerous courses and tutorials on knowledge graphs, ontology engineering, data science, and semantic science.
Dr. James Masters, Icahn School of Medicine at Mount Sinai
Dr. Masters is the Director of Research Data Services at the Icahn School of Medicine at Mount Sinai and manages the Harmonized Data Repository of the Human Health Analysis Resource Data Center. The Harmonized Data Repository is a knowledge graph built using the HADatAc platform, biomedical ontologies, and the methodologies to be discussed in this tutorial. Before joining Icahn School of Medicine, Dr. Masters developed and run several training courses on Semantic Web applications for users and other stakeholders in the financial services industry.
Sabbir M. Rashid, Rensselaer Polytechnic Institute
Rashid is a Ph.D. candidate at Rensselaer Polytechnic Institute on research related to data annotation and harmonization, ontology engineering, knowledge representation, and various forms of reasoning. His graduate studies have involved research related to the semantic annotation and transformation of data using Semantic Data Dictionaries, applied to deductive and abductive inference techniques over linked health data, such as in the context of chronic diseases like diabetes.
Prof. Deborah L. McGuinness, Rensselaer Polytechnic Institute
Prof. McGuinness is the Tetherless World Senior Constellation Chair and Professor of Computer and Cognitive Science. Prof. McGuinness is also widely known for her leading role in the development of the W3C Recommended Web Ontology Language (OWL), her work on earlier description logic languages and environments, including interdisciplinary semantic data resources, provenance languages, and environments, such as InferenceWeb, PML, and PROV.
Acknowledgements
This tutorial is partially funded by the following projects:
- NIEHS-funded Children’s Health Exposure Analysis Resource (CHEAR), project number 1U2CES026555-01
- NIEHS-funded Human Health Exposure Analysis Resource (HHEAR), project number 5U2CES026555-05
- NIMH-funded Support for the RCADS Data Collection Measure, project number 75N95022C00018-0-9999-1
- IBM-funded Health Empowerment by Analytics, Learning, and Semantics (HEALS) through the AI Horizons Network program
- DARPA-funded Machine Common Sense (MCS), grant number N660011924033
- DARPA-funded Environment-driven Conceptual Learning (ECOLE), grant number HR00112390059
- IARPA-funded Human Interpretable Attribution of Text using Underlying Structure (HIATUS), grant number 2022-22072200002
- NSF-funded Nanomine, award number 1640840
- NSF-funded MaterialsMine, award number 1835648