Clustering rare diseases within an ontology-enriched knowledge graph

OBJECTIVE: Identifying sets of rare diseases with shared aspects of etiology and pathophysiology may enable drug repurposing and/or platform-based therapeutic development. Toward that aim, we utilized an integrative knowledge graph-based approach to constructing clusters of rare diseases. MATERIALS...

Bibliographic Details
Main Authors: Sanjak, Jaleal, Zhu, Qian, Mathé, Ewy A.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9949046/
https://www.ncbi.nlm.nih.gov/pubmed/36824742
http://dx.doi.org/10.1101/2023.02.15.528673
_version_ 1784892899436003328
author Sanjak, Jaleal
Zhu, Qian
Mathé, Ewy A.
author_facet Sanjak, Jaleal
Zhu, Qian
Mathé, Ewy A.
author_sort Sanjak, Jaleal
collection PubMed
description OBJECTIVE: Identifying sets of rare diseases with shared aspects of etiology and pathophysiology may enable drug repurposing and/or platform-based therapeutic development. Toward that aim, we utilized an integrative knowledge graph-based approach to constructing clusters of rare diseases. MATERIALS AND METHODS: Data on 3,242 rare diseases were extracted from the National Center for Advancing Translational Sciences (NCATS) Genetic and Rare Diseases Information Center (GARD) internal data resources. The rare disease data were enriched with additional biomedical data, including gene and phenotype ontologies, biological pathway data and small molecule-target activity data, to create a knowledge graph (KG). Node embeddings were used to convert nodes into vectors upon which k-means clustering was applied. We validated the disease clusters through semantic similarity and feature enrichment analysis. RESULTS: A node embedding model was trained on the ontology-enriched rare disease KG and k-means clustering was applied to the embedding vectors, resulting in 37 disease clusters with a mean size of 87 diseases. We validated the disease clusters quantitatively by examining the semantic similarity of clustered diseases, using the Orphanet Rare Disease Ontology. In addition, the clusters were analyzed for enrichment of associated genes, revealing that the enriched genes within clusters were highly related. DISCUSSION: We demonstrate that node embeddings are an effective method for clustering diseases within a heterogeneous KG. Semantically similar diseases and relevant enriched genes have been uncovered within the clusters. Connections between disease clusters and approved or investigational drugs are enumerated for follow-up efforts. CONCLUSION: Our study lays out a method for clustering rare diseases using graph node embeddings. We develop an easy-to-maintain pipeline that can be updated when new data on rare diseases emerge. The embeddings themselves can be paired with other representation learning methods for other data types, such as drugs, to address other predictive modeling problems. Detailed subnetwork analysis and in-depth review of individual clusters may lead to translatable findings. Future work will focus on incorporation of additional data sources, with a particular focus on common disease data.
format Online
Article
Text
id pubmed-9949046
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-9949046 2023-02-24 Clustering rare diseases within an ontology-enriched knowledge graph Sanjak, Jaleal Zhu, Qian Mathé, Ewy A. bioRxiv Article OBJECTIVE: Identifying sets of rare diseases with shared aspects of etiology and pathophysiology may enable drug repurposing and/or platform-based therapeutic development. Toward that aim, we utilized an integrative knowledge graph-based approach to constructing clusters of rare diseases. MATERIALS AND METHODS: Data on 3,242 rare diseases were extracted from the National Center for Advancing Translational Sciences (NCATS) Genetic and Rare Diseases Information Center (GARD) internal data resources. The rare disease data were enriched with additional biomedical data, including gene and phenotype ontologies, biological pathway data and small molecule-target activity data, to create a knowledge graph (KG). Node embeddings were used to convert nodes into vectors upon which k-means clustering was applied. We validated the disease clusters through semantic similarity and feature enrichment analysis. RESULTS: A node embedding model was trained on the ontology-enriched rare disease KG and k-means clustering was applied to the embedding vectors, resulting in 37 disease clusters with a mean size of 87 diseases. We validated the disease clusters quantitatively by examining the semantic similarity of clustered diseases, using the Orphanet Rare Disease Ontology. In addition, the clusters were analyzed for enrichment of associated genes, revealing that the enriched genes within clusters were highly related. DISCUSSION: We demonstrate that node embeddings are an effective method for clustering diseases within a heterogeneous KG. Semantically similar diseases and relevant enriched genes have been uncovered within the clusters. Connections between disease clusters and approved or investigational drugs are enumerated for follow-up efforts. CONCLUSION: Our study lays out a method for clustering rare diseases using graph node embeddings. We develop an easy-to-maintain pipeline that can be updated when new data on rare diseases emerge. The embeddings themselves can be paired with other representation learning methods for other data types, such as drugs, to address other predictive modeling problems. Detailed subnetwork analysis and in-depth review of individual clusters may lead to translatable findings. Future work will focus on incorporation of additional data sources, with a particular focus on common disease data. Cold Spring Harbor Laboratory 2023-02-16 /pmc/articles/PMC9949046/ /pubmed/36824742 http://dx.doi.org/10.1101/2023.02.15.528673 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
spellingShingle Article
Sanjak, Jaleal
Zhu, Qian
Mathé, Ewy A.
Clustering rare diseases within an ontology-enriched knowledge graph
title Clustering rare diseases within an ontology-enriched knowledge graph
title_full Clustering rare diseases within an ontology-enriched knowledge graph
title_fullStr Clustering rare diseases within an ontology-enriched knowledge graph
title_full_unstemmed Clustering rare diseases within an ontology-enriched knowledge graph
title_short Clustering rare diseases within an ontology-enriched knowledge graph
title_sort clustering rare diseases within an ontology-enriched knowledge graph
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9949046/
https://www.ncbi.nlm.nih.gov/pubmed/36824742
http://dx.doi.org/10.1101/2023.02.15.528673
work_keys_str_mv AT sanjakjaleal clusteringrarediseaseswithinanontologyenrichedknowledgegraph
AT zhuqian clusteringrarediseaseswithinanontologyenrichedknowledgegraph
AT matheewya clusteringrarediseaseswithinanontologyenrichedknowledgegraph
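
Illustrative note: the abstract in this record describes converting knowledge graph nodes into embedding vectors and then applying k-means clustering to those vectors. The record does not say which embedding model or k-means implementation the authors used, so the following is only a minimal sketch of the clustering step under stated assumptions: the disease names and 2-D vectors are made-up placeholders standing in for learned node embeddings, the `kmeans` helper is hypothetical, and the farthest-first seeding is a simple deterministic choice made for this sketch rather than anything taken from the paper.

```python
import math


def kmeans(vectors, k, iterations=50):
    """Plain Lloyd's k-means; returns one cluster label per input vector.

    Seeding is farthest-first (an assumption of this sketch, chosen so the
    result is deterministic): start from the first vector, then repeatedly
    add the vector farthest from all centroids chosen so far.
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors, key=lambda v: min(math.dist(v, c) for c in centroids)
        )
        centroids.append(list(farthest))

    labels = None
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy "embeddings": two well-separated groups of three diseases.
embeddings = {
    "disease_A": [0.0, 0.1],
    "disease_B": [0.2, 0.0],
    "disease_C": [0.1, 0.2],
    "disease_D": [5.0, 5.1],
    "disease_E": [5.2, 4.9],
    "disease_F": [4.9, 5.0],
}

names = list(embeddings)
labels = kmeans([embeddings[name] for name in names], k=2)
clusters = dict(zip(names, labels))
print(clusters)
```

In a real pipeline the input vectors would come from a node embedding model trained on the knowledge graph, and k would be chosen by a model selection procedure; the abstract reports 37 clusters for 3,242 diseases but this record does not describe how that number was selected.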