
dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text

BACKGROUND: Discerning the genetic contributions to complex human diseases is a challenging mandate that demands new types of data and calls for new avenues for advancing the state-of-the-art in computational approaches to uncovering disease etiology. Systems approaches to studying observable phenot...


Bibliographic Details
Main authors: Xu, Rong; Li, Li; Wang, QuanQiu
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3998061/
https://www.ncbi.nlm.nih.gov/pubmed/24725842
http://dx.doi.org/10.1186/1471-2105-15-105
author Xu, Rong
Li, Li
Wang, QuanQiu
author_facet Xu, Rong
Li, Li
Wang, QuanQiu
author_sort Xu, Rong
collection PubMed
description BACKGROUND: Discerning the genetic contributions to complex human diseases is a challenging mandate that demands new types of data and calls for new avenues for advancing the state-of-the-art in computational approaches to uncovering disease etiology. Systems approaches to studying observable phenotypic relationships among diseases are emerging as an active area of research for both novel disease gene discovery and drug repositioning. Currently, systematic study of disease relationships on a phenome-wide scale is limited due to the lack of large-scale machine-understandable disease phenotype relationship knowledge bases. Our study introduces a semi-supervised iterative pattern learning approach that is used to build a precise, large-scale disease-disease risk relationship (D1 → D2) knowledge base (dRiskKB) from a vast corpus of free-text published biomedical literature. RESULTS: 21,354,075 MEDLINE records comprised the text corpus under study. First, we used one typical disease risk-specific syntactic pattern (i.e. “D1 due to D2”) as a seed to automatically discover other patterns specifying similar semantic relationships among diseases. We then extracted D1 → D2 risk pairs from MEDLINE using the learned patterns. We manually evaluated the precisions of the learned patterns and extracted pairs. Finally, we analyzed the correlations between disease-disease risk pairs and their associated genes and drugs. The newly created dRiskKB consists of a total of 34,448 unique D1 → D2 pairs, representing the risk-specific semantic relationships among 12,981 diseases with each disease linked to its associated genes and drugs. The identified patterns are highly precise (average precision of 0.99) in specifying the risk-specific relationships among diseases. The precisions of extracted pairs are 0.919 for those that are exactly matched and 0.988 for those that are partially matched. By comparing the iterative pattern approach starting from different seeds, we demonstrated that our algorithm is robust in terms of seed choice. We show that diseases and their risk diseases as well as diseases with similar risk profiles tend to share both genes and drugs. CONCLUSIONS: This unique dRiskKB, when combined with existing phenotypic, genetic, and genomic datasets, can have profound implications in our deeper understanding of disease etiology and in drug repositioning.
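The seed-driven loop summarized in the abstract (start from “D1 due to D2”, learn further patterns from pairs that the known patterns match, then extract more pairs with the learned patterns) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny disease lexicon, the six example phrases, and the helper names `mentions`, `candidates`, and `bootstrap` are hypothetical and do not come from the article. The abstract does not say how learned patterns were ranked or filtered, so this sketch simply accepts every connector it finds; a real system would need a selection step to keep precision high.

```python
import re

# Hypothetical toy lexicon and corpus (illustrative only, not data from the article).
DISEASES = ["anemia", "chronic kidney disease", "blindness",
            "diabetes", "stroke", "hypertension"]

SENTENCES = [
    "anemia due to chronic kidney disease",
    "anemia caused by chronic kidney disease",
    "blindness due to diabetes",
    "blindness secondary to diabetes",
    "stroke caused by hypertension",
    "stroke secondary to hypertension",
]


def mentions(sentence):
    """Return sorted (start, end, disease) tuples for lexicon terms in the sentence."""
    found = []
    for disease in DISEASES:
        for m in re.finditer(r"\b" + re.escape(disease) + r"\b", sentence):
            found.append((m.start(), m.end(), disease))
    return sorted(found)


def candidates(sentence):
    """Yield (d1, connector, d2) for each pair of adjacent disease mentions."""
    ms = mentions(sentence)
    for (_, end1, d1), (start2, _, d2) in zip(ms, ms[1:]):
        connector = sentence[end1:start2].strip()
        if connector:
            yield d1, connector, d2


def bootstrap(sentences, seed="due to", max_iter=5):
    """Alternate between extracting pairs and learning connectors until nothing changes."""
    triples = [t for s in sentences for t in candidates(s)]
    patterns, pairs = {seed}, set()
    for _ in range(max_iter):
        # Pairs matched by any currently known connector.
        new_pairs = {(d1, d2) for d1, c, d2 in triples if c in patterns}
        # Connectors that link any known pair become new patterns (no filtering here).
        known = pairs | new_pairs
        new_patterns = {c for d1, c, d2 in triples if (d1, d2) in known}
        if new_pairs <= pairs and new_patterns <= patterns:
            break
        pairs |= new_pairs
        patterns |= new_patterns
    return patterns, pairs


patterns, pairs = bootstrap(SENTENCES)
print(sorted(patterns))  # ['caused by', 'due to', 'secondary to']
print(sorted(pairs))
```

In this toy corpus the pair (stroke, hypertension) never occurs with the seed connector; it is only reached in the second iteration, through the connectors learned from the first two pairs, which is the point of iterating.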
format Online
Article
Text
id pubmed-3998061
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3998061 2014-05-08 dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text Xu, Rong Li, Li Wang, QuanQiu BMC Bioinformatics Research Article BACKGROUND: Discerning the genetic contributions to complex human diseases is a challenging mandate that demands new types of data and calls for new avenues for advancing the state-of-the-art in computational approaches to uncovering disease etiology. Systems approaches to studying observable phenotypic relationships among diseases are emerging as an active area of research for both novel disease gene discovery and drug repositioning. Currently, systematic study of disease relationships on a phenome-wide scale is limited due to the lack of large-scale machine-understandable disease phenotype relationship knowledge bases. Our study introduces a semi-supervised iterative pattern learning approach that is used to build a precise, large-scale disease-disease risk relationship (D1 → D2) knowledge base (dRiskKB) from a vast corpus of free-text published biomedical literature. RESULTS: 21,354,075 MEDLINE records comprised the text corpus under study. First, we used one typical disease risk-specific syntactic pattern (i.e. “D1 due to D2”) as a seed to automatically discover other patterns specifying similar semantic relationships among diseases. We then extracted D1 → D2 risk pairs from MEDLINE using the learned patterns. We manually evaluated the precisions of the learned patterns and extracted pairs. Finally, we analyzed the correlations between disease-disease risk pairs and their associated genes and drugs. The newly created dRiskKB consists of a total of 34,448 unique D1 → D2 pairs, representing the risk-specific semantic relationships among 12,981 diseases with each disease linked to its associated genes and drugs. The identified patterns are highly precise (average precision of 0.99) in specifying the risk-specific relationships among diseases. The precisions of extracted pairs are 0.919 for those that are exactly matched and 0.988 for those that are partially matched. By comparing the iterative pattern approach starting from different seeds, we demonstrated that our algorithm is robust in terms of seed choice. We show that diseases and their risk diseases as well as diseases with similar risk profiles tend to share both genes and drugs. CONCLUSIONS: This unique dRiskKB, when combined with existing phenotypic, genetic, and genomic datasets, can have profound implications in our deeper understanding of disease etiology and in drug repositioning. BioMed Central 2014-04-12 /pmc/articles/PMC3998061/ /pubmed/24725842 http://dx.doi.org/10.1186/1471-2105-15-105 Text en Copyright © 2014 Xu et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research Article
Xu, Rong
Li, Li
Wang, QuanQiu
dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
title dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
title_full dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
title_fullStr dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
title_full_unstemmed dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
title_short dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
title_sort driskkb: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3998061/
https://www.ncbi.nlm.nih.gov/pubmed/24725842
http://dx.doi.org/10.1186/1471-2105-15-105