Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine

The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can f...

Bibliographic Details
Main authors: Singhal, Ayush; Simmons, Michael; Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Public Library of Science 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5130168/
https://www.ncbi.nlm.nih.gov/pubmed/27902695
http://dx.doi.org/10.1371/journal.pcbi.1005017
collection PubMed
description The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting disease-gene-variant triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.
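The validation idea described in the abstract (splitting extracted triplets into those that overlap a curated database and those that do not) and the F1-measure used in the benchmark can be illustrated with a minimal sketch. This is not the authors' implementation: the `Triplet` type, the function names, and the placeholder records below are assumptions made purely for illustration, and exact tuple equality stands in for whatever matching rule the paper actually uses.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Triplet(NamedTuple):
    """Hypothetical container for one disease-gene-variant triplet."""
    disease: str
    gene: str
    variant: str


def split_by_overlap(
    extracted: Iterable[Triplet], curated: Iterable[Triplet]
) -> Tuple[Set[Triplet], Set[Triplet]]:
    """Partition extracted triplets by whether they also appear in the curated set.

    Matching is plain tuple equality, which is an assumption of this sketch.
    """
    extracted_set, curated_set = set(extracted), set(curated)
    return extracted_set & curated_set, extracted_set - curated_set


def f1_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Placeholder records for illustration only; these are not data from the paper.
extracted = [
    Triplet("disease_a", "GENE1", "p.A10V"),
    Triplet("disease_a", "GENE2", "p.L20P"),
    Triplet("disease_b", "GENE3", "p.R30Q"),
]
curated = [
    Triplet("disease_a", "GENE1", "p.A10V"),
    Triplet("disease_a", "GENE2", "p.L20P"),
]

overlapping, non_overlapping = split_by_overlap(extracted, curated)
print(len(overlapping), len(non_overlapping))
print(f1_measure(0.8, 0.6))
```

In practice, the non-overlapping set is the one of interest to curators, since it holds candidate triplets that the curated database does not yet contain; the sketch only shows the bookkeeping, not how such candidates would be judged correct.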
format Online
Article
Text
id pubmed-5130168
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5130168 2016-12-15 Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine Singhal, Ayush Simmons, Michael Lu, Zhiyong PLoS Comput Biol Research Article Public Library of Science 2016-11-30 /pmc/articles/PMC5130168/ /pubmed/27902695 http://dx.doi.org/10.1371/journal.pcbi.1005017 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
topic Research Article