Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine

The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can f...

Bibliographic Details
Main Authors:	Singhal, Ayush; Simmons, Michael; Lu, Zhiyong
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5130168/ https://www.ncbi.nlm.nih.gov/pubmed/27902695 http://dx.doi.org/10.1371/journal.pcbi.1005017

_version_	1782470683980201984
author	Singhal, Ayush; Simmons, Michael; Lu, Zhiyong
collection	PubMed
description	The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships.
format	Online Article Text
id	pubmed-5130168
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-5130168 2016-12-15 Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine Singhal, Ayush Simmons, Michael Lu, Zhiyong PLoS Comput Biol Research Article Public Library of Science 2016-11-30 /pmc/articles/PMC5130168/ /pubmed/27902695 http://dx.doi.org/10.1371/journal.pcbi.1005017 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title	Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5130168/ https://www.ncbi.nlm.nih.gov/pubmed/27902695 http://dx.doi.org/10.1371/journal.pcbi.1005017