Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine
The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can f...
Main Authors: | Singhal, Ayush; Simmons, Michael; Lu, Zhiyong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science 2016 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5130168/ https://www.ncbi.nlm.nih.gov/pubmed/27902695 http://dx.doi.org/10.1371/journal.pcbi.1005017 |
_version_ | 1782470683980201984 |
---|---|
author | Singhal, Ayush Simmons, Michael Lu, Zhiyong |
author_facet | Singhal, Ayush Simmons, Michael Lu, Zhiyong |
author_sort | Singhal, Ayush |
collection | PubMed |
description | The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships. |
format | Online Article Text |
id | pubmed-5130168 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-5130168 2016-12-15 Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine Singhal, Ayush Simmons, Michael Lu, Zhiyong PLoS Comput Biol Research Article The practice of precision medicine will ultimately require databases of genes and mutations for healthcare providers to reference in order to understand the clinical implications of each patient’s genetic makeup. Although the highest quality databases require manual curation, text mining tools can facilitate the curation process, increasing accuracy, coverage, and productivity. However, to date there are no available text mining tools that offer high-accuracy performance for extracting such triplets from biomedical literature. In this paper we propose a high-performance machine learning approach to automate the extraction of disease-gene-variant triplets from biomedical literature. Our approach is unique because we identify the genes and protein products associated with each mutation from not just the local text content, but from a global context as well (from the Internet and from all literature in PubMed). Our approach also incorporates protein sequence validation and disease association using a novel text-mining-based machine learning approach. We extract disease-gene-variant triplets from all abstracts in PubMed related to a set of ten important diseases (breast cancer, prostate cancer, pancreatic cancer, lung cancer, acute myeloid leukemia, Alzheimer’s disease, hemochromatosis, age-related macular degeneration (AMD), diabetes mellitus, and cystic fibrosis). We then evaluate our approach in two ways: (1) a direct comparison with the state of the art using benchmark datasets; (2) a validation study comparing the results of our approach with entries in a popular human-curated database (UniProt) for each of the previously mentioned diseases. In the benchmark comparison, our full approach achieves a 28% improvement in F1-measure (from 0.62 to 0.79) over the state-of-the-art results. For the validation study with UniProt Knowledgebase (KB), we present a thorough analysis of the results and errors. Across all diseases, our approach returned 272 triplets (disease-gene-variant) that overlapped with entries in UniProt and 5,384 triplets without overlap in UniProt. Analysis of the overlapping triplets and of a stratified sample of the non-overlapping triplets revealed accuracies of 93% and 80% for the respective categories (cumulative accuracy, 77%). We conclude that our process represents an important and broadly applicable improvement to the state of the art for curation of disease-gene-variant relationships. Public Library of Science 2016-11-30 /pmc/articles/PMC5130168/ /pubmed/27902695 http://dx.doi.org/10.1371/journal.pcbi.1005017 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication. |
spellingShingle | Research Article Singhal, Ayush Simmons, Michael Lu, Zhiyong Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine |
title | Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5130168/ https://www.ncbi.nlm.nih.gov/pubmed/27902695 http://dx.doi.org/10.1371/journal.pcbi.1005017 |
work_keys_str_mv | AT singhalayush textmininggenotypephenotyperelationshipsfrombiomedicalliteraturefordatabasecurationandprecisionmedicine AT simmonsmichael textmininggenotypephenotyperelationshipsfrombiomedicalliteraturefordatabasecurationandprecisionmedicine AT luzhiyong textmininggenotypephenotyperelationshipsfrombiomedicalliteraturefordatabasecurationandprecisionmedicine |
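
The F-measure figures quoted in the description field can be checked with a small calculation. The sketch below is illustrative only and is not taken from the article: the function names `f1_score` and `relative_improvement` are hypothetical, and it assumes the standard definition of F1 as the harmonic mean of precision and recall. Applied to the rounded scores in the abstract (0.62 and 0.79), the relative gain comes out at roughly 27.4%; the reported 28% was presumably computed from unrounded scores, which is an assumption on our part.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_improvement(baseline: float, new: float) -> float:
    """Change from baseline to new, expressed as a fraction of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Rounded F1 scores as quoted in the abstract (not the unrounded originals).
baseline_f1 = 0.62
full_f1 = 0.79

gain = relative_improvement(baseline_f1, full_f1)
print(f"relative improvement: {gain:.1%}")
```

The helper `f1_score` is included only to make the metric explicit; the record does not report the underlying precision and recall values, so none are supplied here.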