Machine learning approach to literature mining for the genetics of complex diseases

To generate a parsimonious gene set for understanding the mechanisms underlying complex diseases, we reasoned it was necessary to combine the curation of public literature, review of experimental databases and interpolation of pathway-associated genes. Using this strategy, we previously built the fo...

Bibliographic Details
Main authors: Schuster, Jessica; Superdock, Michael; Agudelo, Anthony; Stey, Paul; Padbury, James; Sarkar, Indra Neil; Uzun, Alper
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6877776/
https://www.ncbi.nlm.nih.gov/pubmed/31768545
http://dx.doi.org/10.1093/database/baz124
author Schuster, Jessica
Superdock, Michael
Agudelo, Anthony
Stey, Paul
Padbury, James
Sarkar, Indra Neil
Uzun, Alper
collection PubMed
description To generate a parsimonious gene set for understanding the mechanisms underlying complex diseases, we reasoned it was necessary to combine the curation of public literature, review of experimental databases and interpolation of pathway-associated genes. Using this strategy, we previously built the following two databases for reproductive disorders: The Database for Preterm Birth (dbPTB) and The Database for Preeclampsia (dbPEC). The completeness and accuracy of these databases are essential for supporting our understanding of these complex conditions. Given the exponential increase in biomedical literature, it is becoming increasingly difficult to manually maintain these databases. Using our curated databases as reference data sets, we implemented a machine learning-based approach to optimize article selection for manual curation. We used logistic regression, random forests and neural networks as our machine learning algorithms to classify articles. We examined features derived from abstract text, annotations and metadata that we hypothesized would best classify articles with genetically relevant content associated with the disorder of interest. Combinations of these features were used to build the classifiers, and the performance of these feature sets was compared to a standard ‘Bag-of-Words’ baseline. Several combinations of these genetics-based feature sets outperformed ‘Bag-of-Words’ at a threshold such that 95% of the curated gene set obtained from the original manual curation of all articles was extracted from the articles classified by machine learning as ‘considered’. The performance was superior in terms of the reduction of required manual curation and two measures of the harmonic mean of precision and recall. The reduction in workload ranged from 0.814 to 0.846 for the dbPTB and 0.301 to 0.371 for the dbPEC. Additionally, a database of metadata and annotations is generated, which allows for rapid query of individual features. Our results demonstrate that machine learning algorithms can identify articles with relevant data for databases of genes associated with complex diseases.
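The thresholding idea in the description above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy scores and the placeholder gene labels (G1, G2, G3) are all hypothetical, and the definition of workload reduction used here (the fraction of articles that fall below the cutoff and so need no manual review) is an assumption rather than something the record states. The sketch picks the highest classifier score cutoff at which the ‘considered’ articles still cover 95% of the curated gene set, then reports the workload reduction and an F-measure.

```python
from typing import List, Set


def pick_threshold(scores: List[float], genes: List[Set[str]],
                   coverage: float = 0.95) -> float:
    """Highest score cutoff whose 'considered' articles (score >= cutoff)
    still cover at least `coverage` of all curated genes."""
    all_genes = set().union(*genes)
    needed = coverage * len(all_genes)
    covered: Set[str] = set()
    # Walk articles from most to least confident, growing the covered set.
    ranked = sorted(zip(scores, genes), key=lambda pair: pair[0], reverse=True)
    for score, article_genes in ranked:
        covered |= article_genes
        if len(covered) >= needed:
            return score
    return min(scores)


def workload_reduction(scores: List[float], threshold: float) -> float:
    """Assumed definition: share of articles below the cutoff,
    i.e. articles a curator would not need to read."""
    considered = sum(1 for s in scores if s >= threshold)
    return 1.0 - considered / len(scores)


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy data (hypothetical): classifier scores and the genes a curator
# extracted from each article; an empty set means no relevant genes.
scores = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
genes = [{"G1", "G2"}, {"G3"}, set(), {"G2"}, set(), set()]

threshold = pick_threshold(scores, genes)
reduction = workload_reduction(scores, threshold)

# Article-level precision and recall of the 'considered' set.
relevant = [bool(g) for g in genes]
considered = [s >= threshold for s in scores]
true_pos = sum(r and c for r, c in zip(relevant, considered))
precision = true_pos / sum(considered)
recall = true_pos / sum(relevant)

print(threshold, round(reduction, 3), round(f_beta(precision, recall), 3))
```

In this toy case the two highest-scoring articles already cover every gene, so the remaining four articles can be skipped even though one of them is relevant at the article level. That is why a gene-coverage criterion can yield a larger workload reduction than an article-recall criterion.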
format Online
Article
Text
id pubmed-6877776
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6877776 2019-11-29 Database (Oxford) Original Article Oxford University Press 2019-11-26 /pmc/articles/PMC6877776/ /pubmed/31768545 http://dx.doi.org/10.1093/database/baz124 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Machine learning approach to literature mining for the genetics of complex diseases
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6877776/
https://www.ncbi.nlm.nih.gov/pubmed/31768545
http://dx.doi.org/10.1093/database/baz124