Scaling up data curation using deep learning: An application to literature triage in genomic variation resources

Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by qu...

Bibliographic Details
Main authors: Lee, Kyubum; Famiglietti, Maria Livia; McMahon, Aoife; Wei, Chih-Hsuan; MacArthur, Jacqueline Ann Langdon; Poux, Sylvain; Breuza, Lionel; Bridge, Alan; Cunningham, Fiona; Xenarios, Ioannis; Lu, Zhiyong
Format: Online Article Text
Language: English
Published: Public Library of Science, 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107285/
https://www.ncbi.nlm.nih.gov/pubmed/30102703
http://dx.doi.org/10.1371/journal.pcbi.1006390
author Lee, Kyubum
Famiglietti, Maria Livia
McMahon, Aoife
Wei, Chih-Hsuan
MacArthur, Jacqueline Ann Langdon
Poux, Sylvain
Breuza, Lionel
Bridge, Alan
Cunningham, Fiona
Xenarios, Ioannis
Lu, Zhiyong
author_sort Lee, Kyubum
collection PubMed
description Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, which is also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimal queries. To address this, we propose a machine-learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient and is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.
format Online
Article
Text
id pubmed-6107285
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-6107285 2018-08-30 Scaling up data curation using deep learning: An application to literature triage in genomic variation resources Lee, Kyubum Famiglietti, Maria Livia McMahon, Aoife Wei, Chih-Hsuan MacArthur, Jacqueline Ann Langdon Poux, Sylvain Breuza, Lionel Bridge, Alan Cunningham, Fiona Xenarios, Ioannis Lu, Zhiyong PLoS Comput Biol Research Article Public Library of Science 2018-08-13 /pmc/articles/PMC6107285/ /pubmed/30102703 http://dx.doi.org/10.1371/journal.pcbi.1006390 Text en https://creativecommons.org/publicdomain/zero/1.0/ This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 (https://creativecommons.org/publicdomain/zero/1.0/) public domain dedication.
title Scaling up data curation using deep learning: An application to literature triage in genomic variation resources
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107285/
https://www.ncbi.nlm.nih.gov/pubmed/30102703
http://dx.doi.org/10.1371/journal.pcbi.1006390