
Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods

The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of...


Bibliographic Details
Main authors: Weissenbacher, Davy, Sarker, Abeed, Tahsin, Tasnia, Scotch, Matthew, Gonzalez, Graciela
Format: Online Article Text
Language: English
Published: American Medical Informatics Association 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543364/
https://www.ncbi.nlm.nih.gov/pubmed/28815119
_version_ 1783255136294928384
author Weissenbacher, Davy
Sarker, Abeed
Tahsin, Tasnia
Scotch, Matthew
Gonzalez, Graciela
collection PubMed
description The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively.
format Online
Article
Text
id pubmed-5543364
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher American Medical Informatics Association
record_format MEDLINE/PubMed
spelling pubmed-5543364 2017-08-16 Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods Weissenbacher, Davy Sarker, Abeed Tahsin, Tasnia Scotch, Matthew Gonzalez, Graciela AMIA Jt Summits Transl Sci Proc Articles The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively. American Medical Informatics Association 2017-07-26 /pmc/articles/PMC5543364/ /pubmed/28815119 Text en ©2017 AMIA - All rights reserved. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose
title Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods
topic Articles