Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods
The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of...
Main Authors: | Weissenbacher, Davy; Sarker, Abeed; Tahsin, Tasnia; Scotch, Matthew; Gonzalez, Graciela |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Medical Informatics Association 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543364/ https://www.ncbi.nlm.nih.gov/pubmed/28815119 |
_version_ | 1783255136294928384 |
---|---|
author | Weissenbacher, Davy Sarker, Abeed Tahsin, Tasnia Scotch, Matthew Gonzalez, Graciela |
author_facet | Weissenbacher, Davy Sarker, Abeed Tahsin, Tasnia Scotch, Matthew Gonzalez, Graciela |
author_sort | Weissenbacher, Davy |
collection | PubMed |
description | The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively. |
format | Online Article Text |
id | pubmed-5543364 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | American Medical Informatics Association |
record_format | MEDLINE/PubMed |
spelling | pubmed-55433642017-08-16 Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods Weissenbacher, Davy Sarker, Abeed Tahsin, Tasnia Scotch, Matthew Gonzalez, Graciela AMIA Jt Summits Transl Sci Proc Articles The field of phylogeography allows researchers to model the spread and evolution of viral genetic sequences. Phylogeography plays a major role in infectious disease surveillance, viral epidemiology and vaccine design. When conducting viral phylogeographic studies, researchers require the location of the infected host of the virus, which is often present in public databases such as GenBank. However, the geographic metadata in most GenBank records is not precise enough for many phylogeographic studies; therefore, researchers often need to search the articles linked to the records for more information, which can be a tedious process. Here, we describe two approaches for automatically detecting geographic location mentions in articles pertaining to virus-related GenBank records: a supervised sequence labeling approach with innovative features and a distant-supervision approach with novel noise-reduction methods. Evaluated on a manually annotated gold standard, our supervised sequence labeling and distant supervision approaches attained F-scores of 0.81 and 0.66, respectively. American Medical Informatics Association 2017-07-26 /pmc/articles/PMC5543364/ /pubmed/28815119 Text en ©2017 AMIA - All rights reserved. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose |
spellingShingle | Articles Weissenbacher, Davy Sarker, Abeed Tahsin, Tasnia Scotch, Matthew Gonzalez, Graciela Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods |
title | Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods |
title_full | Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods |
title_fullStr | Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods |
title_full_unstemmed | Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods |
title_short | Extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods |
title_sort | extracting geographic locations from the literature for virus phylogeography using supervised and distant supervision methods |
topic | Articles |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543364/ https://www.ncbi.nlm.nih.gov/pubmed/28815119 |
work_keys_str_mv | AT weissenbacherdavy extractinggeographiclocationsfromtheliteratureforvirusphylogeographyusingsupervisedanddistantsupervisionmethods AT sarkerabeed extractinggeographiclocationsfromtheliteratureforvirusphylogeographyusingsupervisedanddistantsupervisionmethods AT tahsintasnia extractinggeographiclocationsfromtheliteratureforvirusphylogeographyusingsupervisedanddistantsupervisionmethods AT scotchmatthew extractinggeographiclocationsfromtheliteratureforvirusphylogeographyusingsupervisedanddistantsupervisionmethods AT gonzalezgraciela extractinggeographiclocationsfromtheliteratureforvirusphylogeographyusingsupervisedanddistantsupervisionmethods |