
GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography

SUMMARY: We present GeoBoost2, a natural language-processing pipeline that extracts the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information’s GenBank for downstream analysis including phylogeography and genomic epi...


Bibliographic Details
Main authors: Magge, Arjun, Weissenbacher, Davy, O’Connor, Karen, Tahsin, Tasnia, Gonzalez-Hernandez, Graciela, Scotch, Matthew
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7755405/
https://www.ncbi.nlm.nih.gov/pubmed/32683454
http://dx.doi.org/10.1093/bioinformatics/btaa647
_version_ 1783626348766429184
author Magge, Arjun
Weissenbacher, Davy
O’Connor, Karen
Tahsin, Tasnia
Gonzalez-Hernandez, Graciela
Scotch, Matthew
author_facet Magge, Arjun
Weissenbacher, Davy
O’Connor, Karen
Tahsin, Tasnia
Gonzalez-Hernandez, Graciela
Scotch, Matthew
author_sort Magge, Arjun
collection PubMed
description SUMMARY: We present GeoBoost2, a natural language-processing pipeline that extracts the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information’s GenBank for downstream analysis including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and across borders. In this article, we describe the enhancements over our earlier release, including improved end-to-end extraction performance and speed, a fully functional web interface and state-of-the-art deep learning methods for location extraction. AVAILABILITY AND IMPLEMENTATION: The application is freely available on the web at https://zodo.asu.edu/geoboost2. Source code, usage examples and annotated data for GeoBoost2 are freely available at https://github.com/ZooPhy/geoboost2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-7755405
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7755405 2020-12-29 GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography Magge, Arjun Weissenbacher, Davy O’Connor, Karen Tahsin, Tasnia Gonzalez-Hernandez, Graciela Scotch, Matthew Bioinformatics Applications Notes SUMMARY: We present GeoBoost2, a natural language-processing pipeline that extracts the location of infected hosts to enrich metadata in nucleotide sequence repositories such as the National Center for Biotechnology Information’s GenBank for downstream analysis including phylogeography and genomic epidemiology. The increasing number of pathogen sequences requires complementary information extraction methods for focused research, including surveillance within countries and across borders. In this article, we describe the enhancements over our earlier release, including improved end-to-end extraction performance and speed, a fully functional web interface and state-of-the-art deep learning methods for location extraction. AVAILABILITY AND IMPLEMENTATION: The application is freely available on the web at https://zodo.asu.edu/geoboost2. Source code, usage examples and annotated data for GeoBoost2 are freely available at https://github.com/ZooPhy/geoboost2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2020-07-19 /pmc/articles/PMC7755405/ /pubmed/32683454 http://dx.doi.org/10.1093/bioinformatics/btaa647 Text en © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Applications Notes
Magge, Arjun
Weissenbacher, Davy
O’Connor, Karen
Tahsin, Tasnia
Gonzalez-Hernandez, Graciela
Scotch, Matthew
GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography
title GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography
title_full GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography
title_fullStr GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography
title_full_unstemmed GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography
title_short GeoBoost2: a natural language processing pipeline for GenBank metadata enrichment for virus phylogeography
title_sort geoboost2: a natural language processing pipeline for genbank metadata enrichment for virus phylogeography
topic Applications Notes
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7755405/
https://www.ncbi.nlm.nih.gov/pubmed/32683454
http://dx.doi.org/10.1093/bioinformatics/btaa647
work_keys_str_mv AT maggearjun geoboost2anaturallanguageprocessingpipelineforgenbankmetadataenrichmentforvirusphylogeography
AT weissenbacherdavy geoboost2anaturallanguageprocessingpipelineforgenbankmetadataenrichmentforvirusphylogeography
AT oconnorkaren geoboost2anaturallanguageprocessingpipelineforgenbankmetadataenrichmentforvirusphylogeography
AT tahsintasnia geoboost2anaturallanguageprocessingpipelineforgenbankmetadataenrichmentforvirusphylogeography
AT gonzalezhernandezgraciela geoboost2anaturallanguageprocessingpipelineforgenbankmetadataenrichmentforvirusphylogeography
AT scotchmatthew geoboost2anaturallanguageprocessingpipelineforgenbankmetadataenrichmentforvirusphylogeography