
Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research

GenBank is a popular National Center for Biotechnology Information (NCBI) database for submission and analysis of DNA sequences for biomedical research. The resource is part of the Entrez environment, which enables cross-linking of concepts and entries in other participating NCBI databases such a...


Bibliographic Details
Main authors: Tahsin, Tasnia; Weissenbacher, Davy; Jones-Shargani, Demetrius; Magee, Daniel; Vaiente, Matteo; Gonzalez, Graciela; Scotch, Matthew
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6225896/
https://www.ncbi.nlm.nih.gov/pubmed/30412219
http://dx.doi.org/10.1093/database/bax093
_version_ 1783369868611944448
author Tahsin, Tasnia
Weissenbacher, Davy
Jones-Shargani, Demetrius
Magee, Daniel
Vaiente, Matteo
Gonzalez, Graciela
Scotch, Matthew
author_facet Tahsin, Tasnia
Weissenbacher, Davy
Jones-Shargani, Demetrius
Magee, Daniel
Vaiente, Matteo
Gonzalez, Graciela
Scotch, Matthew
author_sort Tahsin, Tasnia
collection PubMed
description GenBank is a popular National Center for Biotechnology Information (NCBI) database for submission and analysis of DNA sequences for biomedical research. The resource is part of the Entrez environment, which enables cross-linking of concepts and entries in other participating NCBI databases such as Taxonomy, PubMed and Protein. For example, a GenBank record of an influenza A hemagglutinin gene DNA sequence might have a link to the Taxonomy database for the organism, a link to the related article in PubMed (if published) and a link to the Protein entry for the hemagglutinin protein. Despite their importance in biomedical research such as population genetics, phylogeography and public health surveillance, the host and geospatial metadata of genetic sequences in GenBank are not linked to any database. Therefore, to facilitate biomedical research based on georeferenced DNA sequences and/or DNA sequences with normalized host names, we designed and developed a framework that enriches GenBank entries by linking their host metadata to the NCBI Taxonomy database and their geospatial metadata to GeoNames, a comprehensive knowledge base of geographic locations. Here, we introduce a database created by applying this framework to virus sequences in GenBank, and evaluate our normalization algorithms on a set of manually annotated records pertaining to viruses. Although currently applied to viruses, our framework can be easily extended to other organisms, and we discuss the potential use of our resource for biomedical research. Database URL: https://zodo.asu.edu/zoophydb/
format Online
Article
Text
id pubmed-6225896
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6225896 2018-11-14 Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research Tahsin, Tasnia Weissenbacher, Davy Jones-Shargani, Demetrius Magee, Daniel Vaiente, Matteo Gonzalez, Graciela Scotch, Matthew Database (Oxford) Original Article GenBank is a popular National Center for Biotechnology Information (NCBI) database for submission and analysis of DNA sequences for biomedical research. The resource is part of the Entrez environment, which enables cross-linking of concepts and entries in other participating NCBI databases such as Taxonomy, PubMed and Protein. For example, a GenBank record of an influenza A hemagglutinin gene DNA sequence might have a link to the Taxonomy database for the organism, a link to the related article in PubMed (if published) and a link to the Protein entry for the hemagglutinin protein. Despite their importance in biomedical research such as population genetics, phylogeography and public health surveillance, the host and geospatial metadata of genetic sequences in GenBank are not linked to any database. Therefore, to facilitate biomedical research based on georeferenced DNA sequences and/or DNA sequences with normalized host names, we designed and developed a framework that enriches GenBank entries by linking their host metadata to the NCBI Taxonomy database and their geospatial metadata to GeoNames, a comprehensive knowledge base of geographic locations. Here, we introduce a database created by applying this framework to virus sequences in GenBank, and evaluate our normalization algorithms on a set of manually annotated records pertaining to viruses. Although currently applied to viruses, our framework can be easily extended to other organisms, and we discuss the potential use of our resource for biomedical research.
Database URL: https://zodo.asu.edu/zoophydb/ Oxford University Press 2017-12-28 /pmc/articles/PMC6225896/ /pubmed/30412219 http://dx.doi.org/10.1093/database/bax093 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Tahsin, Tasnia
Weissenbacher, Davy
Jones-Shargani, Demetrius
Magee, Daniel
Vaiente, Matteo
Gonzalez, Graciela
Scotch, Matthew
Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research
title Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6225896/
https://www.ncbi.nlm.nih.gov/pubmed/30412219
http://dx.doi.org/10.1093/database/bax093