Mining locus tags in PubMed Central to improve microbial gene annotation
BACKGROUND: The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases. We propose to utilize the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database...
Main authors: | Stubben, Chris J; Challacombe, Jean F |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3937057/ https://www.ncbi.nlm.nih.gov/pubmed/24499370 http://dx.doi.org/10.1186/1471-2105-15-43 |
author | Stubben, Chris J; Challacombe, Jean F |
---|---|
collection | PubMed |
description | BACKGROUND: The scientific literature contains millions of microbial gene identifiers within the full text and tables, but these annotations rarely get incorporated into public sequence databases. We propose to utilize the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed an R package called pmcXML to automatically mine and extract locus tags from full text, tables and supplements. RESULTS: We mined locus tags from 1835 OA publications in ten microbial genomes and extracted tags mentioned in 30,891 sentences in main text and 20,489 rows in tables. We identified locus tag pairs marking the start and end of a region such as an operon or genomic island and expanded these ranges to add another 13,043 tags. We also searched for locus tags in supplementary tables and publications outside the OA subset in Burkholderia pseudomallei K96243 for comparison. There were 168 publications containing 48,470 locus tags and 83% of mentions were from supplementary materials and 9% from publications outside the OA subset. CONCLUSIONS: B. pseudomallei locus tags within the full text and tables of OA publications represent only a small fraction of the total mentions in the literature. For microbial genomes with very few functionally characterized proteins, the locus tags mentioned in supplementary tables and within ranges like genomic islands contain the majority of locus tags. Significantly, the functions in the R package provide access to additional resources in the OA subset that are not currently indexed or returned by searching PMC. |
format | Online Article Text |
id | pubmed-3937057 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-3937057 2014-02-28 BMC Bioinformatics Software BioMed Central 2014-02-05 /pmc/articles/PMC3937057/ /pubmed/24499370 http://dx.doi.org/10.1186/1471-2105-15-43 Text en Copyright © 2014 Stubben and Challacombe; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. |
title | Mining locus tags in PubMed Central to improve microbial gene annotation |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3937057/ https://www.ncbi.nlm.nih.gov/pubmed/24499370 http://dx.doi.org/10.1186/1471-2105-15-43 |
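
The description in this record mentions locus tag pairs that mark the start and end of a region, such as an operon or genomic island, and says these ranges were expanded to recover additional tags. The Python sketch below illustrates that idea only; it is not the pmcXML implementation, which is an R package whose functions are not shown in this record. The pattern, the function names `expand_range` and `find_range_tags`, and the assumption that tags in a range are numbered consecutively with a shared prefix are all hypothetical. Real genomes can skip numbers, so a careful expansion would check candidates against the genome annotation.

```python
import re

# Hypothetical tag pattern: alphabetic prefix, optional underscore, digits
# (for example BPSL0001 or BPSS0010 in B. pseudomallei K96243).
TAG = r"([A-Za-z]+_?)(\d+)"
# Two tags joined by a hyphen, an en dash, or the word "to".
RANGE_RE = re.compile(TAG + r"\s*(?:[-–]|to)\s*" + TAG)


def expand_range(start, end):
    """Return every tag from start to end, assuming consecutive numbering."""
    m1 = re.fullmatch(TAG, start)
    m2 = re.fullmatch(TAG, end)
    # Only expand when both tags parse and share the same prefix.
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        return []
    low, high = int(m1.group(2)), int(m2.group(2))
    if high < low:
        return []
    prefix = m1.group(1)
    width = len(m1.group(2))  # preserve zero padding
    return [f"{prefix}{n:0{width}d}" for n in range(low, high + 1)]


def find_range_tags(text):
    """Find tag ranges in a sentence and expand each one."""
    tags = []
    for m in RANGE_RE.finditer(text):
        tags.extend(expand_range(m.group(1) + m.group(2),
                                 m.group(3) + m.group(4)))
    return tags


if __name__ == "__main__":
    print(find_range_tags("The island spans BPSS0010-BPSS0012 in K96243."))
```

Ranges whose two ends have different prefixes are skipped rather than guessed, since such a pair may span replicons or be a false match.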