
Thousands of missed genes found in bacterial genomes and their analysis with COMBREX

BACKGROUND: The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing...


Bibliographic Details
Main Authors: Wood, Derrick E, Lin, Henry, Levy-Moonshine, Ami, Swaminathan, Rajiswari, Chang, Yi-Chien, Anton, Brian P, Osmani, Lais, Steffen, Martin, Kasif, Simon, Salzberg, Steven L
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3534567/
https://www.ncbi.nlm.nih.gov/pubmed/23111013
http://dx.doi.org/10.1186/1745-6150-7-37
author Wood, Derrick E
Lin, Henry
Levy-Moonshine, Ami
Swaminathan, Rajiswari
Chang, Yi-Chien
Anton, Brian P
Osmani, Lais
Steffen, Martin
Kasif, Simon
Salzberg, Steven L
author_sort Wood, Derrick E
collection PubMed
description BACKGROUND: The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST. RESULTS: By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs to non-hypothetical proteins, and 11,792 likely missed genes that are homologs only to hypothetical proteins, yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. We also estimate the likelihood that each potential missing gene found is a genuine protein-coding gene using COMBREX. CONCLUSIONS: Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers, and many of the missed genes are short genes <300 bp. Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. 359 of these genes, found in pathogenic organisms, may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX’s website. REVIEWERS: This article was reviewed by Daniel Haft, Arcady Mushegian, and M. Pilar Francino (nominated by David Ardell).
format Online
Article
Text
id pubmed-3534567
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3534567 2013-01-03 Biol Direct Research BioMed Central 2012-10-30 /pmc/articles/PMC3534567/ /pubmed/23111013 http://dx.doi.org/10.1186/1745-6150-7-37 Text en Copyright ©2012 Wood et al.; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Thousands of missed genes found in bacterial genomes and their analysis with COMBREX
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3534567/
https://www.ncbi.nlm.nih.gov/pubmed/23111013
http://dx.doi.org/10.1186/1745-6150-7-37