Missing genes in the annotation of prokaryotic genomes

BACKGROUND: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). There...

Bibliographic Details
Main authors: Warren, Andrew S; Archuleta, Jeremy; Feng, Wu-chun; Setubal, João Carlos
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098052/
https://www.ncbi.nlm.nih.gov/pubmed/20230630
http://dx.doi.org/10.1186/1471-2105-11-131
_version_ 1782203906242117632
author Warren, Andrew S
Archuleta, Jeremy
Feng, Wu-chun
Setubal, João Carlos
author_sort Warren, Andrew S
collection PubMed
description BACKGROUND: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). Therefore, the question arises as to whether current genome annotations are systematically missing small genes. RESULTS: We have developed a high-performance computing methodology to investigate this problem. In this methodology, we compare all ORFs larger than or equal to 33 aa from all fully sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs. CONCLUSIONS: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
format Text
id pubmed-3098052
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3098052 2011-05-20 Missing genes in the annotation of prokaryotic genomes Warren, Andrew S Archuleta, Jeremy Feng, Wu-chun Setubal, João Carlos BMC Bioinformatics Methodology Article
BioMed Central 2010-03-15 /pmc/articles/PMC3098052/ /pubmed/20230630 http://dx.doi.org/10.1186/1471-2105-11-131 Text en Copyright ©2010 Warren et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Missing genes in the annotation of prokaryotic genomes
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098052/
https://www.ncbi.nlm.nih.gov/pubmed/20230630
http://dx.doi.org/10.1186/1471-2105-11-131