Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes

BACKGROUND: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gen...

Bibliographic Details
Main authors: Bocs, Stéphanie, Danchin, Antoine, Médigue, Claudine
Format: Text
Language: English
Published: BioMed Central 2002
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC77393/
https://www.ncbi.nlm.nih.gov/pubmed/11879526
http://dx.doi.org/10.1186/1471-2105-3-5
author Bocs, Stéphanie
Danchin, Antoine
Médigue, Claudine
collection PubMed
description BACKGROUND: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is embedded in the approach. RESULTS: We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank. CONCLUSIONS: The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular updating and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence to reduce the unfortunate spreading of errors through centralized data libraries).
format Text
id pubmed-77393
institution National Center for Biotechnology Information
language English
publishDate 2002
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-77393 2002-03-07 Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes Bocs, Stéphanie Danchin, Antoine Médigue, Claudine BMC Bioinformatics Research article BioMed Central 2002-02-05 /pmc/articles/PMC77393/ /pubmed/11879526 http://dx.doi.org/10.1186/1471-2105-3-5 Text en Copyright ©2002 Bocs et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
title Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes
topic Research article