
Automatic annotation of eukaryotic genes, pseudogenes and promoters

BACKGROUND: The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudo...


Bibliographic Details
Main authors: Solovyev, Victor; Kosarev, Peter; Seledsov, Igor; Vorobyev, Denis
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1810547/
https://www.ncbi.nlm.nih.gov/pubmed/16925832
http://dx.doi.org/10.1186/gb-2006-7-s1-s10
author Solovyev, Victor
Kosarev, Peter
Seledsov, Igor
Vorobyev, Denis
collection PubMed
description BACKGROUND: The ENCODE gene prediction workshop (EGASP) has been organized to evaluate how well state-of-the-art automatic gene finding methods are able to reproduce the manual and experimental gene annotation of the human genome. We have used Softberry gene finding software to predict genes, pseudogenes and promoters in 44 selected ENCODE sequences representing approximately 1% (30 Mb) of the human genome. Predictions of gene finding programs were evaluated in terms of their ability to reproduce the ENCODE-HAVANA annotation. RESULTS: The Fgenesh++ gene prediction pipeline can identify 91% of coding nucleotides with a specificity of 90%. Our automatic pseudogene finder (PSF program) found 90% of the manually annotated pseudogenes and some new ones. The Fprom promoter prediction program identifies 80% of TATA promoter sequences with one false positive prediction per 2,000 base-pairs (bp) and 50% of TATA-less promoters with one false positive prediction per 650 bp. It can be used to identify transcription start sites upstream of annotated coding parts of genes found by gene prediction software. CONCLUSION: We review our software and underlying methods for identifying these three important structural and functional genome components and discuss the accuracy of predictions, recent advances and open problems in annotating genomic sequences. We have demonstrated that our methods can be effectively used for initial automatic annotation of the eukaryotic genome.
format Text
id pubmed-1810547
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1810547 2007-03-07 Genome Biol Research BioMed Central 2006 2006-08-07 /pmc/articles/PMC1810547/ /pubmed/16925832 http://dx.doi.org/10.1186/gb-2006-7-s1-s10 Text en Copyright © 2006 Solovyev et al.; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Automatic annotation of eukaryotic genes, pseudogenes and promoters
topic Research