Gene identification in novel eukaryotic genomes by self-training algorithm

Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting...

Bibliographic Details
Main Authors: Lomsadze, Alexandre, Ter-Hovhannisyan, Vardges, Chernoff, Yury O., Borodovsky, Mark
Format: Text
Language: English
Published: Oxford University Press 2005
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1298918/
https://www.ncbi.nlm.nih.gov/pubmed/16314312
http://dx.doi.org/10.1093/nar/gki937
_version_ 1782126259837337600
author Lomsadze, Alexandre
Ter-Hovhannisyan, Vardges
Chernoff, Yury O.
Borodovsky, Mark
collection PubMed
description Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, the genomic organization of novel eukaryotic genomes is diverse, and ab initio gene-finding tools tuned for previously studied species are rarely suitable for effective gene finding in the DNA sequences of a new genome. Gene identification methods based on mapping cDNA and expressed sequence tags (ESTs) to genomic DNA, or on alignments to closely related genomes, rely on the existence of abundant cDNA and EST data and/or the availability of reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, none of these types of data may be available in sufficient amounts until rather late stages of sequencing a novel genome. Nevertheless, we have shown that gene finding in eukaryotic genomes can be carried out in parallel with estimation of the statistical model directly from as yet anonymous genomic DNA. The suggested method of combining gene prediction with model parameter estimation follows the path of iterative Viterbi training: rounds of labeling the genomic sequence into coding and non-coding regions are followed by rounds of model parameter estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably to or better than conventional methods in which supervised model training precedes the gene prediction step. Several novel genomes have been analyzed, and biologically interesting findings are discussed. Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
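The alternation described in the abstract (label the sequence, then re-estimate the model from those labels) can be illustrated with a minimal sketch of Viterbi training. This is not the authors' implementation: the two-state model, the single-nucleotide emissions, the function names, and the `min_stay` floor on self-transition probabilities are all assumptions made for illustration. The floor is only a crude stand-in for the "restrictions on the possible range of model parameters", which the abstract does not specify.

```python
import math
import random

STATES = ("noncoding", "coding")  # toy two-state model (assumption)
ALPHABET = "ACGT"


def viterbi(seq, start, trans, emit):
    """Return the most likely state path for seq, computed in log space."""
    n = len(STATES)
    score = [math.log(start[s]) + math.log(emit[s][seq[0]]) for s in range(n)]
    back = []
    for ch in seq[1:]:
        new_score, pointers = [], []
        for s in range(n):
            prev = max(range(n), key=lambda p: score[p] + math.log(trans[p][s]))
            pointers.append(prev)
            new_score.append(
                score[prev] + math.log(trans[prev][s]) + math.log(emit[s][ch])
            )
        score = new_score
        back.append(pointers)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):  # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def reestimate(seq, path, pseudo=1.0, min_stay=0.8):
    """Re-estimate parameters by counting along the current labeling."""
    n = len(STATES)
    emit_counts = [{c: pseudo for c in ALPHABET} for _ in range(n)]
    trans_counts = [[pseudo] * n for _ in range(n)]
    for i, (ch, s) in enumerate(zip(seq, path)):
        emit_counts[s][ch] += 1
        if i > 0:
            trans_counts[path[i - 1]][s] += 1
    emit = [
        {c: cnt / sum(row.values()) for c, cnt in row.items()}
        for row in emit_counts
    ]
    trans = []
    for s, row in enumerate(trans_counts):
        # Hypothetical restriction: keep each state "sticky" so early noisy
        # labelings cannot collapse the model into rapid state switching.
        stay = max(row[s] / sum(row), min_stay)
        other = (1.0 - stay) / (n - 1)
        trans.append([stay if t == s else other for t in range(n)])
    return trans, emit


def viterbi_training(seq, trans, emit, start=(0.5, 0.5), max_rounds=20):
    """Alternate labeling and re-estimation until the labeling stops changing."""
    path = None
    for _ in range(max_rounds):
        new_path = viterbi(seq, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        trans, emit = reestimate(seq, path)
    return path, trans, emit


def toy_sequence(seed=0, length=200):
    """Synthetic data: an AT-rich stretch followed by a GC-rich stretch."""
    rng = random.Random(seed)
    at_rich = "".join(rng.choice("AT" * 4 + "GC") for _ in range(length))
    gc_rich = "".join(rng.choice("GC" * 4 + "AT") for _ in range(length))
    return at_rich + gc_rich


if __name__ == "__main__":
    seq = toy_sequence()
    init_trans = [[0.9, 0.1], [0.1, 0.9]]
    init_emit = [
        {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
        {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    ]
    path, trans, emit = viterbi_training(seq, init_trans, init_emit)
    print(sum(path), "positions labeled as", STATES[1])
```

The sketch starts from a weakly informative guess and never uses labeled training genes, which is the point of self-training. A realistic eukaryotic gene finder would need a far richer state structure (exons, introns, splice sites, durations) than this toy model.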
format Text
id pubmed-1298918
institution National Center for Biotechnology Information
language English
publishDate 2005
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-1298918 2005-12-02 Gene identification in novel eukaryotic genomes by self-training algorithm Nucleic Acids Res Article Oxford University Press 2005 2005-11-28 Text en © The Author 2005. Published by Oxford University Press. All rights reserved