EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance

BACKGROUND: Contrary to other areas of sequence analysis, a measure of statistical significance of a putative gene has not been devised to help in discriminating real genes from the masses of random Open Reading Frames (ORFs) in prokaryotic genomes. Therefore, many genomes have too many short ORFs a...

Bibliographic Details
Main Authors: Larsen, Thomas Schou, Krogh, Anders
Format: Text
Language: English
Published: BioMed Central 2003
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC521197/
https://www.ncbi.nlm.nih.gov/pubmed/12783628
http://dx.doi.org/10.1186/1471-2105-4-21
_version_ 1782121831536263168
author Larsen, Thomas Schou
Krogh, Anders
collection PubMed
description BACKGROUND: Contrary to other areas of sequence analysis, a measure of statistical significance of a putative gene has not been devised to help in discriminating real genes from the masses of random Open Reading Frames (ORFs) in prokaryotic genomes. Therefore, many genomes have too many short ORFs annotated as genes. RESULTS: In this paper, we present a new automated gene-finding method, EasyGene, which estimates the statistical significance of a predicted gene. The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. Using extensions of similarities in Swiss-Prot, a high quality training set of genes is automatically extracted from the genome and used to estimate the HMM. Putative genes are then scored with the HMM, and based on score and length of an ORF, the statistical significance is calculated. The measure of statistical significance for an ORF is the expected number of ORFs in one megabase of random sequence at the same significance level or better, where the random sequence has the same statistics as the genome in the sense of a third order Markov chain. CONCLUSIONS: The result is a flexible gene finder whose overall performance matches or exceeds other methods. The entire pipeline of computer processing from the raw input of a genome or set of contigs to a list of putative genes with significance is automated, making it easy to apply EasyGene to newly sequenced organisms. EasyGene with pre-trained models can be accessed at .
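The significance measure in the description above (the expected number of ORFs per megabase of random sequence, with the random sequence drawn from a third-order Markov chain fitted to the genome) can be illustrated with a rough sketch. This is not the authors' implementation: EasyGene combines an HMM score with ORF length, whereas the sketch below uses ORF length alone as a stand-in for the score and estimates the expectation by simulation. The function names, the start codon set, and the forward-strand-only scan are all assumptions made for illustration.

```python
import random
from collections import defaultdict

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumed start codon set, not from the record


def fit_markov3(genome):
    """Count next-base occurrences for every 3-base context (third-order chain)."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(genome) - 3):
        counts[genome[i:i + 3]][genome[i + 3]] += 1
    return counts


def sample_sequence(counts, length, rng):
    """Draw a random sequence whose 4-mer statistics follow the fitted counts."""
    contexts = list(counts)
    seq = list(rng.choice(contexts))
    while len(seq) < length:
        nxt = counts.get("".join(seq[-3:]))
        if not nxt:
            # Context never seen with a successor: restart from an observed one.
            seq.extend(rng.choice(contexts))
            continue
        bases = list(nxt)
        weights = [nxt[b] for b in bases]
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq[:length])


def count_orfs(seq, min_len):
    """Count forward-strand ORFs (first start codon through the next in-frame
    stop codon) that span at least min_len bases."""
    n = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                if start is not None and i + 3 - start >= min_len:
                    n += 1
                start = None
            elif start is None and codon in START_CODONS:
                start = i
    return n


def expected_orfs_per_mb(genome, min_len, n_bases=1_000_000, seed=0):
    """Simulation estimate of ORFs of length >= min_len per megabase of
    random sequence. Length is a stand-in for the HMM score (assumption)."""
    rng = random.Random(seed)
    random_seq = sample_sequence(fit_markov3(genome), n_bases, rng)
    return count_orfs(random_seq, min_len) * 1_000_000 / n_bases
```

Under these assumptions, a low value of `expected_orfs_per_mb(genome, min_len)` means that an ORF of that length would rarely arise by chance in sequence with the genome's local composition, which is the intuition behind ranking ORFs by significance.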
format Text
id pubmed-521197
institution National Center for Biotechnology Information
language English
publishDate 2003
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-521197 2004-10-04 EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance Larsen, Thomas Schou Krogh, Anders BMC Bioinformatics Research Article BioMed Central 2003-06-03 /pmc/articles/PMC521197/ /pubmed/12783628 http://dx.doi.org/10.1186/1471-2105-4-21 Text en Copyright © 2003 Larsen and Krogh; licensee BioMed Central Ltd.
This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.
title EasyGene – a prokaryotic gene finder that ranks ORFs by statistical significance
topic Research Article