
An empirical analysis of training protocols for probabilistic gene finders

BACKGROUND: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been...

Full description

Bibliographic Details
Main Authors: Majoros, William H, Salzberg, Steven L
Format: Text
Language: English
Published: BioMed Central 2004
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC544851/
https://www.ncbi.nlm.nih.gov/pubmed/15613242
http://dx.doi.org/10.1186/1471-2105-5-206
_version_ 1782122165066268672
author Majoros, William H
Salzberg, Steven L
author_facet Majoros, William H
Salzberg, Steven L
author_sort Majoros, William H
collection PubMed
description BACKGROUND: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters. RESULTS: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method. CONCLUSIONS: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.
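The global discriminative training the abstract describes — tuning the weights that combine submodel scores by generalized gradient ascent on whole-gene prediction accuracy, rather than relying on submodel-level maximum likelihood alone — can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the `accuracy` objective here is a smooth toy stand-in for re-running the gene finder on a training set, and all parameter names are hypothetical.

```python
# Sketch of generalized gradient ascent over a gene finder's global
# parameters. In a real GHMM gene finder, `accuracy` would score the
# finder's predictions on held-in training sequences; here it is a toy
# smooth objective with a known maximum so the loop is self-contained.
def accuracy(weights):
    # Illustrative stand-in: peaks at weights == (1.0, 2.0, 0.5).
    target = [1.0, 2.0, 0.5]
    return -sum((w - t) ** 2 for w, t in zip(weights, target))

def gradient_ascent(objective, weights, step=0.1, eps=1e-4, iters=500):
    """Numerically estimate the gradient (central differences) and
    take fixed-size ascent steps — 'generalized' in that the objective
    need not be differentiable in closed form."""
    w = list(weights)
    for _ in range(iters):
        grad = []
        for i in range(len(w)):
            w_hi = list(w); w_hi[i] += eps
            w_lo = list(w); w_lo[i] -= eps
            grad.append((objective(w_hi) - objective(w_lo)) / (2 * eps))
        w = [wi + step * gi for wi, gi in zip(w, grad)]
    return w

# Start from the maximum-likelihood-style initialization (here: zeros)
# and refine discriminatively.
tuned = gradient_ascent(accuracy, [0.0, 0.0, 0.0])
```

Note the structure the paper argues for: a likelihood-based initialization followed by a discriminative refinement of the global parameter structure, with test data kept strictly segregated from both phases.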
format Text
id pubmed-544851
institution National Center for Biotechnology Information
language English
publishDate 2004
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-544851 2005-01-21 An empirical analysis of training protocols for probabilistic gene finders Majoros, William H Salzberg, Steven L BMC Bioinformatics Research Article BACKGROUND: Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters. RESULTS: We decided to investigate the utility of applying a more systematic optimization approach to the tuning of global parameter structure by implementing a global discriminative training procedure for our GHMM-based gene finder. Our results show that significant improvement in prediction accuracy can be achieved by this method. CONCLUSIONS: We conclude that training of GHMM-based gene finders is best performed using some form of discriminative training rather than simple maximum likelihood estimation at the submodel level, and that generalized gradient ascent methods are suitable for this task. We also conclude that partitioning of training data for the twin purposes of maximum likelihood initialization and gradient ascent optimization appears to be unnecessary, but that strict segregation of test data must be enforced during final gene finder evaluation to avoid artificially inflated accuracy measurements.
BioMed Central 2004-12-21 /pmc/articles/PMC544851/ /pubmed/15613242 http://dx.doi.org/10.1186/1471-2105-5-206 Text en Copyright © 2004 Majoros and Salzberg; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Majoros, William H
Salzberg, Steven L
An empirical analysis of training protocols for probabilistic gene finders
title An empirical analysis of training protocols for probabilistic gene finders
title_full An empirical analysis of training protocols for probabilistic gene finders
title_fullStr An empirical analysis of training protocols for probabilistic gene finders
title_full_unstemmed An empirical analysis of training protocols for probabilistic gene finders
title_short An empirical analysis of training protocols for probabilistic gene finders
title_sort empirical analysis of training protocols for probabilistic gene finders
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC544851/
https://www.ncbi.nlm.nih.gov/pubmed/15613242
http://dx.doi.org/10.1186/1471-2105-5-206
work_keys_str_mv AT majoroswilliamh anempiricalanalysisoftrainingprotocolsforprobabilisticgenefinders
AT salzbergstevenl anempiricalanalysisoftrainingprotocolsforprobabilisticgenefinders
AT majoroswilliamh empiricalanalysisoftrainingprotocolsforprobabilisticgenefinders
AT salzbergstevenl empiricalanalysisoftrainingprotocolsforprobabilisticgenefinders