
HMM-ModE – Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences

BACKGROUND: Profile Hidden Markov Models (HMM) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments and have been used in identifying remote homologues with considerable success. These conservation patterns arise from fold specific...


Bibliographic Details
Main Authors:	Srivastava, Prashant K, Desai, Dhwani K, Nandi, Soumyadeep, Lynn, Andrew M
Format:	Text
Language:	English
Published:	BioMed Central 2007
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852395/ https://www.ncbi.nlm.nih.gov/pubmed/17389042 http://dx.doi.org/10.1186/1471-2105-8-104

_version_	1782133046687825920
author	Srivastava, Prashant K Desai, Dhwani K Nandi, Soumyadeep Lynn, Andrew M
author_sort	Srivastava, Prashant K
collection	PubMed
description	BACKGROUND: Profile Hidden Markov Models (HMM) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments and have been used in identifying remote homologues with considerable success. These conservation patterns arise from fold-specific signals, shared across multiple families, and function-specific signals unique to the families. The availability of sequences pre-classified according to their function permits the use of negative training sequences to improve the specificity of the HMM, both by optimising the threshold cutoff and by modifying emission probabilities to minimise the influence of fold-specific signals. A protocol to generate family-specific HMMs is described that first constructs a profile HMM from an alignment of the family's sequences and then uses this model to identify sequences belonging to other classes that score above the default threshold (false positives). Ten-fold cross-validation is used to optimise the discrimination threshold score for the model. The advent of fast multiple alignment methods enables the use of the profile alignments to align the true and false positive sequences, and the resulting alignments are used to modify the emission probabilities in the original model. RESULTS: The protocol, called HMM-ModE, was validated on a set of sequences belonging to six sub-families of the AGC family of kinases. These sequences have an average sequence similarity of 63% among the group, though each sub-group has a different substrate specificity. Optimising the discrimination threshold using negative sequences scored against the model improves specificity in test cases from an average of 21% to 98%. Modifying the model's emission probabilities using negative training sequences provides further discrimination in a few cases, raising the average specificity to 99%. Similar improvements were obtained with a sample of G-protein-coupled receptors sub-classified with respect to their substrate specificity, though the average sequence identity across the sub-families is just 20.6%. The protocol is applied in a high-throughput classification exercise on protein kinases. CONCLUSION: The protocol has the potential to maximise the contributions of discriminating residues to classify proteins based on their molecular function, using pre-classified positive and negative sequence training data. The high specificity of the method and the increasing availability of pre-classified sequence data hold potential for its application in sequence annotation.
format	Text
id	pubmed-1852395
institution	National Center for Biotechnology Information
language	English
publishDate	2007
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-1852395 2007-04-18 BMC Bioinformatics Methodology Article BioMed Central 2007-03-27 /pmc/articles/PMC1852395/ /pubmed/17389042 http://dx.doi.org/10.1186/1471-2105-8-104 Text en Copyright © 2007 Srivastava et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	HMM-ModE – Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852395/ https://www.ncbi.nlm.nih.gov/pubmed/17389042 http://dx.doi.org/10.1186/1471-2105-8-104
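
The threshold optimisation step summarised in the abstract (scoring positive and negative sequences against a model, then choosing a cutoff by ten-fold cross-validation) might be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation. It makes several assumptions that the record does not confirm: the scores are treated as already computed and fixed (a real run would presumably rebuild the HMM for each fold), the cutoff is chosen by maximising the Matthews correlation coefficient, specificity is taken as TN / (TN + FP), and the per-fold cutoffs are simply averaged. All function names and the example scores are hypothetical.

```python
import math
import random
from statistics import mean


def confusion(pos, neg, threshold):
    """Count TP, FN, FP, TN when scores >= threshold are called family members."""
    tp = sum(s >= threshold for s in pos)
    fp = sum(s >= threshold for s in neg)
    return tp, len(pos) - tp, fp, len(neg) - fp


def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(pos, neg):
    """Assumed criterion: the observed score with the highest MCC (lowest on ties)."""
    candidates = sorted(set(pos) | set(neg))
    return max(candidates, key=lambda t: mcc(*confusion(pos, neg, t)))


def split_folds(items, k, rng):
    """Shuffle a copy of the scores and deal them into k folds."""
    items = list(items)
    rng.shuffle(items)
    return [items[i::k] for i in range(k)]


def cross_validated_threshold(pos, neg, k=10, seed=0):
    """Return (mean cutoff, mean held-out specificity) over k folds.

    Assumes at least two positive and two negative scores, so that every
    training split is non-empty.
    """
    rng = random.Random(seed)
    pos_folds = split_folds(pos, k, rng)
    neg_folds = split_folds(neg, k, rng)
    thresholds, specificities = [], []
    for i in range(k):
        train_pos = [s for j, f in enumerate(pos_folds) if j != i for s in f]
        train_neg = [s for j, f in enumerate(neg_folds) if j != i for s in f]
        t = best_threshold(train_pos, train_neg)
        thresholds.append(t)
        # Evaluate the cutoff on the held-out fold only.
        _, _, fp, tn = confusion(pos_folds[i], neg_folds[i], t)
        if fp + tn:  # skip folds that received no negative sequences
            specificities.append(tn / (fp + tn))
    return mean(thresholds), mean(specificities)


if __name__ == "__main__":
    # Made-up bit scores, for illustration only.
    rng = random.Random(1)
    family = [rng.gauss(300, 40) for _ in range(60)]
    other_families = [rng.gauss(180, 40) for _ in range(200)]
    cutoff, specificity = cross_validated_threshold(family, other_families)
    print(f"cutoff={cutoff:.1f} specificity={specificity:.3f}")
```

Averaging the per-fold cutoffs is one simple way to arrive at a single threshold; other choices, such as taking the median or refitting on all of the data, would fit the same outline.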


Similar Items