HMM-ModE – Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences

BACKGROUND: Profile Hidden Markov Models (HMM) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments and have been used in identifying remote homologues with considerable success. These conservation patterns arise from fold specific...

Bibliographic Details
Main authors: Srivastava, Prashant K; Desai, Dhwani K; Nandi, Soumyadeep; Lynn, Andrew M
Format: Text
Language: English
Published: BioMed Central 2007
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1852395/
https://www.ncbi.nlm.nih.gov/pubmed/17389042
http://dx.doi.org/10.1186/1471-2105-8-104
author Srivastava, Prashant K
Desai, Dhwani K
Nandi, Soumyadeep
Lynn, Andrew M
collection PubMed
description BACKGROUND: Profile hidden Markov models (HMMs) are statistical representations of protein families derived from patterns of sequence conservation in multiple alignments, and they have been used with considerable success to identify remote homologues. These conservation patterns arise from fold-specific signals, shared across multiple families, and function-specific signals unique to each family. The availability of sequences pre-classified according to their function permits the use of negative training sequences to improve the specificity of the HMM, both by optimising the threshold cutoff and by modifying emission probabilities to minimise the influence of fold-specific signals. A protocol to generate family-specific HMMs is described that first constructs a profile HMM from an alignment of the family's sequences and then uses this model to identify sequences belonging to other classes that score above the default threshold (false positives). Ten-fold cross-validation is used to optimise the discrimination threshold score for the model. The advent of fast multiple alignment methods enables the use of the profile alignments to align the true and false positive sequences, and the resulting alignments are used to modify the emission probabilities in the original model.
RESULTS: The protocol, called HMM-ModE, was validated on a set of sequences belonging to six sub-families of the AGC family of kinases. These sequences have an average sequence similarity of 63% within the group, although each sub-group has a different substrate specificity. Optimising the discrimination threshold using negative sequences scored against the model improves specificity in test cases from an average of 21% to 98%. Modifying the model's emission probabilities using negative training sequences provides further discrimination in a few cases, raising the average specificity to 99%. Similar improvements were obtained with a sample of G-protein-coupled receptors sub-classified with respect to their substrate specificity, even though the average sequence identity across the sub-families is just 20.6%. The protocol is applied in a high-throughput classification exercise on protein kinases.
CONCLUSION: The protocol has the potential to maximise the contributions of discriminating residues to classify proteins based on their molecular function, using pre-classified positive and negative sequence training data. The high specificity of the method and the increasing availability of pre-classified sequence data hold potential for its application in sequence annotation.
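The threshold-optimisation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the example scores, and the use of the Matthews correlation coefficient (MCC) as the selection criterion are assumptions made for illustration, and specificity is computed here with the conventional definition TN / (TN + FP), which may differ from the definition used in the article. The ten-fold cross-validation loop is omitted for brevity.

```python
import math

def confusion(pos_scores, neg_scores, threshold):
    """Count TP, FN, FP, TN when sequences scoring >= threshold are accepted."""
    tp = sum(s >= threshold for s in pos_scores)
    fn = len(pos_scores) - tp
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tp, fn, fp, tn

def mcc(tp, fn, fp, tn):
    """Matthews correlation coefficient (assumed criterion, not from the record)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def specificity(neg_scores, threshold):
    """Conventional specificity TN / (TN + FP) over the negative sequences."""
    fp = sum(s >= threshold for s in neg_scores)
    tn = len(neg_scores) - fp
    return tn / (tn + fp) if (tn + fp) else 0.0

def best_threshold(pos_scores, neg_scores):
    """Try every observed score as a cutoff and keep the one with the highest MCC."""
    candidates = sorted(set(pos_scores) | set(neg_scores))
    return max(candidates, key=lambda t: mcc(*confusion(pos_scores, neg_scores, t)))

# Hypothetical bit scores: family members (positives) and other classes (negatives).
pos = [120.0, 95.5, 88.0, 140.2]
neg = [60.0, 85.0, 30.5, 72.3]
default_cutoff = 50.0  # hypothetical permissive default threshold

optimised = best_threshold(pos, neg)
print(specificity(neg, default_cutoff), specificity(neg, optimised))
```

With these made-up scores, three of the four negatives pass the permissive default cutoff, while the optimised cutoff rejects all of them without losing any positive, which mirrors the kind of specificity gain the abstract reports.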
format Text
id pubmed-1852395
institution National Center for Biotechnology Information
language English
publishDate 2007
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1852395 2007-04-18 BMC Bioinformatics Methodology Article BioMed Central 2007-03-27 /pmc/articles/PMC1852395/ /pubmed/17389042 http://dx.doi.org/10.1186/1471-2105-8-104 Text en Copyright © 2007 Srivastava et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title HMM-ModE – Improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences
topic Methodology Article