
PSI-BLAST pseudocounts and the minimum description length principle

Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino a...


Bibliographic Details
Main Authors: Altschul, Stephen F., Gertz, E. Michael, Agarwala, Richa, Schäffer, Alejandro A., Yu, Yi-Kuo
Format: Text
Language: English
Published: Oxford University Press 2009
Subjects: Computational Biology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2647318/
https://www.ncbi.nlm.nih.gov/pubmed/19088134
http://dx.doi.org/10.1093/nar/gkn981
author Altschul, Stephen F.
Gertz, E. Michael
Agarwala, Richa
Schäffer, Alejandro A.
Yu, Yi-Kuo
collection PubMed
description Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
format Text
id pubmed-2647318
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-26473182009-03-04 PSI-BLAST pseudocounts and the minimum description length principle Altschul, Stephen F. Gertz, E. Michael Agarwala, Richa Schäffer, Alejandro A. Yu, Yi-Kuo Nucleic Acids Res Computational Biology Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default. Oxford University Press 2009-02 2008-12-16 /pmc/articles/PMC2647318/ /pubmed/19088134 http://dx.doi.org/10.1093/nar/gkn981 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title PSI-BLAST pseudocounts and the minimum description length principle
topic Computational Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2647318/
https://www.ncbi.nlm.nih.gov/pubmed/19088134
http://dx.doi.org/10.1093/nar/gkn981