
Subfamily specific conservation profiles for proteins based on n-gram patterns

BACKGROUND: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of...


Bibliographic Details
Main Authors: Vries, John K, Liu, Xiong
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2267698/
https://www.ncbi.nlm.nih.gov/pubmed/18234090
http://dx.doi.org/10.1186/1471-2105-9-72
author Vries, John K
Liu, Xiong
collection PubMed
description BACKGROUND: A new algorithm has been developed for generating conservation profiles that reflect the evolutionary history of the subfamily associated with a query sequence. It is based on n-gram patterns (NP{n,m}) which are sets of n residues and m wildcards in windows of size n+m. The generation of conservation profiles is treated as a signal-to-noise problem where the signal is the count of n-gram patterns in target sequences that are similar to the query sequence and the noise is the count over all target sequences. The signal is differentiated from the noise by applying singular value decomposition to sets of target sequences rank ordered by similarity with respect to the query. RESULTS: The new algorithm was used to construct 4,248 profiles from 120 randomly selected Pfam-A families. These were compared to profiles generated from multiple alignments using the consensus approach. The two profiles were similar whenever the subfamily associated with the query sequence was well represented in the multiple alignment. It was possible to construct subfamily specific conservation profiles using the new algorithm for subfamilies with as few as five members. The speed of the new algorithm was comparable to the multiple alignment approach. CONCLUSION: Subfamily specific conservation profiles can be generated by the new algorithm without a priori knowledge of family relationships or domain architecture. This is useful when the subfamily contains multiple domains with different levels of representation in protein databases. It may also be applicable when the subfamily sample size is too small for the multiple alignment approach.
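The abstract defines NP{n,m} patterns only as sets of n residues and m wildcards in windows of size n+m. The following Python sketch illustrates one possible reading of that definition. It is not the authors' implementation: the function name `ngram_patterns`, the `.` wildcard character, and the choice to allow wildcards at any position in the window (including the first and last) are assumptions made here for illustration.

```python
from itertools import combinations


def ngram_patterns(sequence, n, m, wildcard="."):
    """Enumerate NP{n,m}-style patterns for a sequence.

    Each pattern keeps n residues and masks m positions with a wildcard,
    inside a sliding window of size n + m. Assumption: every choice of
    n kept positions within the window is allowed.
    """
    width = n + m
    patterns = []
    # Slide a window of size n + m along the sequence.
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Choose which n positions keep their residue; the rest become wildcards.
        for keep in combinations(range(width), n):
            kept = set(keep)
            pattern = "".join(
                residue if i in kept else wildcard
                for i, residue in enumerate(window)
            )
            patterns.append((start, pattern))
    return patterns


if __name__ == "__main__":
    # NP{2,1}: two residues and one wildcard in windows of size three.
    for start, pattern in ngram_patterns("ACDE", n=2, m=1):
        print(start, pattern)
```

Counting how often each such pattern occurs in target sequences similar to the query (the signal) versus in all target sequences (the noise) would be the next step described in the abstract; that step, and the singular value decomposition used to separate the two, are not sketched here because the record gives no further detail.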
format Text
id pubmed-2267698
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2267698 2008-03-18 Subfamily specific conservation profiles for proteins based on n-gram patterns Vries, John K Liu, Xiong BMC Bioinformatics Research Article BioMed Central 2008-01-30 /pmc/articles/PMC2267698/ /pubmed/18234090 http://dx.doi.org/10.1186/1471-2105-9-72 Text en Copyright © 2008 Vries and Liu; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Subfamily specific conservation profiles for proteins based on n-gram patterns
topic Research Article