Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding

A major computational challenge in the genomic era is annotating structure/function to the vast quantities of sequence information that is now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously t...

Bibliographic Details
Main Authors:	Hong, Yoojin; Kang, Jaewoo; Lee, Dongwon; van Rossum, Damian B.
Format:	Text
Language:	English
Published:	Public Library of Science 2010
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2962639/ https://www.ncbi.nlm.nih.gov/pubmed/21042584 http://dx.doi.org/10.1371/journal.pone.0013596

_version_	1782189198343667712
author	Hong, Yoojin Kang, Jaewoo Lee, Dongwon van Rossum, Damian B.
author_facet	Hong, Yoojin Kang, Jaewoo Lee, Dongwon van Rossum, Damian B.
author_sort	Hong, Yoojin
collection	PubMed
description	A major computational challenge in the genomic era is annotating structure/function to the vast quantities of sequence information that is now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply “alignment profiles” hereafter) provide a quantitative method that is capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physio-chemical information, genomic information, etc.). Indeed, we have demonstrated that the Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, and remains so even in the “twilight zone” of sequence similarity (<25% identity) [1]–[5]. Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded AND unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named “Adaptive GDDA-BLAST.” Adaptive GDDA-BLAST is, on average, up to 19 times faster than our previous method, with similar sensitivity. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences. We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles.
format	Text
id	pubmed-2962639
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-29626392010-11-01 Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding Hong, Yoojin Kang, Jaewoo Lee, Dongwon van Rossum, Damian B. PLoS One Research Article A major computational challenge in the genomic era is annotating structure/function to the vast quantities of sequence information that is now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply “alignment profiles” hereafter) provide a quantitative method that is capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physio-chemical information, genomic information, etc.). Indeed, we have demonstrated that the Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments is informative, and remains so even in the “twilight zone” of sequence similarity (<25% identity) [1]–[5]. Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded AND unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named “Adaptive GDDA-BLAST.” Adaptive GDDA-BLAST is, on average, up to 19 times faster than, but has similar sensitivity to our previous method. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. 
Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences. We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles. Public Library of Science 2010-10-22 /pmc/articles/PMC2962639/ /pubmed/21042584 http://dx.doi.org/10.1371/journal.pone.0013596 Text en Hong et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Hong, Yoojin Kang, Jaewoo Lee, Dongwon van Rossum, Damian B. Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding
title	Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding
title_full	Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding
title_fullStr	Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding
title_full_unstemmed	Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding
title_short	Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding
title_sort	adaptive gdda-blast: fast and efficient algorithm for protein sequence embedding
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2962639/ https://www.ncbi.nlm.nih.gov/pubmed/21042584 http://dx.doi.org/10.1371/journal.pone.0013596
work_keys_str_mv	AT hongyoojin adaptivegddablastfastandefficientalgorithmforproteinsequenceembedding AT kangjaewoo adaptivegddablastfastandefficientalgorithmforproteinsequenceembedding AT leedongwon adaptivegddablastfastandefficientalgorithmforproteinsequenceembedding AT vanrossumdamianb adaptivegddablastfastandefficientalgorithmforproteinsequenceembedding
