DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment
BACKGROUND: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish...
Main author: | Wright, Erik S. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595117/ https://www.ncbi.nlm.nih.gov/pubmed/26445311 http://dx.doi.org/10.1186/s12859-015-0749-z |
_version_ | 1782393541146705920 |
---|---|
author | Wright, Erik S. |
collection | PubMed |
description | BACKGROUND: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. RESULTS: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. CONCLUSIONS: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0749-z) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4595117 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4595117 2015-10-07 DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment Wright, Erik S. BMC Bioinformatics Research Article BACKGROUND: Alignment of large and diverse sequence sets is a common task in biological investigations, yet there remains considerable room for improvement in alignment quality. Multiple sequence alignment programs tend to reach maximal accuracy when aligning only a few sequences, and then diminish steadily as more sequences are added. This drop in accuracy can be partly attributed to a build-up of error and ambiguity as more sequences are aligned. Most high-throughput sequence alignment algorithms do not use contextual information under the assumption that sites are independent. This study examines the extent to which local sequence context can be exploited to improve the quality of large multiple sequence alignments. RESULTS: Two predictors based on local sequence context were assessed: (i) single sequence secondary structure predictions, and (ii) modulation of gap costs according to the surrounding residues. The results indicate that context-based predictors have appreciable information content that can be utilized to create more accurate alignments. Furthermore, local context becomes more informative as the number of sequences increases, enabling more accurate protein alignments of large empirical benchmarks. These discoveries became the basis for DECIPHER, a new context-aware program for sequence alignment, which outperformed other programs on large sequence sets. CONCLUSIONS: Predicting secondary structure based on local sequence context is an efficient means of breaking the independence assumption in alignment. Since secondary structure is more conserved than primary sequence, it can be leveraged to improve the alignment of distantly related proteins. Moreover, secondary structure predictions increase in accuracy as more sequences are used in the prediction. This enables the scalable generation of large sequence alignments that maintain high accuracy even on diverse sequence sets. The DECIPHER R package and source code are freely available for download at DECIPHER.cee.wisc.edu and from the Bioconductor repository. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0749-z) contains supplementary material, which is available to authorized users. BioMed Central 2015-10-06 /pmc/articles/PMC4595117/ /pubmed/26445311 http://dx.doi.org/10.1186/s12859-015-0749-z Text en © Wright. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Article Wright, Erik S. DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment |
title | DECIPHER: harnessing local sequence context to improve protein multiple sequence alignment |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4595117/ https://www.ncbi.nlm.nih.gov/pubmed/26445311 http://dx.doi.org/10.1186/s12859-015-0749-z |