Protein multiple sequence alignment benchmarking through secondary structure prediction

MOTIVATION: Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Being able to compare different methods has been problematic and has relied on gold sta...

Bibliographic Details
Main Authors:	Le, Quan, Sievers, Fabian, Higgins, Desmond G
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2017
Subjects:	Original Papers
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408826/ https://www.ncbi.nlm.nih.gov/pubmed/28093407 http://dx.doi.org/10.1093/bioinformatics/btw840

_version_	1783232373161197568
author	Le, Quan Sievers, Fabian Higgins, Desmond G
author_facet	Le, Quan Sievers, Fabian Higgins, Desmond G
author_sort	Le, Quan
collection	PubMed
description	MOTIVATION: Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Comparing different methods has been problematic and has relied on gold standard benchmark datasets of ‘true’ alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment, which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on using a small subset of sequences of known structure and do not fairly test the quality of a full MSA. RESULTS: In this paper we describe QuanTest, a fully automated and highly scalable test system for protein MSAs which is based on using secondary structure prediction accuracy (SSPA) to measure alignment quality. This is based on the assumption that better MSAs will give more accurate secondary structure predictions when we include sequences of known structure. However, SSPA measures the quality of an entire alignment, not just the accuracy on a handful of selected sequences. It can be scaled to alignments of any size, but here we demonstrate its use on alignments of either 200 or 1000 sequences. This allows the testing of slow accurate programs as well as faster, less accurate ones. We show that the scores from QuanTest are highly correlated with existing benchmark scores. We also validate the method by comparing a wide range of MSA options and by introducing different levels of misalignment into MSAs and examining the effects on the scores.
AVAILABILITY AND IMPLEMENTATION: QuanTest is available from http://www.bioinf.ucd.ie/download/QuanTest.tgz SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-5408826
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-5408826 2017-05-03 Protein multiple sequence alignment benchmarking through secondary structure prediction Le, Quan Sievers, Fabian Higgins, Desmond G Bioinformatics Original Papers Oxford University Press 2017-05-01 2017-01-16 /pmc/articles/PMC5408826/ /pubmed/28093407 http://dx.doi.org/10.1093/bioinformatics/btw840 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Original Papers Le, Quan Sievers, Fabian Higgins, Desmond G Protein multiple sequence alignment benchmarking through secondary structure prediction
title	Protein multiple sequence alignment benchmarking through secondary structure prediction
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408826/ https://www.ncbi.nlm.nih.gov/pubmed/28093407 http://dx.doi.org/10.1093/bioinformatics/btw840
work_keys_str_mv	AT lequan proteinmultiplesequencealignmentbenchmarkingthroughsecondarystructureprediction AT sieversfabian proteinmultiplesequencealignmentbenchmarkingthroughsecondarystructureprediction AT higginsdesmondg proteinmultiplesequencealignmentbenchmarkingthroughsecondarystructureprediction
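
The abstract describes scoring an alignment by secondary structure prediction accuracy (SSPA) on sequences of known structure. The record does not say how the accuracy is computed, so the following is only a minimal sketch, assuming a per-residue three-state comparison (H, E, C) between a predicted and an observed secondary structure string, averaged over the reference sequences. The function names `q3_accuracy` and `mean_sspa` are hypothetical and are not taken from QuanTest; the predicted strings are assumed to come from an external predictor that was given the alignment as input.

```python
from typing import Iterable, Tuple


def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state equals the observed state.

    Both strings are assumed to use one character per residue
    (for example H = helix, E = strand, C = coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


def mean_sspa(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average per-sequence accuracy over (predicted, observed) pairs.

    Assumption: one pair per reference sequence of known structure.
    """
    scores = [q3_accuracy(pred, obs) for pred, obs in pairs]
    if not scores:
        raise ValueError("at least one reference sequence is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy strings for illustration only; not real predictions.
    example = [("HHHEECCC", "HHHEEECC"), ("EECC", "EECC")]
    print(f"mean accuracy: {mean_sspa(example):.3f}")
```

Under this assumption, a better alignment would be expected to yield predictions that agree with the known structures at more positions, which raises the averaged score.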
