A benchmark study of sequence alignment methods for protein clustering

BACKGROUND: Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of...

Bibliographic Details
Main authors: Wang, Yingying, Wu, Hongyan, Cai, Yunpeng
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311937/
https://www.ncbi.nlm.nih.gov/pubmed/30598070
http://dx.doi.org/10.1186/s12859-018-2524-4
author Wang, Yingying
Wu, Hongyan
Cai, Yunpeng
collection PubMed
description BACKGROUND: Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and it evaluates alignment quality by how well the biological goal is achieved rather than by sequence-level comparison alone, which avoids the biases introduced by alignment scores or manual alignment templates. Unlike former studies, we calculate the cluster validity score from sequence distances instead of clustering results. This strategy can avoid the influence of the choice of clustering method and thus makes the results more dependable. RESULTS: PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses of the 80 re-sampled benchmark datasets, constructed by randomly choosing 90% of each dataset 10 times, showed similar results. CONCLUSIONS: These results confirm that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of results. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users.
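The abstract's central idea, scoring cluster validity from sequence distances and known group labels without running a clustering algorithm, can be illustrated with a small sketch. The abstract does not name the validity index the authors used, so the mean silhouette width below is an assumption chosen for illustration, not the paper's method. The function names, the toy distance matrix, and the labels are all hypothetical; in practice the distances would be derived from PSA or MSA output. The second function mimics the re-sampling step (random 90% subsets, 10 repeats) described in the RESULTS.

```python
import random


def silhouette_from_distances(dist, labels):
    """Mean silhouette width from a pairwise distance matrix and reference
    group labels. No clustering step is involved. (Assumed index; the
    abstract does not say which validity score the authors used.)"""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    if len(groups) < 2:
        raise ValueError("need at least two reference groups")

    total = 0.0
    for i, lab_i in enumerate(labels):
        own = groups[lab_i]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the same group
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other group
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in groups.items()
            if lab != lab_i
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


def resampled_scores(dist, labels, fraction=0.9, repeats=10, seed=0):
    """Score random subsets of the sequences (default: 90%, 10 repeats)."""
    rng = random.Random(seed)
    n = len(labels)
    k = min(n, max(2, round(fraction * n)))
    scores = []
    for _ in range(repeats):
        keep = sorted(rng.sample(range(n), k))
        sub_dist = [[dist[i][j] for j in keep] for i in keep]
        sub_labels = [labels[i] for i in keep]
        scores.append(silhouette_from_distances(sub_dist, sub_labels))
    return scores


# Hypothetical toy data: two families of three sequences each, with
# within-family distance 0.1 and between-family distance 0.9.
toy_labels = ["A", "A", "A", "B", "B", "B"]
toy_dist = [
    [0.0 if i == j else (0.1 if toy_labels[i] == toy_labels[j] else 0.9)
     for j in range(6)]
    for i in range(6)
]

full_score = silhouette_from_distances(toy_dist, toy_labels)
subset_scores = resampled_scores(toy_dist, toy_labels)
print(round(full_score, 4), len(subset_scores))
```

Under this sketch, two alignment methods could be compared by building one distance matrix per method for the same labeled sequences and checking which yields the higher score. A random subset that loses an entire reference group would raise `ValueError`, so a real benchmark would need to handle that case.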
format Online
Article
Text
id pubmed-6311937
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6311937 2019-01-07 A benchmark study of sequence alignment methods for protein clustering. Wang, Yingying; Wu, Hongyan; Cai, Yunpeng. BMC Bioinformatics. Research.
BioMed Central 2018-12-31 /pmc/articles/PMC6311937/ /pubmed/30598070 http://dx.doi.org/10.1186/s12859-018-2524-4 Text en © The Author(s). 2018 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title A benchmark study of sequence alignment methods for protein clustering
topic Research