A benchmark study of sequence alignment methods for protein clustering
BACKGROUND: Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of...
Main authors: | Wang, Yingying; Wu, Hongyan; Cai, Yunpeng |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311937/ https://www.ncbi.nlm.nih.gov/pubmed/30598070 http://dx.doi.org/10.1186/s12859-018-2524-4 |
_version_ | 1783383705607208960 |
---|---|
author | Wang, Yingying Wu, Hongyan Cai, Yunpeng |
author_facet | Wang, Yingying Wu, Hongyan Cai, Yunpeng |
author_sort | Wang, Yingying |
collection | PubMed |
description | BACKGROUND: Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This new framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and evaluates the alignment quality according to the achievement of the biological goal rather than by comparison at the sequence level only, which averts the biases introduced by alignment scores or manual alignment templates. Compared with former studies, we calculate the cluster validity score based on sequence distances instead of clustering results. This strategy avoids the influence of different clustering methods and thus makes the results more dependable. RESULTS: Results showed that PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses on the 80 re-sampled benchmark datasets constructed by randomly choosing 90% of each dataset 10 times showed similar results. CONCLUSIONS: These results confirm that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of results. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6311937 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6311937 2019-01-07 A benchmark study of sequence alignment methods for protein clustering Wang, Yingying Wu, Hongyan Cai, Yunpeng BMC Bioinformatics Research BACKGROUND: Protein sequence alignment analyses have become a crucial step for many bioinformatics studies during the past decades. Multiple sequence alignment (MSA) and pair-wise sequence alignment (PSA) are two major approaches in sequence alignment. Former benchmark studies revealed drawbacks of MSA methods on nucleotide sequence alignments. To test whether similar drawbacks also influence protein sequence alignment analyses, we propose a new benchmark framework for protein clustering based on cluster validity. This new framework directly reflects the biological ground truth of the application scenarios that adopt sequence alignments, and evaluates the alignment quality according to the achievement of the biological goal rather than by comparison at the sequence level only, which averts the biases introduced by alignment scores or manual alignment templates. Compared with former studies, we calculate the cluster validity score based on sequence distances instead of clustering results. This strategy avoids the influence of different clustering methods and thus makes the results more dependable. RESULTS: Results showed that PSA methods performed better than MSA methods on most of the BAliBASE benchmark datasets. Analyses on the 80 re-sampled benchmark datasets constructed by randomly choosing 90% of each dataset 10 times showed similar results. CONCLUSIONS: These results confirm that the drawbacks of MSA methods revealed at the nucleotide level also exist in protein sequence alignment analyses and affect the accuracy of results. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2524-4) contains supplementary material, which is available to authorized users.
BioMed Central 2018-12-31 /pmc/articles/PMC6311937/ /pubmed/30598070 http://dx.doi.org/10.1186/s12859-018-2524-4 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Wang, Yingying Wu, Hongyan Cai, Yunpeng A benchmark study of sequence alignment methods for protein clustering |
title | A benchmark study of sequence alignment methods for protein clustering |
title_full | A benchmark study of sequence alignment methods for protein clustering |
title_fullStr | A benchmark study of sequence alignment methods for protein clustering |
title_full_unstemmed | A benchmark study of sequence alignment methods for protein clustering |
title_short | A benchmark study of sequence alignment methods for protein clustering |
title_sort | benchmark study of sequence alignment methods for protein clustering |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6311937/ https://www.ncbi.nlm.nih.gov/pubmed/30598070 http://dx.doi.org/10.1186/s12859-018-2524-4 |
work_keys_str_mv | AT wangyingying abenchmarkstudyofsequencealignmentmethodsforproteinclustering AT wuhongyan abenchmarkstudyofsequencealignmentmethodsforproteinclustering AT caiyunpeng abenchmarkstudyofsequencealignmentmethodsforproteinclustering AT wangyingying benchmarkstudyofsequencealignmentmethodsforproteinclustering AT wuhongyan benchmarkstudyofsequencealignmentmethodsforproteinclustering AT caiyunpeng benchmarkstudyofsequencealignmentmethodsforproteinclustering |
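
The description in this record says that cluster validity is scored from sequence distances rather than from the output of a clustering algorithm, and that each benchmark dataset is re-sampled by drawing 90% of its sequences 10 times. The record does not name the validity index the authors used, so the sketch below substitutes the mean silhouette width as a stand-in. The function names, the toy distance matrix, and the family labels are all hypothetical and are not taken from the article.

```python
import random
from statistics import mean


def silhouette_from_distances(dist, labels):
    """Mean silhouette width computed directly from a pairwise distance matrix.

    dist   -- square, symmetric list of lists; dist[i][j] is an
              alignment-derived distance between sequences i and j
              (hypothetical input, not the article's data)
    labels -- reference family label of each sequence (the ground truth)
    """
    n = len(labels)
    scores = []
    for i in range(n):
        # Group the distances from sequence i to all other sequences by family.
        by_label = {}
        for j in range(n):
            if j != i:
                by_label.setdefault(labels[j], []).append(dist[i][j])
        own = by_label.pop(labels[i], None)
        if not own or not by_label:
            # Singleton family, or no other family present: scored as 0 here.
            scores.append(0.0)
            continue
        a = mean(own)                                # mean distance within own family
        b = min(mean(d) for d in by_label.values())  # mean distance to nearest other family
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


def resample_indices(n, fraction=0.9, repeats=10, seed=0):
    """Draw `repeats` random subsets, each holding `fraction` of n items."""
    rng = random.Random(seed)
    k = max(1, round(n * fraction))
    return [sorted(rng.sample(range(n), k)) for _ in range(repeats)]


# Toy example: four sequences in two families, with made-up distances.
toy_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
toy_labels = ["A", "A", "B", "B"]
toy_score = silhouette_from_distances(toy_dist, toy_labels)
```

Under this assumption, two alignment methods could be compared by computing the score for each method's distance matrix against the same reference labels; a higher score means the distances separate the reference families more cleanly. No clustering algorithm is run at any point, which is the property the description emphasizes.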