String kernels for protein sequence comparisons: improved fold recognition
BACKGROUND: The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences...
Main authors: | Nojoomi, Saghi; Koehl, Patrice |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
BioMed Central
2017
|
Subjects: | Methodology Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331664/ https://www.ncbi.nlm.nih.gov/pubmed/28245816 http://dx.doi.org/10.1186/s12859-017-1560-9 |
_version_ | 1782511423626149888 |
---|---|
author | Nojoomi, Saghi Koehl, Patrice |
author_facet | Nojoomi, Saghi Koehl, Patrice |
author_sort | Nojoomi, Saghi |
collection | PubMed |
description | BACKGROUND: The amino acid sequence of a protein is the blueprint from which its structure and, ultimately, function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences begin with strings of letters (amino acids) that represent the sequences, before generating textual alignments between these strings and providing scores for each alignment. When the similarity between the two protein sequences to be compared is low, however, the quality of the corresponding sequence alignment is usually poor, leading to poor performance for the recognition of similarity. RESULTS: In this study, we develop an alignment-free alternative to these methods that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al., Found. Comput. Math., 2013, 14:951-984), we introduce our own version, SeqKernel. Its implementation depends on two parameters: a coefficient that tunes the substitution matrix and the maximum length of k-mers that it includes. We provide an exhaustive analysis of the impacts of these two parameters on the performance of SeqKernel for fold recognition. We show that with the right choice of parameters, use of the SeqKernel similarity measure improves fold recognition compared to the use of traditional alignment-based methods. We illustrate the application of SeqKernel to inferring phylogeny on RNA polymerases and show that it performs as well as methods based on multiple sequence alignments. CONCLUSION: We have presented and characterized a new alignment-free method based on a mathematical kernel for scoring the similarity of protein sequences. We discuss possible improvements of this method, as well as an extension of its applications to other modeling methods that rely on sequence comparison. |
format | Online Article Text |
id | pubmed-5331664 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-5331664 2017-03-03 String kernels for protein sequence comparisons: improved fold recognition Nojoomi, Saghi Koehl, Patrice BMC Bioinformatics Methodology Article BioMed Central 2017-02-28 /pmc/articles/PMC5331664/ /pubmed/28245816 http://dx.doi.org/10.1186/s12859-017-1560-9 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Article Nojoomi, Saghi Koehl, Patrice String kernels for protein sequence comparisons: improved fold recognition |
title | String kernels for protein sequence comparisons: improved fold recognition |
title_sort | string kernels for protein sequence comparisons: improved fold recognition |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331664/ https://www.ncbi.nlm.nih.gov/pubmed/28245816 http://dx.doi.org/10.1186/s12859-017-1560-9 |
work_keys_str_mv | AT nojoomisaghi stringkernelsforproteinsequencecomparisonsimprovedfoldrecognition AT koehlpatrice stringkernelsforproteinsequencecomparisonsimprovedfoldrecognition |
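
The abstract describes a similarity measure controlled by two parameters: a coefficient that tunes the substitution matrix and a maximum k-mer length. The following is a minimal sketch of that general idea, not the authors' SeqKernel implementation. It assumes a toy substitution score (1.0 for identical residues, 0.5 otherwise) in place of a real substitution matrix such as BLOSUM62, and all names (`residue_similarity`, `raw_kernel`, `normalized_kernel`, `beta`, `k_max`, `mismatch`) are hypothetical.

```python
import math


def residue_similarity(a, b, beta=1.0, mismatch=0.5):
    """Toy substitution score raised to the power beta (assumption, not BLOSUM)."""
    base = 1.0 if a == b else mismatch
    return base ** beta


def kmer_similarity(u, v, beta=1.0):
    """Similarity of two equal-length k-mers: product of per-residue scores."""
    score = 1.0
    for a, b in zip(u, v):
        score *= residue_similarity(a, b, beta)
    return score


def raw_kernel(s, t, beta=1.0, k_max=3):
    """Sum k-mer similarities over all k-mer pairs of length 1..k_max."""
    total = 0.0
    for k in range(1, k_max + 1):
        for i in range(len(s) - k + 1):
            for j in range(len(t) - k + 1):
                total += kmer_similarity(s[i:i + k], t[j:j + k], beta)
    return total


def normalized_kernel(s, t, beta=1.0, k_max=3):
    """Scale the raw kernel so that every sequence scores 1.0 against itself."""
    denom = math.sqrt(raw_kernel(s, s, beta, k_max) * raw_kernel(t, t, beta, k_max))
    return raw_kernel(s, t, beta, k_max) / denom


if __name__ == "__main__":
    query = "ACDEFGHIK"
    close = "ACDEFGHIR"    # one substitution relative to query
    distant = "WWWWYYYYW"  # shares no residues with query
    print(normalized_kernel(query, close))
    print(normalized_kernel(query, distant))
```

No alignment is computed at any point: the score comes only from comparing all pairs of short substrings. In this sketch, raising `beta` sharpens the contrast between matching and non-matching residues, and raising `k_max` gives more weight to longer shared stretches. The triple loop is quadratic in sequence length for each k, so it is suitable only for illustration.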