
String kernels for protein sequence comparisons: improved fold recognition

BACKGROUND: The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences...


Bibliographic Details
Main authors: Nojoomi, Saghi; Koehl, Patrice
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331664/
https://www.ncbi.nlm.nih.gov/pubmed/28245816
http://dx.doi.org/10.1186/s12859-017-1560-9
author Nojoomi, Saghi
Koehl, Patrice
author_facet Nojoomi, Saghi
Koehl, Patrice
author_sort Nojoomi, Saghi
collection PubMed
description BACKGROUND: The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences begin with strings of letters (amino acids) that represent the sequences, before generating textual alignments between these strings and providing scores for each alignment. When the similarity between the two protein sequences to be compared is low, however, the quality of the corresponding sequence alignment is usually poor, leading to poor performance for the recognition of similarity. RESULTS: In this study, we develop an alignment-free alternative to these methods that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al., Found. Comput. Math., 2013, 14:951-984), we introduce our own version, SeqKernel. Its implementation depends on two parameters: a coefficient that tunes the substitution matrix and the maximum length of k-mers that it includes. We provide an exhaustive analysis of the impacts of these two parameters on the performance of SeqKernel for fold recognition. We show that with the right choice of parameters, use of the SeqKernel similarity measure improves fold recognition compared to the use of traditional alignment-based methods. We illustrate the application of SeqKernel to inferring phylogeny on RNA polymerases and show that it performs as well as methods based on multiple sequence alignments. CONCLUSION: We have presented and characterized a new alignment-free method based on a mathematical kernel for scoring the similarity of protein sequences. We discuss possible improvements of this method, as well as an extension of its applications to other modeling methods that rely on sequence comparison.
format Online
Article
Text
id pubmed-5331664
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5331664 2017-03-03 String kernels for protein sequence comparisons: improved fold recognition Nojoomi, Saghi Koehl, Patrice BMC Bioinformatics Methodology Article BioMed Central 2017-02-28 /pmc/articles/PMC5331664/ /pubmed/28245816 http://dx.doi.org/10.1186/s12859-017-1560-9 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology Article
Nojoomi, Saghi
Koehl, Patrice
String kernels for protein sequence comparisons: improved fold recognition
title String kernels for protein sequence comparisons: improved fold recognition
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331664/
https://www.ncbi.nlm.nih.gov/pubmed/28245816
http://dx.doi.org/10.1186/s12859-017-1560-9
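
The abstract in this record describes SeqKernel as an alignment-free similarity measure governed by two parameters: a coefficient that tunes the substitution matrix and a maximum k-mer length. The Python sketch below illustrates that general idea only; it is not the authors' implementation. The names `residue_kernel`, `kmer_kernel`, `sequence_kernel`, and `normalized_similarity`, the parameters `beta`, `k_max`, and `mismatch`, and the toy identity-based residue similarity (used here in place of a real substitution matrix) are all assumptions made for illustration.

```python
import math


def residue_kernel(a: str, b: str, beta: float, mismatch: float = 0.2) -> float:
    """Toy residue similarity raised to the power beta.

    Assumption: 1.0 for identical residues and `mismatch` otherwise.
    A real application would look the value up in a substitution matrix.
    """
    base = 1.0 if a == b else mismatch
    return base ** beta


def kmer_kernel(u: str, v: str, beta: float) -> float:
    """Product of residue similarities over the positions of two equal-length k-mers."""
    result = 1.0
    for a, b in zip(u, v):
        result *= residue_kernel(a, b, beta)
    return result


def sequence_kernel(f: str, g: str, beta: float, k_max: int) -> float:
    """Sum of k-mer similarities over all pairs of k-mers with 1 <= k <= k_max."""
    total = 0.0
    for k in range(1, k_max + 1):
        for i in range(len(f) - k + 1):
            for j in range(len(g) - k + 1):
                total += kmer_kernel(f[i:i + k], g[j:j + k], beta)
    return total


def normalized_similarity(f: str, g: str, beta: float = 1.0, k_max: int = 3) -> float:
    """Kernel value scaled so that a non-empty sequence compared with itself scores 1."""
    k_fg = sequence_kernel(f, g, beta, k_max)
    k_ff = sequence_kernel(f, f, beta, k_max)
    k_gg = sequence_kernel(g, g, beta, k_max)
    return k_fg / math.sqrt(k_ff * k_gg)


if __name__ == "__main__":
    # Compare a sequence with a near-identical one and with an unrelated one.
    print(normalized_similarity("ACDEFG", "ACDEFH"))
    print(normalized_similarity("ACDEFG", "WWWWWW"))
```

In this sketch, raising `beta` sharpens the contrast between matching and mismatching residues, and raising `k_max` gives more weight to longer shared stretches. The triple loop costs on the order of `k_max` times the product of the sequence lengths per k-mer length, so it is only suitable for short illustrative inputs; both sequences must be non-empty.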