
String kernels for protein sequence comparisons: improved fold recognition

BACKGROUND: The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences...

Bibliographic Details
Main authors:	Nojoomi, Saghi, Koehl, Patrice
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2017
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331664/ https://www.ncbi.nlm.nih.gov/pubmed/28245816 http://dx.doi.org/10.1186/s12859-017-1560-9

_version_	1782511423626149888
author	Nojoomi, Saghi Koehl, Patrice
author_facet	Nojoomi, Saghi Koehl, Patrice
author_sort	Nojoomi, Saghi
collection	PubMed
description	BACKGROUND: The amino acid sequence of a protein is the blueprint from which its structure and ultimately function can be derived. Therefore, sequence comparison methods remain essential for the determination of similarity between proteins. Traditional approaches for comparing two protein sequences begin with strings of letters (amino acids) that represent the sequences, before generating textual alignments between these strings and providing scores for each alignment. When the similarity between the two protein sequences to be compared is low, however, the quality of the corresponding sequence alignment is usually poor, leading to poor performance for the recognition of similarity. RESULTS: In this study, we develop an alignment-free alternative to these methods that is based on the concept of string kernels. Starting from recently proposed kernels on the discrete space of protein sequences (Shen et al., Found. Comput. Math., 2013, 14:951-984), we introduce our own version, SeqKernel. Its implementation depends on two parameters, a coefficient that tunes the substitution matrix and the maximum length of k-mers that it includes. We provide an exhaustive analysis of the impacts of these two parameters on the performance of SeqKernel for fold recognition. We show that with the right choice of parameters, use of the SeqKernel similarity measure improves fold recognition compared to the use of traditional alignment-based methods. We illustrate the application of SeqKernel to inferring phylogeny on RNA polymerases and show that it performs as well as methods based on multiple sequence alignments. CONCLUSION: We have presented and characterized a new alignment-free method based on a mathematical kernel for scoring the similarity of protein sequences. We discuss possible improvements of this method, as well as an extension of its applications to other modeling methods that rely on sequence comparison.
format	Online Article Text
id	pubmed-5331664
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-5331664 2017-03-03 String kernels for protein sequence comparisons: improved fold recognition Nojoomi, Saghi Koehl, Patrice BMC Bioinformatics Methodology Article BioMed Central 2017-02-28 /pmc/articles/PMC5331664/ /pubmed/28245816 http://dx.doi.org/10.1186/s12859-017-1560-9 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Article Nojoomi, Saghi Koehl, Patrice String kernels for protein sequence comparisons: improved fold recognition
title	String kernels for protein sequence comparisons: improved fold recognition
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5331664/ https://www.ncbi.nlm.nih.gov/pubmed/28245816 http://dx.doi.org/10.1186/s12859-017-1560-9
work_keys_str_mv	AT nojoomisaghi stringkernelsforproteinsequencecomparisonsimprovedfoldrecognition AT koehlpatrice stringkernelsforproteinsequencecomparisonsimprovedfoldrecognition
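
The abstract describes a similarity measure controlled by two parameters: a coefficient that tunes the substitution matrix and a maximum k-mer length. The sketch below illustrates that general idea with a generic k-mer string kernel; it is not the authors' SeqKernel implementation. The function names, the parameter names `beta` and `k_max`, and the toy residue scores (1.0 for identical residues, 0.5 otherwise) are all assumptions made for illustration; a real application would presumably use an actual substitution matrix.

```python
from math import sqrt


def residue_similarity(a: str, b: str, beta: float) -> float:
    """Toy stand-in for a substitution-matrix entry raised to the power beta.

    Assumption: identical residues score 1.0, different residues 0.5.
    """
    base = 1.0 if a == b else 0.5
    return base ** beta


def kmer_similarity(u: str, v: str, beta: float) -> float:
    """Product of per-position residue similarities for two equal-length k-mers."""
    score = 1.0
    for a, b in zip(u, v):
        score *= residue_similarity(a, b, beta)
    return score


def raw_kernel(s: str, t: str, beta: float, k_max: int) -> float:
    """Sum k-mer similarities over all pairs of k-mers with 1 <= k <= k_max."""
    total = 0.0
    for k in range(1, k_max + 1):
        for i in range(len(s) - k + 1):
            for j in range(len(t) - k + 1):
                total += kmer_similarity(s[i:i + k], t[j:j + k], beta)
    return total


def normalized_kernel(s: str, t: str, beta: float = 1.0, k_max: int = 3) -> float:
    """Scale the raw kernel so that every non-empty sequence scores 1.0 with itself."""
    self_s = raw_kernel(s, s, beta, k_max)
    self_t = raw_kernel(t, t, beta, k_max)
    return raw_kernel(s, t, beta, k_max) / sqrt(self_s * self_t)


if __name__ == "__main__":
    # Hypothetical short sequences, chosen only to exercise the functions.
    print(normalized_kernel("ACDEFG", "ACDKFG", beta=1.0, k_max=3))
```

No alignment is computed at any point: the score comes only from comparing all pairs of short substrings, which is what makes this kind of measure alignment-free. Both sequences are assumed to be non-empty, since the normalization divides by each sequence's self-similarity.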

Similar Items