
MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences

BACKGROUND: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suf...


Bibliographic Details
Main Authors:	Corel, Eduardo; Pitschi, Florian; Laprevotte, Ivan; Grasseau, Gilles; Didier, Gilles; Devauchelle, Claudine
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2923138/ https://www.ncbi.nlm.nih.gov/pubmed/20673356 http://dx.doi.org/10.1186/1471-2105-11-406

_version_	1782185482790109184
author	Corel, Eduardo; Pitschi, Florian; Laprevotte, Ivan; Grasseau, Gilles; Didier, Gilles; Devauchelle, Claudine
author_facet	Corel, Eduardo; Pitschi, Florian; Laprevotte, Ivan; Grasseau, Gilles; Didier, Gilles; Devauchelle, Claudine
author_sort	Corel, Eduardo
collection	PubMed
description	BACKGROUND: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. RESULTS: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity κ of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). CONCLUSIONS: The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter κ of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available.
format	Text
id	pubmed-2923138
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2923138 2010-08-18 MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences Corel, Eduardo Pitschi, Florian Laprevotte, Ivan Grasseau, Gilles Didier, Gilles Devauchelle, Claudine BMC Bioinformatics Research Article BACKGROUND: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. RESULTS: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity κ of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). CONCLUSIONS: The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter κ of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available. BioMed Central 2010-07-30 /pmc/articles/PMC2923138/ /pubmed/20673356 http://dx.doi.org/10.1186/1471-2105-11-406 Text en Copyright ©2010 Corel et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Corel, Eduardo Pitschi, Florian Laprevotte, Ivan Grasseau, Gilles Didier, Gilles Devauchelle, Claudine MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
title	MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
title_sort	ms4 - multi-scale selector of sequence signatures: an alignment-free method for classification of biological sequences
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2923138/ https://www.ncbi.nlm.nih.gov/pubmed/20673356 http://dx.doi.org/10.1186/1471-2105-11-406
