MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences
BACKGROUND: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suf...
Main Authors: | Corel, Eduardo; Pitschi, Florian; Laprevotte, Ivan; Grasseau, Gilles; Didier, Gilles; Devauchelle, Claudine |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2010 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2923138/ https://www.ncbi.nlm.nih.gov/pubmed/20673356 http://dx.doi.org/10.1186/1471-2105-11-406 |
_version_ | 1782185482790109184 |
---|---|
author | Corel, Eduardo Pitschi, Florian Laprevotte, Ivan Grasseau, Gilles Didier, Gilles Devauchelle, Claudine |
author_facet | Corel, Eduardo Pitschi, Florian Laprevotte, Ivan Grasseau, Gilles Didier, Gilles Devauchelle, Claudine |
author_sort | Corel, Eduardo |
collection | PubMed |
description | BACKGROUND: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. RESULTS: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity κ of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). CONCLUSIONS: The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter κ of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available. |
format | Text |
id | pubmed-2923138 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2923138 2010-08-18 MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences Corel, Eduardo Pitschi, Florian Laprevotte, Ivan Grasseau, Gilles Didier, Gilles Devauchelle, Claudine BMC Bioinformatics Research Article BACKGROUND: While multiple alignment is the first step of usual classification schemes for biological sequences, alignment-free methods are being increasingly used as alternatives when multiple alignments fail. Subword-based combinatorial methods are popular for their low algorithmic complexity (suffix trees ...) or exhaustivity (motif search), in general with fixed length word and/or number of mismatches. We developed previously a method to detect local similarities (the N-local decoding) based on the occurrences of repeated subwords of fixed length, which does not impose a fixed number of mismatches. The resulting similarities are, for some "good" values of N, sufficiently relevant to form the basis of a reliable alignment-free classification. The aim of this paper is to develop a method that uses the similarities detected by N-local decoding while not imposing a fixed value of N. We present a procedure that selects for every position in the sequences an adaptive value of N, and we implement it as the MS4 classification tool. RESULTS: Among the equivalence classes produced by the N-local decodings for all N, we select a (relatively) small number of "relevant" classes corresponding to variable length subwords that carry enough information to perform the classification. The parameter N, for which correct values are data-dependent and thus hard to guess, is here replaced by the average repetitivity κ of the sequences. We show that our approach yields classifications of several sets of HIV/SIV sequences that agree with the accepted taxonomy, even on usually discarded repetitive regions (like the non-coding part of LTR). CONCLUSIONS: The method MS4 satisfactorily classifies a set of sequences that are notoriously hard to align. This suggests that our approach forms the basis of a reliable alignment-free classification tool. The only parameter κ of MS4 seems to give reasonable results even for its default value, which can be a great advantage for sequence sets for which little information is available. BioMed Central 2010-07-30 /pmc/articles/PMC2923138/ /pubmed/20673356 http://dx.doi.org/10.1186/1471-2105-11-406 Text en Copyright ©2010 Corel et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Corel, Eduardo Pitschi, Florian Laprevotte, Ivan Grasseau, Gilles Didier, Gilles Devauchelle, Claudine MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences |
title | MS4 - Multi-Scale Selector of Sequence Signatures: An alignment-free method for classification of biological sequences |
title_sort | ms4 - multi-scale selector of sequence signatures: an alignment-free method for classification of biological sequences |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2923138/ https://www.ncbi.nlm.nih.gov/pubmed/20673356 http://dx.doi.org/10.1186/1471-2105-11-406 |