
Alignment-free analysis of barcode sequences by means of compression-based methods

BACKGROUND: The key idea of the DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. A DNA barcode represents a valuable type of information that can be integrated with ecological, genetic, and...

Bibliographic Details
Main authors: La Rosa, Massimo; Fiannaca, Antonino; Rizzo, Riccardo; Urso, Alfonso
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3633054/
https://www.ncbi.nlm.nih.gov/pubmed/23815444
http://dx.doi.org/10.1186/1471-2105-14-S7-S4
author La Rosa, Massimo
Fiannaca, Antonino
Rizzo, Riccardo
Urso, Alfonso
collection PubMed
description BACKGROUND: The key idea of the DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. A DNA barcode represents a valuable type of information that can be integrated with ecological, genetic, and morphological data in order to obtain a more consistent taxonomy. Recent studies have shown that, for the animal kingdom, the mitochondrial gene cytochrome c oxidase I (COI), about 650 bp long, can be used as a barcode sequence for identification and taxonomic purposes. In the present work we aim to introduce an alignment-free approach to the taxonomic analysis of barcode sequences. Our approach is based on two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. Our purpose is to justify the use of USM also for the analysis of short DNA barcode sequences, showing that USM can correctly extract taxonomic information from this kind of sequence. RESULTS: We downloaded from the Barcode of Life Data System (BOLD) database 30 datasets of barcode sequences belonging to different animal species. We built phylogenetic trees for every dataset, according to compression-based and classic evolutionary methods, and compared them in terms of topology preservation. In the experimental tests, the similarity between evolutionary and compression-based trees was between 80% and 100% for most of the datasets (94%). Moreover, we carried out experimental tests using simulated barcode datasets composed of 100, 150, 200 and 500 sequences, with each simulation replicated 25-fold. In this case, mean similarity scores between evolutionary and compression-based trees range from 83% to 99% for all simulated datasets.
CONCLUSIONS: In the present work we aim to introduce an alignment-free approach to the taxonomic analysis of barcode sequences. Our approach is based on two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. In this way we demonstrate the reliability of compression-based methods even for the analysis of short barcode sequences. Compression-based methods, with their strong theoretical assumptions, may then represent a valid alignment-free and parameter-free approach for barcode studies.
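The abstract above does not name the two compression-based distances it uses. As a rough illustration of the general idea, the sketch below computes the Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), a well-known computable approximation of this kind, where C(.) is the compressed length. The choice of zlib as the compressor, the helper names, and the synthetic sequences are all assumptions made for this example; they are not taken from the paper.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (stand-in for C(x))."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences.

    No alignment is performed: only compressed lengths are compared.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def toy_sequence(length: int, rng: random.Random) -> str:
    """Random A/C/G/T string. Synthetic data, not a real barcode."""
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(seq: str, n_subs: int, rng: random.Random) -> str:
    """Copy of seq with n_subs positions overwritten by a random base
    (possibly the same base that was already there)."""
    bases = list(seq)
    for pos in rng.sample(range(len(bases)), n_subs):
        bases[pos] = rng.choice("ACGT")
    return "".join(bases)


if __name__ == "__main__":
    rng = random.Random(0)
    a = toy_sequence(650, rng)  # roughly the COI barcode length
    b = mutate(a, 10, rng)      # close relative of a
    c = toy_sequence(650, rng)  # unrelated sequence
    print(f"NCD(a, b) = {ncd(a, b):.3f}")
    print(f"NCD(a, c) = {ncd(a, c):.3f}")
```

A matrix of such pairwise distances could then be passed to a distance-based tree-building method. A general-purpose compressor such as zlib is only a convenient stand-in here; a compressor designed for DNA may behave differently on short sequences.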
format Online
Article
Text
id pubmed-3633054
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3633054 2013-04-25 Alignment-free analysis of barcode sequences by means of compression-based methods La Rosa, Massimo Fiannaca, Antonino Rizzo, Riccardo Urso, Alfonso BMC Bioinformatics Research BioMed Central 2013-04-22 /pmc/articles/PMC3633054/ /pubmed/23815444 http://dx.doi.org/10.1186/1471-2105-14-S7-S4 Text en Copyright © 2013 La Rosa et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Alignment-free analysis of barcode sequences by means of compression-based methods
topic Research