Alignment-free analysis of barcode sequences by means of compression-based methods
BACKGROUND: The key idea of the DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. A DNA barcode represents a valuable type of information that can be integrated with ecological, genetic, and...
Main Authors: | La Rosa, Massimo, Fiannaca, Antonino, Rizzo, Riccardo, Urso, Alfonso |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3633054/ https://www.ncbi.nlm.nih.gov/pubmed/23815444 http://dx.doi.org/10.1186/1471-2105-14-S7-S4 |
_version_ | 1782266935146184704 |
---|---|
author | La Rosa, Massimo Fiannaca, Antonino Rizzo, Riccardo Urso, Alfonso |
author_facet | La Rosa, Massimo Fiannaca, Antonino Rizzo, Riccardo Urso, Alfonso |
author_sort | La Rosa, Massimo |
collection | PubMed |
description | BACKGROUND: The key idea of the DNA barcode initiative is to identify, for each group of species belonging to different kingdoms of life, a short DNA sequence that can act as a true taxon barcode. A DNA barcode represents a valuable type of information that can be integrated with ecological, genetic, and morphological data in order to obtain a more consistent taxonomy. Recent studies have shown that, for the animal kingdom, the mitochondrial gene cytochrome c oxidase I (COI), about 650 bp long, can be used as a barcode sequence for identification and taxonomic purposes. In the present work we aim to introduce an alignment-free approach for the taxonomic analysis of barcode sequences. Our approach is based on two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. Our purpose is to justify the use of USM for the analysis of short DNA barcode sequences as well, showing that USM is able to correctly extract taxonomic information from this kind of sequence. RESULTS: We downloaded 30 datasets of barcode sequences belonging to different animal species from the Barcode of Life Data System (BOLD) database. We built phylogenetic trees for every dataset, according to compression-based and classic evolutionary methods, and compared them in terms of topology preservation. In the experimental tests, the similarity between evolutionary and compression-based trees ranged from 80% to 100% for most of the datasets (94%). Moreover, we carried out experimental tests using simulated barcode datasets composed of 100, 150, 200 and 500 sequences, with each simulation replicated 25-fold. In this case, mean similarity scores between evolutionary and compression-based trees range between 83% and 99% for all simulated datasets. 
CONCLUSIONS: In the present work we aim to introduce an alignment-free approach for the taxonomic analysis of barcode sequences. Our approach is based on two compression-based versions of the non-computable Universal Similarity Metric (USM) class of distances. In this way we demonstrate the reliability of compression-based methods even for the analysis of short barcode sequences. Compression-based methods, with their strong theoretical assumptions, may then represent a valid alignment-free and parameter-free approach for barcode studies. |
format | Online Article Text |
id | pubmed-3633054 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-36330542013-04-25 Alignment-free analysis of barcode sequences by means of compression-based methods La Rosa, Massimo Fiannaca, Antonino Rizzo, Riccardo Urso, Alfonso BMC Bioinformatics Research BioMed Central 2013-04-22 /pmc/articles/PMC3633054/ /pubmed/23815444 http://dx.doi.org/10.1186/1471-2105-14-S7-S4 Text en Copyright © 2013 La Rosa et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research La Rosa, Massimo Fiannaca, Antonino Rizzo, Riccardo Urso, Alfonso Alignment-free analysis of barcode sequences by means of compression-based methods |
title | Alignment-free analysis of barcode sequences by means of compression-based methods |
title_full | Alignment-free analysis of barcode sequences by means of compression-based methods |
title_fullStr | Alignment-free analysis of barcode sequences by means of compression-based methods |
title_full_unstemmed | Alignment-free analysis of barcode sequences by means of compression-based methods |
title_short | Alignment-free analysis of barcode sequences by means of compression-based methods |
title_sort | alignment-free analysis of barcode sequences by means of compression-based methods |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3633054/ https://www.ncbi.nlm.nih.gov/pubmed/23815444 http://dx.doi.org/10.1186/1471-2105-14-S7-S4 |
work_keys_str_mv | AT larosamassimo alignmentfreeanalysisofbarcodesequencesbymeansofcompressionbasedmethods AT fiannacaantonino alignmentfreeanalysisofbarcodesequencesbymeansofcompressionbasedmethods AT rizzoriccardo alignmentfreeanalysisofbarcodesequencesbymeansofcompressionbasedmethods AT ursoalfonso alignmentfreeanalysisofbarcodesequencesbymeansofcompressionbasedmethods |
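
The description field above refers to compression-based approximations of the non-computable USM, but this record does not name them. As a minimal sketch, assuming the Normalized Compression Distance (NCD) as one such approximation and `zlib` as a stand-in compressor (both are assumptions, not details confirmed by this record), pairwise distances between barcode sequences could be computed roughly as follows. The function names `compressed_size`, `ncd` and `distance_matrix` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two sequences.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest similar
    sequences; values near 1 suggest unrelated ones.
    """
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs: list[str]) -> list[list[float]]:
    """Symmetric pairwise NCD matrix with a zero diagonal.

    Each pair is compressed once (in index order) and mirrored, because
    a real compressor does not guarantee C(xy) == C(yx).
    """
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = ncd(seqs[i], seqs[j])
    return dist
```

A matrix of this kind could then be passed to any distance-based tree-building method (for example neighbor joining) to obtain a tree without aligning the sequences. A general-purpose compressor such as `zlib` is only a convenient placeholder here; the choice of compressor affects the distances.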