Supervised DNA Barcodes species classification: analysis, comparisons and results

Bibliographic Details
Main Authors: Weitschek, Emanuel; Fiscon, Giulia; Felici, Giovanni
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4022351/
https://www.ncbi.nlm.nih.gov/pubmed/24721333
http://dx.doi.org/10.1186/1756-0381-7-4
author Weitschek, Emanuel
Fiscon, Giulia
Felici, Giovanni
collection PubMed
description BACKGROUND: Specific fragments taken from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences) have been defined as DNA Barcodes and can be used as markers for organisms of the main kingdoms of life. Species classification with DNA Barcode sequences has proven effective on different organisms. Indeed, specific gene regions have been identified as Barcodes: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported by reliable methods and algorithms.
METHODS: This work shows the efficacy of supervised machine learning methods for classifying species from DNA Barcode sequences. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with those of ad-hoc and well-established DNA Barcode classification methods.
RESULTS: Software that converts DNA Barcode sequences from FASTA to the Weka format is released, in order to accommodate different input formats and to allow the classification procedure to be run. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform, on average, the other classifiers considered, although they do not provide a human-interpretable classification model. Rule-based methods have slightly inferior classification performance, but deliver the species-specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performance compared with the traditional DNA Barcode classification methods. On empirical data their classification performance is comparable to that of the other methods.
CONCLUSIONS: The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, achieving excellent performance. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community.
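The FASTA-to-Weka conversion mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released software; it is only an assumption-laden illustration of the general idea. It assumes that the sequences are already aligned to equal length, that each header follows a hypothetical `>specimen_id|Species name` layout, and that every alignment position becomes one nominal attribute in Weka's ARFF format. The function names `parse_fasta`, `species_of`, and `fasta_to_arff` are made up for this example.

```python
from typing import List, Tuple

NUCLEOTIDES = ("A", "C", "G", "T")


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def species_of(header: str) -> str:
    """Extract the species label; assumes a 'specimen_id|Species name' header."""
    if "|" in header:
        return header.split("|", 1)[1].strip()
    return header.strip()


def fasta_to_arff(text: str, relation: str = "barcodes") -> str:
    """Build an ARFF document with one nominal attribute per aligned position."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to the same length")

    species = sorted({species_of(header) for header, _ in records})
    alphabet = ",".join(NUCLEOTIDES)
    quoted_species = ",".join("'" + name + "'" for name in species)

    lines = ["@relation " + relation, ""]
    for position in range(1, length + 1):
        lines.append("@attribute pos%d {%s}" % (position, alphabet))
    lines.append("@attribute species {%s}" % quoted_species)
    lines += ["", "@data"]
    for header, seq in records:
        # Gaps and ambiguity codes are written as '?', the ARFF missing value.
        values = [base if base in NUCLEOTIDES else "?" for base in seq]
        values.append("'" + species_of(header) + "'")
        lines.append(",".join(values))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = ">s1|Homo sapiens\nACGT\n>s2|Pan troglodytes\nAC-T\n"
    print(fasta_to_arff(sample))
```

A file produced this way could then be loaded into Weka and passed to any of the classifier families named above. Species names containing apostrophes would need extra escaping, which this sketch does not handle.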
format Online
Article
Text
id pubmed-4022351
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4022351 2014-05-16 Supervised DNA Barcodes species classification: analysis, comparisons and results Weitschek, Emanuel Fiscon, Giulia Felici, Giovanni BioData Min Research
BioMed Central 2014-04-11 /pmc/articles/PMC4022351/ /pubmed/24721333 http://dx.doi.org/10.1186/1756-0381-7-4 Text en
Copyright © 2014 Weitschek et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Supervised DNA Barcodes species classification: analysis, comparisons and results
topic Research