Supervised DNA Barcodes species classification: analysis, comparisons and results
BACKGROUND: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on...
Main authors: | Weitschek, Emanuel; Fiscon, Giulia; Felici, Giovanni |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Research |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4022351/ https://www.ncbi.nlm.nih.gov/pubmed/24721333 http://dx.doi.org/10.1186/1756-0381-7-4 |
_version_ | 1782316387277996032 |
---|---|
author | Weitschek, Emanuel; Fiscon, Giulia; Felici, Giovanni |
author_facet | Weitschek, Emanuel; Fiscon, Giulia; Felici, Giovanni |
author_sort | Weitschek, Emanuel |
collection | PubMed |
description | BACKGROUND: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. METHODS: In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. RESULTS: A software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. CONCLUSIONS: The classification analysis shows that supervised machine learning methods are promising candidates for handling with success the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community. |
format | Online Article Text |
id | pubmed-4022351 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4022351 2014-05-16 Supervised DNA Barcodes species classification: analysis, comparisons and results Weitschek, Emanuel Fiscon, Giulia Felici, Giovanni BioData Min Research BACKGROUND: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcode and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcode: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem assigns an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms. METHODS: In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with respect to ad-hoc and well-established DNA Barcode classification methods. RESULTS: A software that converts the DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform on average the other considered classifiers, although they do not provide a human interpretable classification model. Rule-based methods have slightly inferior classification performances, but deliver the species specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performances with respect to the traditional DNA Barcode classification methods. On empirical data their classification performances are at a comparable level to the other methods. CONCLUSIONS: The classification analysis shows that supervised machine learning methods are promising candidates for handling with success the DNA Barcoding species classification problem, obtaining excellent performances. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community. BioMed Central 2014-04-11 /pmc/articles/PMC4022351/ /pubmed/24721333 http://dx.doi.org/10.1186/1756-0381-7-4 Text en Copyright © 2014 Weitschek et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Weitschek, Emanuel Fiscon, Giulia Felici, Giovanni Supervised DNA Barcodes species classification: analysis, comparisons and results |
title | Supervised DNA Barcodes species classification: analysis, comparisons and results |
title_sort | supervised dna barcodes species classification: analysis, comparisons and results |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4022351/ https://www.ncbi.nlm.nih.gov/pubmed/24721333 http://dx.doi.org/10.1186/1756-0381-7-4 |
work_keys_str_mv | AT weitschekemanuel superviseddnabarcodesspeciesclassificationanalysiscomparisonsandresults AT fiscongiulia superviseddnabarcodesspeciesclassificationanalysiscomparisonsandresults AT felicigiovanni superviseddnabarcodesspeciesclassificationanalysiscomparisonsandresults |
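
The abstract in this record mentions a released tool that converts DNA Barcode FASTA sequences to the Weka format. The following is only a minimal sketch of what such a conversion could look like, not the authors' software. It assumes headers of the form `>id|species`, pre-aligned sequences of equal length, one nominal ARFF attribute per alignment position, and that any symbol other than A, C, G, or T is written as the ARFF missing value `?`. The names `parse_fasta` and `fasta_to_arff` are hypothetical.

```python
from typing import List, Tuple

NUCLEOTIDES = ["A", "C", "G", "T"]


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (species, sequence) pairs from FASTA text.

    Assumption: the species label is the last '|'-separated header field.
    Spaces in the label become underscores so it is a plain ARFF token.
    """
    records: List[Tuple[str, str]] = []
    label = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if label is not None:
                records.append((label, "".join(chunks).upper()))
            label = line[1:].split("|")[-1].strip().replace(" ", "_")
            chunks = []
        else:
            chunks.append(line)
    if label is not None:
        records.append((label, "".join(chunks).upper()))
    return records


def fasta_to_arff(text: str, relation: str = "barcodes") -> str:
    """Build an ARFF document with one nominal attribute per position."""
    records = parse_fasta(text)
    if not records:
        raise ValueError("no FASTA records found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences must be aligned to equal length")

    nominal = "{" + ",".join(NUCLEOTIDES) + "}"
    species = sorted({label for label, _ in records})

    lines = ["@RELATION " + relation, ""]
    for i in range(1, length + 1):
        lines.append("@ATTRIBUTE pos" + str(i) + " " + nominal)
    lines.append("@ATTRIBUTE species {" + ",".join(species) + "}")
    lines += ["", "@DATA"]
    for label, seq in records:
        # Gaps and ambiguity codes are emitted as '?', ARFF's missing value.
        symbols = [c if c in NUCLEOTIDES else "?" for c in seq]
        lines.append(",".join(symbols + [label]))
    return "\n".join(lines) + "\n"
```

Calling `fasta_to_arff` on the contents of an aligned FASTA file yields text that can be saved with an `.arff` extension and loaded into Weka, where classifiers such as those named in the abstract could then be applied. Treating each position as a nominal attribute is one simple encoding choice; it is what would let rule-based classifiers report species-specific positions and nucleotides.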