
DNA barcode analysis: a comparison of phylogenetic and statistical classification methods

BACKGROUND: DNA barcoding aims to assign individuals to given species according to their sequence at a small locus, generally part of the CO1 mitochondrial gene. Amongst other issues, this raises the question of how to deal with within-species genetic variability and potential transpecific polymorph...


Bibliographic Details
Main authors: Austerlitz, Frederic, David, Olivier, Schaeffer, Brigitte, Bleakley, Kevin, Olteanu, Madalina, Leblois, Raphael, Veuille, Michel, Laredo, Catherine
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2775147/
https://www.ncbi.nlm.nih.gov/pubmed/19900297
http://dx.doi.org/10.1186/1471-2105-10-S14-S10
author Austerlitz, Frederic
David, Olivier
Schaeffer, Brigitte
Bleakley, Kevin
Olteanu, Madalina
Leblois, Raphael
Veuille, Michel
Laredo, Catherine
author_sort Austerlitz, Frederic
collection PubMed
description BACKGROUND: DNA barcoding aims to assign individuals to given species according to their sequence at a small locus, generally part of the CO1 mitochondrial gene. Amongst other issues, this raises the question of how to deal with within-species genetic variability and potential transpecific polymorphism. In this context, we examine several assignation methods belonging to two main categories: (i) phylogenetic methods (neighbour-joining and PhyML) that attempt to account for the genealogical framework of DNA evolution and (ii) supervised classification methods (k-nearest neighbour, CART, random forest and kernel methods). These methods range from basic to elaborate. We investigated the ability of each method to correctly classify query sequences drawn from samples of related species using both simulated and real data. Simulated data sets were generated using coalescent simulations in which we varied the genealogical history, mutation parameter, sample size and number of species. RESULTS: No method was found to be the best in all cases. The simplest method of all, "one nearest neighbour", was found to be the most reliable with respect to changes in the parameters of the data sets. The parameter most influencing the performance of the various methods was molecular diversity of the data. Addition of genetically independent loci - nuclear genes - improved the predictive performance of most methods. CONCLUSION: The study implies that taxonomists can influence the quality of their analyses either by choosing a method best-adapted to the configuration of their sample, or, given a certain method, increasing the sample size or altering the amount of molecular diversity. This can be achieved either by sequencing more mtDNA or by sequencing additional nuclear genes. In the latter case, they may also have to modify their data analysis method.
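The "one nearest neighbour" method named in the abstract can be illustrated with a minimal sketch. It assumes aligned sequences of equal length and uses the proportion of differing sites as the distance; the abstract does not state which distance the authors used, so that choice, the function names, and the toy sequences are all hypothetical.

```python
def p_distance(a, b):
    """Proportion of sites that differ between two aligned sequences.

    Assumption: this simple distance is chosen for illustration only;
    the abstract does not specify a distance measure.
    """
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def one_nearest_neighbour(query, reference):
    """Assign the query to the species of its closest reference sequence.

    reference is a list of (sequence, species) pairs. Returns None when
    the closest references belong to more than one species (ambiguous).
    """
    distances = [(p_distance(query, seq), species) for seq, species in reference]
    best = min(d for d, _ in distances)
    closest_species = {species for d, species in distances if d == best}
    return closest_species.pop() if len(closest_species) == 1 else None


# Hypothetical toy reference set: two species, two sequences each.
reference = [
    ("ACGTACGTAC", "sp_A"),
    ("ACGTACGTAT", "sp_A"),
    ("TTGTACCTAC", "sp_B"),
    ("TTGTACCTAA", "sp_B"),
]

assigned = one_nearest_neighbour("ACGTACGTAA", reference)
print(assigned)
```

Returning None on a cross-species tie is one possible design choice; a real analysis might instead report all tied species or fall back to a larger k.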
format Text
id pubmed-2775147
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-27751472009-11-10 DNA barcode analysis: a comparison of phylogenetic and statistical classification methods Austerlitz, Frederic David, Olivier Schaeffer, Brigitte Bleakley, Kevin Olteanu, Madalina Leblois, Raphael Veuille, Michel Laredo, Catherine BMC Bioinformatics Research BACKGROUND: DNA barcoding aims to assign individuals to given species according to their sequence at a small locus, generally part of the CO1 mitochondrial gene. Amongst other issues, this raises the question of how to deal with within-species genetic variability and potential transpecific polymorphism. In this context, we examine several assignation methods belonging to two main categories: (i) phylogenetic methods (neighbour-joining and PhyML) that attempt to account for the genealogical framework of DNA evolution and (ii) supervised classification methods (k-nearest neighbour, CART, random forest and kernel methods). These methods range from basic to elaborate. We investigated the ability of each method to correctly classify query sequences drawn from samples of related species using both simulated and real data. Simulated data sets were generated using coalescent simulations in which we varied the genealogical history, mutation parameter, sample size and number of species. RESULTS: No method was found to be the best in all cases. The simplest method of all, "one nearest neighbour", was found to be the most reliable with respect to changes in the parameters of the data sets. The parameter most influencing the performance of the various methods was molecular diversity of the data. Addition of genetically independent loci - nuclear genes - improved the predictive performance of most methods. CONCLUSION: The study implies that taxonomists can influence the quality of their analyses either by choosing a method best-adapted to the configuration of their sample, or, given a certain method, increasing the sample size or altering the amount of molecular diversity. 
This can be achieved either by sequencing more mtDNA or by sequencing additional nuclear genes. In the latter case, they may also have to modify their data analysis method. BioMed Central 2009-11-10 /pmc/articles/PMC2775147/ /pubmed/19900297 http://dx.doi.org/10.1186/1471-2105-10-S14-S10 Text en Copyright © 2009 Austerlitz et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title DNA barcode analysis: a comparison of phylogenetic and statistical classification methods
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2775147/
https://www.ncbi.nlm.nih.gov/pubmed/19900297
http://dx.doi.org/10.1186/1471-2105-10-S14-S10