Rapid and accurate taxonomic classification of insect (class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naïve Bayesian classifier

Current methods to identify unknown insect (class Insecta) cytochrome c oxidase (COI barcode) sequences often rely on thresholds of distances that can be difficult to define, sequence similarity cut-offs, or monophyly. Some of the most commonly used metagenomic classification methods do not provide...


Bibliographic Details
Main Authors: Porter, Teresita M; Gibson, Joel F; Shokralla, Shadi; Baird, Donald J; Golding, G Brian; Hajibabaei, Mehrdad
Format: Online Article Text
Language: English
Published: BlackWell Publishing Ltd 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4282328/
http://dx.doi.org/10.1111/1755-0998.12240
author Porter, Teresita M
Gibson, Joel F
Shokralla, Shadi
Baird, Donald J
Golding, G Brian
Hajibabaei, Mehrdad
author_sort Porter, Teresita M
collection PubMed
description Current methods to identify unknown insect (class Insecta) cytochrome c oxidase (COI barcode) sequences often rely on thresholds of distances that can be difficult to define, sequence similarity cut-offs, or monophyly. Some of the most commonly used metagenomic classification methods do not provide a measure of confidence for the taxonomic assignments they provide. The aim of this study was to use a naïve Bayesian classifier (Wang et al. Applied and Environmental Microbiology, 2007; 73: 5261) to automate taxonomic assignments for large batches of insect COI sequences such as data obtained from high-throughput environmental sequencing. This method provides rank-flexible taxonomic assignments with an associated bootstrap support value, and it is faster than the BLAST-based methods commonly used in environmental sequence surveys. We have developed and rigorously tested the performance of three different training sets using leave-one-out cross-validation, two field data sets, and targeted testing of Lepidoptera, Diptera and Mantodea sequences obtained from the Barcode of Life Data system. We found that type I error rates (incorrect taxonomic assignments with high bootstrap support) were already relatively low but could be lowered further by ensuring that all query taxa are actually present in the reference database. Choosing bootstrap support cut-offs according to query length and summarizing taxonomic assignments to more inclusive ranks can also help to reduce error while retaining the maximum number of assignments. Additionally, we highlight gaps in the taxonomic and geographic representation of insects in public sequence databases that will require further work by taxonomists to improve the quality of assignments generated using any method.
format Online
Article
Text
id pubmed-4282328
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BlackWell Publishing Ltd
record_format MEDLINE/PubMed
spelling pubmed-4282328 2015-01-26 Mol Ecol Resour Resource Articles BlackWell Publishing Ltd 2014-09 2014-03-19 /pmc/articles/PMC4282328/ http://dx.doi.org/10.1111/1755-0998.12240 Text en © 2014 Her Majesty the Queen in Right of Canada. Molecular Ecology Resources Published by John Wiley & Sons Ltd. Reproduced with the permission of the Minister of Environment. http://creativecommons.org/licenses/by/3.0/ This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
title Rapid and accurate taxonomic classification of insect (class Insecta) cytochrome c oxidase subunit 1 (COI) DNA barcode sequences using a naïve Bayesian classifier
topic Resource Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4282328/
http://dx.doi.org/10.1111/1755-0998.12240