A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods

Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spac...

Bibliographic Details
Main authors: Zhang, Ai-bing; Feng, Jie; Ward, Robert D.; Wan, Ping; Gao, Qiang; Wu, Jun; Zhao, Wei-zhong
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3282726/
https://www.ncbi.nlm.nih.gov/pubmed/22363527
http://dx.doi.org/10.1371/journal.pone.0030986
author Zhang, Ai-bing
Feng, Jie
Ward, Robert D.
Wan, Ping
Gao, Qiang
Wu, Jun
Zhao, Wei-zhong
collection PubMed
description Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) that combine machine learning and bioinformatics to address this issue for species assignment with both coding and non-coding sequences. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed the existing neighbor-joining (NJ) and maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also outperformed the NJ and ML methods for non-coding sequences when species coverage was potentially incomplete, although in that case the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75–100%. The new methods also obtained a 96.29% success rate (95% CI: 91.62–98.40%) for 484 rust fungi queries and a 98.50% success rate (95% CI: 96.60–99.37%) for 1,094 brown algae queries, both using ITS barcodes.
format Online
Article
Text
id pubmed-3282726
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3282726 2012-02-23 PLoS One Research Article Public Library of Science 2012-02-20 /pmc/articles/PMC3282726/ /pubmed/22363527 http://dx.doi.org/10.1371/journal.pone.0030986 Text en Zhang et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3282726/
https://www.ncbi.nlm.nih.gov/pubmed/22363527
http://dx.doi.org/10.1371/journal.pone.0030986
work_keys_str_mv AT zhangaibing anewmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT fengjie anewmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT wardrobertd anewmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT wanping anewmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT gaoqiang anewmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT wujun anewmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT zhaoweizhong anewmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT zhangaibing newmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT fengjie newmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT wardrobertd newmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT wanping newmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT gaoqiang newmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT wujun newmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
AT zhaoweizhong newmethodforspeciesidentificationviaproteincodingandnoncodingdnabarcodesbycombiningmachinelearningwithbioinformaticmethods
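
The description field above notes that robust alignment of non-coding regions can be problematic and that the proposed methods assign species from both coding and non-coding barcodes. This record does not describe how DV-RBF or FJ-RBF work, so the following is not either of those methods. It is only a minimal sketch of the general idea of alignment-free species assignment, assuming k-mer frequency profiles compared with a Gaussian radial basis function kernel. The function names (`kmer_profile`, `rbf_similarity`, `assign_species`), the parameters `k` and `gamma`, and the toy sequences are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product
from math import exp


def kmer_profile(seq, k=3):
    """Relative frequency of every DNA k-mer in seq; no alignment is needed."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing ambiguous bases (e.g. N) are ignored in the total.
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]


def rbf_similarity(u, v, gamma=50.0):
    """Gaussian (RBF) kernel between two profiles: exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-gamma * sq_dist)


def assign_species(query, references, k=3, gamma=50.0):
    """Return (species, score) for the species with the highest mean similarity."""
    q = kmer_profile(query, k)
    scores = {}
    for species, seqs in references.items():
        sims = [rbf_similarity(q, kmer_profile(s, k), gamma) for s in seqs]
        scores[species] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy, made-up reference library (hypothetical; not data from the article).
references = {
    "species_A": ["ACGTACGTACGTACGTACGT", "ACGTACGTACGAACGTACGT"],
    "species_B": ["GGGCCCGGGCCCGGGCCCGG", "GGGCCCGGGCCTGGGCCCGG"],
}
species, score = assign_species("ACGTACGTACGTACGTTACG", references)
print(species, round(score, 3))
```

Because the profiles are built from k-mer counts, queries of different lengths can be compared directly, which is the property that makes this kind of approach attractive when alignment is unreliable. A real evaluation would also need a rule for rejecting queries whose species is missing from the reference library, the incomplete-coverage situation the description mentions.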