A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods
Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spac...
Main authors: | Zhang, Ai-bing; Feng, Jie; Ward, Robert D.; Wan, Ping; Gao, Qiang; Wu, Jun; Zhao, Wei-zhong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2012 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3282726/ https://www.ncbi.nlm.nih.gov/pubmed/22363527 http://dx.doi.org/10.1371/journal.pone.0030986 |
_version_ | 1782224112314220544 |
---|---|
author | Zhang, Ai-bing Feng, Jie Ward, Robert D. Wan, Ping Gao, Qiang Wu, Jun Zhao, Wei-zhong |
author_facet | Zhang, Ai-bing Feng, Jie Ward, Robert D. Wan, Ping Gao, Qiang Wu, Jun Zhao, Wei-zhong |
author_sort | Zhang, Ai-bing |
collection | PubMed |
description | Species identification via DNA barcodes is contributing greatly to current bioinventory efforts. The initial, and widely accepted, proposal was to use the protein-coding cytochrome c oxidase subunit I (COI) region as the standard barcode for animals, but recently non-coding internal transcribed spacer (ITS) genes have been proposed as candidate barcodes for both animals and plants. However, achieving a robust alignment for non-coding regions can be problematic. Here we propose two new methods (DV-RBF and FJ-RBF) to address this issue for species assignment by both coding and non-coding sequences that take advantage of the power of machine learning and bioinformatics. We demonstrate the value of the new methods with four empirical datasets, two representing typical protein-coding COI barcode datasets (neotropical bats and marine fish) and two representing non-coding ITS barcodes (rust fungi and brown algae). Using two random sub-sampling approaches, we demonstrate that the new methods significantly outperformed existing Neighbor-joining (NJ) and Maximum likelihood (ML) methods for both coding and non-coding barcodes when there was complete species coverage in the reference dataset. The new methods also outperformed NJ and ML methods for non-coding sequences in circumstances of potentially incomplete species coverage, although then the NJ and ML methods performed slightly better than the new methods for protein-coding barcodes. A 100% success rate of species identification was achieved with the two new methods for 4,122 bat queries and 5,134 fish queries using COI barcodes, with 95% confidence intervals (CI) of 99.75–100%. The new methods also obtained a 96.29% success rate (95% CI: 91.62–98.40%) for 484 rust fungi queries and a 98.50% success rate (95% CI: 96.60–99.37%) for 1,094 brown algae queries, both using ITS barcodes. |
format | Online Article Text |
id | pubmed-3282726 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-3282726 2012-02-23 A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods Zhang, Ai-bing Feng, Jie Ward, Robert D. Wan, Ping Gao, Qiang Wu, Jun Zhao, Wei-zhong PLoS One Research Article Public Library of Science 2012-02-20 /pmc/articles/PMC3282726/ /pubmed/22363527 http://dx.doi.org/10.1371/journal.pone.0030986 Text en Zhang et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited. |
spellingShingle | Research Article Zhang, Ai-bing Feng, Jie Ward, Robert D. Wan, Ping Gao, Qiang Wu, Jun Zhao, Wei-zhong A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods |
title | A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods |
title_full | A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods |
title_fullStr | A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods |
title_full_unstemmed | A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods |
title_short | A New Method for Species Identification via Protein-Coding and Non-Coding DNA Barcodes by Combining Machine Learning with Bioinformatic Methods |
title_sort | new method for species identification via protein-coding and non-coding dna barcodes by combining machine learning with bioinformatic methods |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3282726/ https://www.ncbi.nlm.nih.gov/pubmed/22363527 http://dx.doi.org/10.1371/journal.pone.0030986 |