
Using the RDP Classifier to Predict Taxonomic Novelty and Reduce the Search Space for Finding Novel Organisms

BACKGROUND: Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer read...

Bibliographic Details
Main authors: Lan, Yemin; Wang, Qiong; Cole, James R.; Rosen, Gail L.
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3293824/
https://www.ncbi.nlm.nih.gov/pubmed/22403664
http://dx.doi.org/10.1371/journal.pone.0032491
_version_ 1782225434828603392
author Lan, Yemin
Wang, Qiong
Cole, James R.
Rosen, Gail L.
author_facet Lan, Yemin
Wang, Qiong
Cole, James R.
Rosen, Gail L.
author_sort Lan, Yemin
collection PubMed
description BACKGROUND: Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa. PRINCIPAL FINDINGS: Because each fragment is assigned a score (containing likelihood or confidence information such as the bootstrap score in the RDP classifier), we “train” a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naïve Bayesian methods. Nonetheless, our method performs better with the RDP classifier than the other methods tested, measured by the f-measure and the area-under-the-curve on the receiver operating characteristic of the test set. By constraining the database to well-represented genera, sensitivity improves 3–15%. Finally, we show that the detector is a good predictor to determine novel abundant taxa (especially for finer levels of taxonomy where novelty is more likely to be present). CONCLUSIONS: We conclude that selecting a read-length appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. In addition, having a well-represented database significantly improves performance while having genera that are “highly” similar does not make a significant improvement. 
On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database “doubles” in the future.
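The threshold idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the threshold is chosen, so picking the cutoff that maximizes the F-measure on the training set is an assumption made here, and the function names and all scores below are hypothetical, made-up values rather than RDP output. A read is called "novel" when its score falls below the cutoff.

```python
from typing import List, Tuple


def confusion(scores: List[float], is_novel: List[bool],
              threshold: float) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) when reads scoring below `threshold` are called novel."""
    tp = fp = tn = fn = 0
    for score, novel in zip(scores, is_novel):
        called_novel = score < threshold
        if called_novel and novel:
            tp += 1
        elif called_novel and not novel:
            fp += 1
        elif not called_novel and novel:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F1 score; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def train_threshold(scores: List[float], is_novel: List[bool]) -> float:
    """Pick the cutoff with the highest training F-measure (assumed criterion)."""
    # Candidate cutoffs: every observed score, plus one just above the maximum.
    candidates = sorted(set(scores)) + [max(scores) + 1e-9]
    best_t, best_f = candidates[0], -1.0
    for t in candidates:
        tp, fp, _, fn = confusion(scores, is_novel, t)
        f = f_measure(tp, fp, fn)
        if f > best_f:  # strict comparison keeps the lowest cutoff on ties
            best_t, best_f = t, f
    return best_t


# Hypothetical confidence scores in [0, 1]; True marks a read from a novel taxon.
train_scores = [0.95, 0.99, 0.90, 0.85, 0.40, 0.55, 0.30, 0.88]
train_novel = [False, False, False, False, True, True, True, True]
test_scores = [0.97, 0.92, 0.35, 0.60, 0.91]
test_novel = [False, False, True, True, True]

threshold = train_threshold(train_scores, train_novel)
tp, fp, tn, fn = confusion(test_scores, test_novel, threshold)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"threshold={threshold}, sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}")
```

On this toy test set the trained cutoff misses one novel read that scored high but calls no known read novel, which mirrors the conservative behavior (lower sensitivity, high specificity) that the abstract reports for naïve Bayesian methods.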
format Online
Article
Text
id pubmed-3293824
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-32938242012-03-08 Using the RDP Classifier to Predict Taxonomic Novelty and Reduce the Search Space for Finding Novel Organisms Lan, Yemin Wang, Qiong Cole, James R. Rosen, Gail L. PLoS One Research Article BACKGROUND: Currently, the naïve Bayesian classifier provided by the Ribosomal Database Project (RDP) is one of the most widely used tools to classify 16S rRNA sequences, mainly collected from environmental samples. We show that RDP has 97+% assignment accuracy and is fast for 250 bp and longer reads when the read originates from a taxon known to the database. Because most environmental samples will contain organisms from taxa whose 16S rRNA genes have not been previously sequenced, we aim to benchmark how well the RDP classifier and other competing methods can discriminate these novel taxa from known taxa. PRINCIPAL FINDINGS: Because each fragment is assigned a score (containing likelihood or confidence information such as the bootstrap score in the RDP classifier), we “train” a threshold to discriminate between novel and known organisms and observe its performance on a test set. The threshold that we determine tends to be conservative (low sensitivity but high specificity) for naïve Bayesian methods. Nonetheless, our method performs better with the RDP classifier than the other methods tested, measured by the f-measure and the area-under-the-curve on the receiver operating characteristic of the test set. By constraining the database to well-represented genera, sensitivity improves 3–15%. Finally, we show that the detector is a good predictor to determine novel abundant taxa (especially for finer levels of taxonomy where novelty is more likely to be present). CONCLUSIONS: We conclude that selecting a read-length appropriate RDP bootstrap score can significantly reduce the search space for identifying novel genera and higher levels in taxonomy. 
In addition, having a well-represented database significantly improves performance while having genera that are “highly” similar does not make a significant improvement. On a real dataset from an Amazon Terra Preta soil sample, we show that the detector can predict (or correlates to) whether novel sequences will be assigned to new taxa when the RDP database “doubles” in the future. Public Library of Science 2012-03-05 /pmc/articles/PMC3293824/ /pubmed/22403664 http://dx.doi.org/10.1371/journal.pone.0032491 Text en Lan et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Lan, Yemin
Wang, Qiong
Cole, James R.
Rosen, Gail L.
Using the RDP Classifier to Predict Taxonomic Novelty and Reduce the Search Space for Finding Novel Organisms
title Using the RDP Classifier to Predict Taxonomic Novelty and Reduce the Search Space for Finding Novel Organisms
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3293824/
https://www.ncbi.nlm.nih.gov/pubmed/22403664
http://dx.doi.org/10.1371/journal.pone.0032491