IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences

BACKGROUND: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many micro...

Bibliographic Details
Main authors: Murali, Adithya; Bhargava, Aniruddha; Wright, Erik S.
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6085705/
https://www.ncbi.nlm.nih.gov/pubmed/30092815
http://dx.doi.org/10.1186/s40168-018-0521-5
author Murali, Adithya
Bhargava, Aniruddha
Wright, Erik S.
collection PubMed
description BACKGROUND: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of “over classification” is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive. RESULTS: Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats. CONCLUSIONS: IDTAXA’s classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences. 
IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online (http://DECIPHER.codes). ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s40168-018-0521-5) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6085705
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6085705 2018-08-16 IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences Murali, Adithya Bhargava, Aniruddha Wright, Erik S. Microbiome Software BioMed Central 2018-08-09 /pmc/articles/PMC6085705/ /pubmed/30092815 http://dx.doi.org/10.1186/s40168-018-0521-5 Text en © The Author(s). 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6085705/
https://www.ncbi.nlm.nih.gov/pubmed/30092815
http://dx.doi.org/10.1186/s40168-018-0521-5