
Evaluation of computational methods for human microbiome analysis using simulated data


Bibliographic Details
Main authors: Miossec, Matthieu J., Valenzuela, Sandro L., Pérez-Losada, Marcos, Johnson, W. Evan, Crandall, Keith A., Castro-Nallar, Eduardo
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2020
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7427543/
https://www.ncbi.nlm.nih.gov/pubmed/32864214
http://dx.doi.org/10.7717/peerj.9688
author Miossec, Matthieu J.
Valenzuela, Sandro L.
Pérez-Losada, Marcos
Johnson, W. Evan
Crandall, Keith A.
Castro-Nallar, Eduardo
collection PubMed
description BACKGROUND: Our understanding of the composition, function, and health implications of human microbiota has been advanced by high-throughput sequencing and the development of new genomic analyses. However, trade-offs among alternative strategies for the acquisition and analysis of sequence data remain understudied.
METHODS: We assessed eight popular taxonomic profiling pipelines (MetaPhlAn2, metaMix, PathoScope 2.0, Sigma, Kraken, ConStrains, Centrifuge and Taxator-tk) against a battery of metagenomic datasets simulated from real data. The metagenomic datasets were modeled on 426 complete or permanent draft genomes stored in the Human Oral Microbiome Database and were designed to simulate various experimental conditions, both in the design of a putative experiment (read length, 75–1,000 bp; sequencing depth, 100K–10M) and in metagenomic composition (number of species present: 10, 100 or 426; species distribution). The sensitivity and specificity of each pipeline were measured under the various scenarios. We also estimated the relative root mean square error and the average relative error to assess the abundance estimates produced by the different methods. Additional datasets were generated for five of the pipelines to simulate the presence within a metagenome of an unreferenced species closely related to other, referenced species. Further datasets were generated to measure computational time at ever-increasing sequencing depth (up to 6 × 10^7).
RESULTS: Testing of eight pipelines against 144 simulated metagenomic datasets initially produced 1,104 discrete results. Pipelines using a marker gene strategy (MetaPhlAn2 and ConStrains) were overall less sensitive than the other pipelines, with the notable exception of Taxator-tk. This lower sensitivity was largely offset by runtime, which was significantly lower than that of more sensitive pipelines that rely on whole-genome alignments, such as PathoScope 2.0. However, pipelines that used strategies to speed up alignment between genomic references and metagenomic reads, such as kmerization, were able to combine high sensitivity with low run time, as is the case with Kraken and Centrifuge. When a species' genome was absent from the database, all pipelines mostly assigned its reads to the most closely related species available. Our results therefore suggest that taxonomic profilers that use kmerization have largely superseded those that use gene markers, coupling low run times with high sensitivity and specificity. Taxonomic profilers using more time-consuming read reassignment, such as PathoScope 2.0, provided the most sensitive profiles under common metagenomic sequencing scenarios. All the results described and discussed in this paper can be visualized using the dedicated R Shiny application (https://github.com/microgenomics/HumanMicrobiomeAnalysis). All of our datasets, pipelines and results are made available through the GitHub repository for future benchmarking.
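The abstract names four evaluation measures (sensitivity, specificity, relative root mean square error and average relative error) without defining them. The Python sketch below shows one common way such measures are computed for a simulated metagenome whose true composition is known. The formulas, the function names `detection_metrics` and `abundance_errors`, and the toy species labels are illustrative assumptions, not the definitions or code used by the authors.

```python
from math import sqrt


def detection_metrics(true_species, predicted_species, reference_species):
    """Sensitivity and specificity of species detection.

    Assumed definitions: positives are species truly present in the
    simulated sample; negatives are reference species absent from it.
    """
    true_set = set(true_species)
    pred_set = set(predicted_species)
    ref_set = set(reference_species)

    tp = len(true_set & pred_set)            # present and reported
    fn = len(true_set - pred_set)            # present but missed
    fp = len(pred_set - true_set)            # reported but absent
    tn = len(ref_set - true_set - pred_set)  # absent and not reported

    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity


def abundance_errors(true_abundance, estimated_abundance):
    """Relative RMSE and average relative error of abundance estimates.

    Assumed definitions, taken over species truly present:
      RRMSE = sqrt(mean(((estimate - truth) / truth) ** 2))
      ARE   = mean(|estimate - truth| / truth)
    A species missing from the estimates is treated as estimated at 0.
    """
    relative = [
        (estimated_abundance.get(species, 0.0) - truth) / truth
        for species, truth in true_abundance.items()
        if truth > 0
    ]
    if not relative:
        raise ValueError("no species with positive true abundance")
    rrmse = sqrt(sum(r * r for r in relative) / len(relative))
    are = sum(abs(r) for r in relative) / len(relative)
    return rrmse, are


# Toy example with hypothetical species labels and abundances.
reference = ["sp_A", "sp_B", "sp_C", "sp_D", "sp_E"]
truth = {"sp_A": 0.5, "sp_B": 0.3, "sp_C": 0.2}
estimate = {"sp_A": 0.6, "sp_B": 0.3, "sp_D": 0.1}

sens, spec = detection_metrics(truth, estimate, reference)
rrmse, are = abundance_errors(truth, estimate)
```

In this toy case one true species is missed and one absent species is reported, so both detection measures fall below 1, and the missed species dominates the two error terms because its relative error is 100%. A benchmark of the kind described would repeat such a calculation per pipeline and per simulated dataset.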
format Online
Article
Text
id pubmed-7427543
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-7427543 2020-08-27 PeerJ Bioinformatics PeerJ Inc. 2020-08-11 /pmc/articles/PMC7427543/ /pubmed/32864214 http://dx.doi.org/10.7717/peerj.9688 Text en © 2020 Miossec et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.
title Evaluation of computational methods for human microbiome analysis using simulated data
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7427543/
https://www.ncbi.nlm.nih.gov/pubmed/32864214
http://dx.doi.org/10.7717/peerj.9688