From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools

In metagenomic analyses of microbiomes, one of the first steps is usually the taxonomic classification of reads by comparison to a database of previously taxonomically classified genomes. While different studies comparing metagenomic taxonomic classification methods have determined that different to...

Bibliographic Details
Main authors: Wright, Robyn J., Comeau, Andrè M., Langille, Morgan G. I.
Format: Online Article Text
Language: English
Published: Microbiology Society 2023
Subjects: Research Articles
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10132073/
https://www.ncbi.nlm.nih.gov/pubmed/36867161
http://dx.doi.org/10.1099/mgen.0.000949
author Wright, Robyn J.
Comeau, Andrè M.
Langille, Morgan G. I.
collection PubMed
description In metagenomic analyses of microbiomes, one of the first steps is usually the taxonomic classification of reads by comparison to a database of previously taxonomically classified genomes. While different studies comparing metagenomic taxonomic classification methods have determined that different tools are ‘best’, there are two tools that have been used the most to-date: Kraken (k-mer-based classification against a user-constructed database) and MetaPhlAn (classification by alignment to clade-specific marker genes), the latest versions of which are Kraken2 and MetaPhlAn 3, respectively. We found large discrepancies in both the proportion of reads that were classified as well as the number of species that were identified when we used both Kraken2 and MetaPhlAn 3 to classify reads within metagenomes from human-associated or environmental datasets. We then investigated which of these tools would give classifications closest to the real composition of metagenomic samples using a range of simulated and mock samples and examined the combined impact of tool–parameter–database choice on the taxonomic classifications given. This revealed that there may not be a one-size-fits-all ‘best’ choice. While Kraken2 can achieve better overall performance, with higher precision, recall and F1 scores, as well as alpha- and beta-diversity measures closer to the known composition than MetaPhlAn 3, the computational resources required for this may be prohibitive for many researchers, and the default database and parameters should not be used. We therefore conclude that the best tool–parameter–database choice for a particular application depends on the scientific question of interest, which performance metric is most important for this question and the limit of available computational resources.
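The precision, recall and F1 scores named in the description can be illustrated with a small sketch. It assumes a presence/absence comparison at species level between a classifier's output and a known mock-community composition; the article itself may compute these metrics differently (for example, with abundance thresholds), and the function name and species lists below are hypothetical.

```python
def precision_recall_f1(predicted, truth):
    """Score a set of predicted taxa against the known composition.

    Assumption: taxa are compared by presence/absence only, so
    abundances are ignored.
    """
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # taxa correctly reported
    fp = len(predicted - truth)   # taxa reported but not in the sample
    fn = len(truth - predicted)   # taxa in the sample but missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical mock community and classifier output, for illustration only.
truth = ["Escherichia coli", "Bacillus subtilis",
         "Staphylococcus aureus", "Listeria monocytogenes"]
predicted = ["Escherichia coli", "Bacillus subtilis", "Salmonella enterica"]

precision, recall, f1 = precision_recall_f1(predicted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A classifier that reports many spurious species lowers precision, while one that misses true species lowers recall; F1 is the harmonic mean of the two, so it penalises either failure.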
format Online
Article
Text
id pubmed-10132073
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-10132073 2023-04-27 From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools Wright, Robyn J. Comeau, Andrè M. Langille, Morgan G. I. Microb Genom Research Articles Microbiology Society 2023-03-03 /pmc/articles/PMC10132073/ /pubmed/36867161 http://dx.doi.org/10.1099/mgen.0.000949 Text en © 2023 The Authors https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License.
title From defaults to databases: parameter and database choice dramatically impact the performance of metagenomic taxonomic classification tools
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10132073/
https://www.ncbi.nlm.nih.gov/pubmed/36867161
http://dx.doi.org/10.1099/mgen.0.000949