Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics

BACKGROUND: Metagenomics is the study of microbial communities by sequencing of genetic material directly from environmental or clinical samples. The genes present in the metagenomes are quantified by annotating and counting the generated DNA fragments. Identification of differentially abundant gene...

Bibliographic Details
Main authors: Jonsson, Viktor, Österlund, Tobias, Nerman, Olle, Kristiansson, Erik
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4727335/
https://www.ncbi.nlm.nih.gov/pubmed/26810311
http://dx.doi.org/10.1186/s12864-016-2386-y
author Jonsson, Viktor
Österlund, Tobias
Nerman, Olle
Kristiansson, Erik
collection PubMed
description BACKGROUND: Metagenomics is the study of microbial communities by sequencing of genetic material directly from environmental or clinical samples. The genes present in the metagenomes are quantified by annotating and counting the generated DNA fragments. Identification of differentially abundant genes between metagenomes can provide important information about differences in community structure, diversity and biological function. Metagenomic data are, however, high-dimensional, contain high levels of biological and technical noise, and typically have few biological replicates. The statistical analysis is therefore challenging and many approaches have been suggested to date. RESULTS: In this article we perform a comprehensive evaluation of 14 methods for identification of differentially abundant genes between metagenomes. The methods are compared based on the power to detect differentially abundant genes and their ability to correctly estimate the type I error rate and the false discovery rate. We show that sample size, effect size, and gene abundance greatly affect the performance of all methods. Several of the methods also show non-optimal model assumptions and biased false discovery rate estimates, which can result in too many false positives. We also demonstrate that the performance of several of the methods differs substantially between metagenomic data sequenced by different technologies. CONCLUSIONS: Two methods primarily designed for the analysis of RNA sequencing data (edgeR and DESeq2), together with a generalized linear model based on an overdispersed Poisson distribution, were found to have the best overall performance. The results presented in this study may serve as a guide for selecting suitable statistical methods for identification of differentially abundant genes in metagenomes.
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-016-2386-y) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-4727335
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4727335 2016-01-27 Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics. Jonsson, Viktor; Österlund, Tobias; Nerman, Olle; Kristiansson, Erik. BMC Genomics. Research Article. BioMed Central 2016-01-25 /pmc/articles/PMC4727335/ /pubmed/26810311 http://dx.doi.org/10.1186/s12864-016-2386-y Text en © Jonsson et al. 2016. Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Statistical evaluation of methods for identification of differentially abundant genes in comparative metagenomics
topic Research Article