
Estimating the total genome length of a metagenomic sample using k-mers

BACKGROUND: Metagenomic sequencing is a powerful technology for studying the mixture of microbes or the microbiomes on human and in the environment. One basic task of analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of m...


Bibliographic Details
Main authors:	Hua, Kui; Zhang, Xuegong
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456951/ https://www.ncbi.nlm.nih.gov/pubmed/30967110 http://dx.doi.org/10.1186/s12864-019-5467-x

_version_	1783409833535340544
author	Hua, Kui Zhang, Xuegong
author_facet	Hua, Kui Zhang, Xuegong
author_sort	Hua, Kui
collection	PubMed
description	BACKGROUND: Metagenomic sequencing is a powerful technology for studying the mixture of microbes or the microbiomes on human and in the environment. One basic task of analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of microbiome composition, limited availability of known reference genomes, and usually insufficient sequencing coverage. RESULTS: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses. CONCLUSIONS: We proposed the question of how long the total genome length of all different species in a microbial community is and introduced a method to answer it. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5467-x) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6456951
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6456951 2019-04-19 Estimating the total genome length of a metagenomic sample using k-mers Hua, Kui Zhang, Xuegong BMC Genomics Research
BioMed Central 2019-04-04 /pmc/articles/PMC6456951/ /pubmed/30967110 http://dx.doi.org/10.1186/s12864-019-5467-x Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Estimating the total genome length of a metagenomic sample using k-mers
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456951/ https://www.ncbi.nlm.nih.gov/pubmed/30967110 http://dx.doi.org/10.1186/s12864-019-5467-x
work_keys_str_mv	AT huakui estimatingthetotalgenomelengthofametagenomicsampleusingkmers AT zhangxuegong estimatingthetotalgenomelengthofametagenomicsampleusingkmers
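
The abstract in this record reduces total genome length estimation to counting distinct k-mers and examining the coverage distribution of the observed k-mers. The following is a minimal sketch of those two raw quantities only, not the authors' estimator and not their k-mer redundancy index (KRI). The function names, the use of canonical (strand-independent) k-mers, and the toy reads are assumptions made for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases (assumes an A/C/G/T alphabet).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k):
    """Count occurrences of each canonical k-mer across all reads.

    k-mers containing any character other than A, C, G, T are skipped.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def coverage_histogram(counts):
    """Map each observed coverage value to the number of distinct k-mers with it."""
    return Counter(counts.values())


# Hypothetical toy reads, for illustration only.
reads = ["ACGTAC", "CGTACG", "GTACGT"]
counts = count_kmers(reads, k=4)
hist = coverage_histogram(counts)

print("distinct k-mers observed:", len(counts))
print("coverage histogram:", dict(hist))
```

The number of distinct k-mers observed in the reads is only a lower bound on the number present in the community, since k-mers from poorly covered genomes may never be sequenced. Per the abstract, the authors' method estimates the true total from the coverage distribution and then uses the KRI to relate that count to genome length.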
