
Estimating the total genome length of a metagenomic sample using k-mers

BACKGROUND: Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of m...

Bibliographic Details
Main Authors: Hua, Kui; Zhang, Xuegong
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456951/
https://www.ncbi.nlm.nih.gov/pubmed/30967110
http://dx.doi.org/10.1186/s12864-019-5467-x
_version_ 1783409833535340544
author Hua, Kui
Zhang, Xuegong
collection PubMed
description BACKGROUND: Metagenomic sequencing is a powerful technology for studying mixtures of microbes, or microbiomes, on humans and in the environment. One basic task in analyzing metagenomic data is to identify the component genomes in the community. This task is challenging because of the complexity of microbiome composition, the limited availability of known reference genomes, and sequencing coverage that is usually insufficient. RESULTS: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in the sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of the observed k-mers, and introduced a k-mer redundancy index (KRI) to bridge the gap between the count of distinct k-mers and the total genome length. We demonstrated the effectiveness of the proposed method on a set of carefully designed simulated data sets corresponding to multiple situations found in real metagenomic data. Results on real data indicate that the amount of uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses. CONCLUSIONS: We posed the question of how long the combined genomes of all distinct species in a microbial community are, and introduced a method to answer it. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5467-x) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6456951
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6456951 2019-04-19 Estimating the total genome length of a metagenomic sample using k-mers Hua, Kui Zhang, Xuegong BMC Genomics Research
BioMed Central 2019-04-04 /pmc/articles/PMC6456951/ /pubmed/30967110 http://dx.doi.org/10.1186/s12864-019-5467-x Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Estimating the total genome length of a metagenomic sample using k-mers
topic Research
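
The abstract in this record describes estimating the total number of distinct k-mers from the coverage distribution of the observed k-mers. The sketch below illustrates that general idea on toy data. It is not the article's method: the reads are hypothetical, the function names are invented for illustration, and the classical bias-corrected Chao1 richness estimator is swapped in for the authors' estimator. The KRI step that converts a k-mer count into a genome length is not shown, because this record does not define it. Reverse-complement (canonical) k-mers and sequencing errors are also ignored here.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each multiplicity c to the number of distinct k-mers seen exactly c times."""
    return Counter(counts.values())


def chao1_distinct_estimate(counts):
    """Bias-corrected Chao1 estimate of the total number of distinct k-mers.

    Assumption: Chao1 is a stand-in chosen for illustration only;
    it is NOT the estimator proposed in the article.
    """
    hist = abundance_histogram(counts)
    observed = len(counts)                    # distinct k-mers actually seen
    f1, f2 = hist.get(1, 0), hist.get(2, 0)   # singletons and doubletons
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


reads = ["ACGTAC", "CGTACG", "TTGACA"]  # hypothetical toy reads
counts = count_kmers(reads, k=4)
hist = abundance_histogram(counts)
estimate = chao1_distinct_estimate(counts)
```

On these toy reads, 7 distinct 4-mers are observed, 5 of them once and 2 of them twice, so the estimate exceeds the observed count. That gap is the kind of uncaptured information the abstract refers to.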