Estimating the total genome length of a metagenomic sample using k-mers
BACKGROUND: Metagenomic sequencing is a powerful technology for studying the mixture of microbes or the microbiomes on human and in the environment. One basic task of analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of m...
Main authors: | Hua, Kui, Zhang, Xuegong |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456951/ https://www.ncbi.nlm.nih.gov/pubmed/30967110 http://dx.doi.org/10.1186/s12864-019-5467-x |
_version_ | 1783409833535340544 |
---|---|
author | Hua, Kui Zhang, Xuegong |
author_facet | Hua, Kui Zhang, Xuegong |
author_sort | Hua, Kui |
collection | PubMed |
description | BACKGROUND: Metagenomic sequencing is a powerful technology for studying the mixture of microbes or the microbiomes on human and in the environment. One basic task of analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of microbiome composition, limited availability of known reference genomes, and usually insufficient sequencing coverage. RESULTS: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses. CONCLUSIONS: We proposed the question of how long the total genome length of all different species in a microbial community is and introduced a method to answer it. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5467-x) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6456951 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-64569512019-04-19 Estimating the total genome length of a metagenomic sample using k-mers Hua, Kui Zhang, Xuegong BMC Genomics Research BACKGROUND: Metagenomic sequencing is a powerful technology for studying the mixture of microbes or the microbiomes on human and in the environment. One basic task of analyzing metagenomic data is to identify the component genomes in the community. This task is challenging due to the complexity of microbiome composition, limited availability of known reference genomes, and usually insufficient sequencing coverage. RESULTS: As an initial step toward understanding the complete composition of a metagenomic sample, we studied the problem of estimating the total length of all distinct component genomes in a metagenomic sample. We showed that this problem can be solved by estimating the total number of distinct k-mers in all the metagenomic sequencing data. We proposed a method for this estimation based on the sequencing coverage distribution of observed k-mers, and introduced a k-mer redundancy index (KRI) to fill in the gap between the count of distinct k-mers and the total genome length. We showed the effectiveness of the proposed method on a set of carefully designed simulation data corresponding to multiple situations of true metagenomic data. Results on real data indicate that the uncaptured genomic information can vary dramatically across metagenomic samples, with the potential to mislead downstream analyses. CONCLUSIONS: We proposed the question of how long the total genome length of all different species in a microbial community is and introduced a method to answer it. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12864-019-5467-x) contains supplementary material, which is available to authorized users. 
BioMed Central 2019-04-04 /pmc/articles/PMC6456951/ /pubmed/30967110 http://dx.doi.org/10.1186/s12864-019-5467-x Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
title | Estimating the total genome length of a metagenomic sample using k-mers |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6456951/ https://www.ncbi.nlm.nih.gov/pubmed/30967110 http://dx.doi.org/10.1186/s12864-019-5467-x |