
Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures


Bibliographic Details
Main Authors: Wang, Ying; Fu, Lei; Ren, Jie; Yu, Zhaoxia; Chen, Ting; Sun, Fengzhu
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2018
Subjects: Microbiology
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5943621/
https://www.ncbi.nlm.nih.gov/pubmed/29774017
http://dx.doi.org/10.3389/fmicb.2018.00872
_version_ 1783321665259700224
author Wang, Ying
Fu, Lei
Ren, Jie
Yu, Zhaoxia
Chen, Ting
Sun, Fengzhu
collection PubMed
description Comparing metagenomic samples is crucial for understanding microbial communities. For different groups of microbial communities, such as human gut metagenomic samples from patients with a certain disease and from healthy controls, identifying group-specific sequences offers essential information for potential biomarker discovery. In our study, a sequence is considered “group-specific” if it is present, or rich, in one group but absent, or scarce, in another. Our main purpose is to discover group-specific sequence regions between control and case groups as disease-associated markers. We developed a computational pipeline based on long k-mers (k ≥ 30 bp) to detect group-specific sequences at strain resolution without reference sequences, sequence alignments, or metagenome-wide de novo assembly. We call our method MetaGO: Group-specific oligonucleotide analysis for metagenomic samples. It is implemented as an open-source, parallel pipeline on Apache Spark. We applied MetaGO to one simulated and three real metagenomic datasets to evaluate the discriminative capability of the identified group-specific markers. In the simulated dataset, 99.11% of group-specific logical 40-mers covered 98.89% of the disease-specific regions of the disease-associated strain. In addition, 97.90% of group-specific numerical 40-mers covered 99.61% and 96.39% of the differentially abundant genome and regions between the two groups, respectively. For a large-scale metagenomic liver cirrhosis (LC)-associated dataset, we identified 37,647 group-specific 40-mer features. Any one of these features can predict the disease status of the training samples with an average of sensitivity and specificity higher than 0.8. Random forests classification using the top 10 group-specific features yielded a higher AUC (from ∼0.8 to ∼0.9) than that of previous studies. All group-specific 40-mers were present in LC patients but not in healthy controls. All 11 assembled LC-specific sequences can be mapped to two strains of Veillonella parvula: UTDB1-3 and DSM2008. Experiments on the other two real datasets, related to Inflammatory Bowel Disease and Type 2 Diabetes in Women, consistently demonstrated that MetaGO achieved better prediction accuracy with fewer features than previous studies. The experiments showed that MetaGO is a powerful tool for identifying group-specific k-mers, which would be clinically applicable for disease prediction. MetaGO is available at https://github.com/VVsmileyx/MetaGO.
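The idea of a "logical" group-specific k-mer (present in case samples, absent from controls) and the "average of sensitivity and specificity" score mentioned in the description can be illustrated with a small Python sketch. This is not MetaGO's code, which the description says runs on Apache Spark; the function names, the presence threshold, the toy reads, and k = 4 (instead of k ≥ 30) are all assumptions made for illustration, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read, skipping those containing N."""
    read = read.upper()
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if "N" not in kmer:
            yield kmer


def sample_kmer_counts(reads, k):
    """Count k-mers over all reads of one sample."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def logical_group_specific(case_samples, control_samples, k, min_case_fraction=1.0):
    """Return k-mers seen in at least `min_case_fraction` of the case samples
    and in none of the control samples (hypothetical criterion)."""
    presence = Counter()
    for reads in case_samples:
        presence.update(sample_kmer_counts(reads, k).keys())
    seen_in_controls = set()
    for reads in control_samples:
        seen_in_controls.update(sample_kmer_counts(reads, k).keys())
    needed = min_case_fraction * len(case_samples)
    return {
        kmer
        for kmer, n_samples in presence.items()
        if n_samples >= needed and kmer not in seen_in_controls
    }


def mean_sensitivity_specificity(predicted, actual):
    """Average of sensitivity and specificity for boolean predictions."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fn = sum(1 for p, a in pairs if not p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Toy data: each sample is a list of reads (illustrative only).
case_samples = [["ACGTTGCA"], ["TTACGTTG"]]
control_samples = [["TTGCAAAA"], ["GGGTTGCC"]]
specific = logical_group_specific(case_samples, control_samples, k=4)
score = mean_sensitivity_specificity(
    predicted=[True, True, False, True],
    actual=[True, True, False, False],
)
```

A single k-mer can then be used as a one-feature predictor (sample is predicted "case" if the k-mer is present), and `mean_sensitivity_specificity` scores that prediction against the known labels.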
format Online
Article
Text
id pubmed-5943621
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-5943621 2018-05-17 Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures Wang, Ying; Fu, Lei; Ren, Jie; Yu, Zhaoxia; Chen, Ting; Sun, Fengzhu Front Microbiol Microbiology Frontiers Media S.A. 2018-05-03 /pmc/articles/PMC5943621/ /pubmed/29774017 http://dx.doi.org/10.3389/fmicb.2018.00872 Text en Copyright © 2018 Wang, Fu, Ren, Yu, Chen and Sun. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Identifying Group-Specific Sequences for Microbial Communities Using Long k-mer Sequence Signatures
topic Microbiology