
Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities

The Human Microbiome Project (HMP) aims to characterize the microbial communities of 18 body sites from healthy individuals. To accomplish this, the HMP generated two types of shotgun data: reference shotgun sequences isolated from different anatomical sites on the human body and shotgun metagenomic...


Bibliographic Details
Main Authors: Martin, John; Sykes, Sean; Young, Sarah; Kota, Karthik; Sanka, Ravi; Sheth, Nihar; Orvis, Joshua; Sodergren, Erica; Wang, Zhengyuan; Weinstock, George M.; Mitreva, Makedonka
Format: Online Article Text
Language: English
Published: Public Library of Science 2012
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3374613/
https://www.ncbi.nlm.nih.gov/pubmed/22719831
http://dx.doi.org/10.1371/journal.pone.0036427
_version_ 1782235668222574592
author Martin, John
Sykes, Sean
Young, Sarah
Kota, Karthik
Sanka, Ravi
Sheth, Nihar
Orvis, Joshua
Sodergren, Erica
Wang, Zhengyuan
Weinstock, George M.
Mitreva, Makedonka
author_facet Martin, John
Sykes, Sean
Young, Sarah
Kota, Karthik
Sanka, Ravi
Sheth, Nihar
Orvis, Joshua
Sodergren, Erica
Wang, Zhengyuan
Weinstock, George M.
Mitreva, Makedonka
author_sort Martin, John
collection PubMed
description The Human Microbiome Project (HMP) aims to characterize the microbial communities of 18 body sites from healthy individuals. To accomplish this, the HMP generated two types of shotgun data: reference shotgun sequences isolated from different anatomical sites on the human body and shotgun metagenomic sequences from the microbial communities of each site. The alignment strategy for characterizing these metagenomic communities using available reference sequences is important to the success of HMP data analysis. Six next-generation aligners were used to align a community of known composition against a database comprising reference organisms known to be present in that community. All aligners report nearly complete genome coverage (>97%) for strains with over 6X depth of coverage; however, they differ in speed, memory requirements, and ease-of-use issues such as database size limitations and supported mapping strategies. The selected aligner was tested across a range of parameters to maximize sensitivity while maintaining a low false positive rate. We found that constraining alignment length had more impact on sensitivity than did constraining similarity in all cases tested. However, when reference species were replaced with phylogenetic neighbors, similarity began to play a larger role in detection. We also show that choosing the top hit randomly when multiple, equally strong mappings are available increases overall sensitivity at the expense of taxonomic resolution. The results of this study identified a strategy that was used to map over 3 tera-bases of microbial sequence against a database of more than 5,000 reference genomes in just over a month.
format Online
Article
Text
id pubmed-3374613
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3374613 2012-06-20 Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities Martin, John Sykes, Sean Young, Sarah Kota, Karthik Sanka, Ravi Sheth, Nihar Orvis, Joshua Sodergren, Erica Wang, Zhengyuan Weinstock, George M. Mitreva, Makedonka PLoS One Research Article The Human Microbiome Project (HMP) aims to characterize the microbial communities of 18 body sites from healthy individuals. To accomplish this, the HMP generated two types of shotgun data: reference shotgun sequences isolated from different anatomical sites on the human body and shotgun metagenomic sequences from the microbial communities of each site. The alignment strategy for characterizing these metagenomic communities using available reference sequences is important to the success of HMP data analysis. Six next-generation aligners were used to align a community of known composition against a database comprising reference organisms known to be present in that community. All aligners report nearly complete genome coverage (>97%) for strains with over 6X depth of coverage; however, they differ in speed, memory requirements, and ease-of-use issues such as database size limitations and supported mapping strategies. The selected aligner was tested across a range of parameters to maximize sensitivity while maintaining a low false positive rate. We found that constraining alignment length had more impact on sensitivity than did constraining similarity in all cases tested. However, when reference species were replaced with phylogenetic neighbors, similarity began to play a larger role in detection. We also show that choosing the top hit randomly when multiple, equally strong mappings are available increases overall sensitivity at the expense of taxonomic resolution. 
The results of this study identified a strategy that was used to map over 3 tera-bases of microbial sequence against a database of more than 5,000 reference genomes in just over a month. Public Library of Science 2012-06-13 /pmc/articles/PMC3374613/ /pubmed/22719831 http://dx.doi.org/10.1371/journal.pone.0036427 Text en Martin et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle Research Article
Martin, John
Sykes, Sean
Young, Sarah
Kota, Karthik
Sanka, Ravi
Sheth, Nihar
Orvis, Joshua
Sodergren, Erica
Wang, Zhengyuan
Weinstock, George M.
Mitreva, Makedonka
Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities
title Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities
title_full Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities
title_fullStr Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities
title_full_unstemmed Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities
title_short Optimizing Read Mapping to Reference Genomes to Determine Composition and Species Prevalence in Microbial Communities
title_sort optimizing read mapping to reference genomes to determine composition and species prevalence in microbial communities
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3374613/
https://www.ncbi.nlm.nih.gov/pubmed/22719831
http://dx.doi.org/10.1371/journal.pone.0036427