SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines

BACKGROUND: Human tissue is increasingly being whole genome sequenced as we transition into an era of genomic medicine. With this arises the potential to detect sequences originating from microorganisms, including pathogens, amid the plethora of human sequencing reads. In cancer research, the tumorig...

Bibliographic Details
Main Authors:	Gihawi, Abraham; Rallapalli, Ghanasyam; Hurst, Rachel; Cooper, Colin S.; Leggett, Richard M.; Brewer, Daniel S.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805339/ https://www.ncbi.nlm.nih.gov/pubmed/31639030 http://dx.doi.org/10.1186/s13059-019-1819-8

author	Gihawi, Abraham; Rallapalli, Ghanasyam; Hurst, Rachel; Cooper, Colin S.; Leggett, Richard M.; Brewer, Daniel S.
author_sort	Gihawi, Abraham
collection	PubMed
description	BACKGROUND: Human tissue is increasingly being whole genome sequenced as we transition into an era of genomic medicine. With this arises the potential to detect sequences originating from microorganisms, including pathogens, amid the plethora of human sequencing reads. In cancer research, the tumorigenic ability of pathogens is being recognized, for example, Helicobacter pylori and human papillomavirus in the cases of gastric non-cardia and cervical carcinomas, respectively. As of yet, no benchmark has been carried out on the performance of computational approaches for bacterial and viral detection within host-dominated sequence data. RESULTS: We present the results of benchmarking over 70 distinct combinations of tools and parameters on 100 simulated cancer datasets spiked with realistic proportions of bacteria. mOTUs2 and Kraken are the highest-performing individual tools, achieving median genus-level F1 scores of 0.90 and 0.91, respectively. mOTUs2 demonstrates a high performance in estimating bacterial proportions. Employing Kraken on unassembled sequencing reads produces a good but variable performance depending on post-classification filtering parameters. These approaches are investigated on a selection of cervical and gastric cancer whole genome sequences where Alphapapillomavirus and Helicobacter are detected in addition to a variety of other interesting genera. CONCLUSIONS: We provide the top-performing pipelines from this benchmark in a unifying tool called SEPATH, which is amenable to high-throughput sequencing studies across a range of high-performance computing clusters. SEPATH provides a benchmarked and convenient approach to detect pathogens in tissue sequence data, helping to determine the relationship between metagenomics and disease.
format	Online Article Text
id	pubmed-6805339
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6805339 2019-10-24 Genome Biol Research BioMed Central 2019-10-22 /pmc/articles/PMC6805339/ /pubmed/31639030 http://dx.doi.org/10.1186/s13059-019-1819-8 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	SEPATH: benchmarking the search for pathogens in human tissue whole genome sequence data leads to template pipelines
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805339/ https://www.ncbi.nlm.nih.gov/pubmed/31639030 http://dx.doi.org/10.1186/s13059-019-1819-8
