
Evaluation of methods for detecting human reads in microbial sequencing datasets

Sequencing data from host-associated microbes can often be contaminated by the body of the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to...


Bibliographic Details
Main authors: Bush, Stephen J., Connor, Thomas R., Peto, Tim E.A., Crook, Derrick W., Walker, A. Sarah
Format: Online Article Text
Language: English
Published: Microbiology Society 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7478626/
https://www.ncbi.nlm.nih.gov/pubmed/32558637
http://dx.doi.org/10.1099/mgen.0.000393
author Bush, Stephen J.
Connor, Thomas R.
Peto, Tim E.A.
Crook, Derrick W.
Walker, A. Sarah
collection PubMed
description Sequencing data from host-associated microbes can often be contaminated by the body of the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, into which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the set of human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. 
A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
format Online
Article
Text
id pubmed-7478626
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Microbiology Society
record_format MEDLINE/PubMed
spelling pubmed-7478626 2020-09-09 Evaluation of methods for detecting human reads in microbial sequencing datasets Bush, Stephen J. Connor, Thomas R. Peto, Tim E.A. Crook, Derrick W. Walker, A. Sarah Microb Genom Research Article Microbiology Society 2020-06-19 /pmc/articles/PMC7478626/ /pubmed/32558637 http://dx.doi.org/10.1099/mgen.0.000393 Text en © 2020 The Authors http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
title Evaluation of methods for detecting human reads in microbial sequencing datasets
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7478626/
https://www.ncbi.nlm.nih.gov/pubmed/32558637
http://dx.doi.org/10.1099/mgen.0.000393