Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper

Across the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially...

Bibliographic Details
Main authors: Richmond, Phillip Andrew; Kaye, Alice Mary; Kounkou, Godfrain Jacques; Av-Shalom, Tamar Vered; Wasserman, Wyeth W.
Format: Online Article Text
Language: English
Published: Public Library of Science 2021
Subjects: Research Article
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8016220/
https://www.ncbi.nlm.nih.gov/pubmed/33750951
http://dx.doi.org/10.1371/journal.pcbi.1008815
author Richmond, Phillip Andrew
Kaye, Alice Mary
Kounkou, Godfrain Jacques
Av-Shalom, Tamar Vered
Wasserman, Wyeth W.
collection PubMed
description Across the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially obviating the need for processing, or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper which facilitates a “reverse mapping” approach in which high throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper’s utility for recovering depth of coverage, and accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample’s population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods, but with increased flexibility. We present two examples of ways in which this flexibility allows the analysis of genome features not well-represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and biology-informed collections of reference queries. In the long term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach.
FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper.
format Online
Article
Text
id pubmed-8016220
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-8016220 2021-04-08 PLoS Comput Biol Research Article Public Library of Science 2021-03-22 /pmc/articles/PMC8016220/ /pubmed/33750951 http://dx.doi.org/10.1371/journal.pcbi.1008815 Text en © 2021 Richmond et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper
topic Research Article