
alignparse: A Python package for parsing complex features from high-throughput long-read sequencing

Advances in sequencing technology have made it possible to generate large numbers of long, high-accuracy sequencing reads. For instance, the new PacBio Sequel platform can generate hundreds of thousands of high-quality circular consensus sequences in a single run (Hebert et al., 2018; Rhoads & A...


Bibliographic Details
Main Authors:	Crawford, Katharine H.D., Bloom, Jesse D.
Format:	Online Article Text
Language:	English
Published:	2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939853/ https://www.ncbi.nlm.nih.gov/pubmed/31897449 http://dx.doi.org/10.21105/joss.01915

_version_	1783484263768784896
author	Crawford, Katharine H.D. Bloom, Jesse D.
author_facet	Crawford, Katharine H.D. Bloom, Jesse D.
author_sort	Crawford, Katharine H.D.
collection	PubMed
description	Advances in sequencing technology have made it possible to generate large numbers of long, high-accuracy sequencing reads. For instance, the new PacBio Sequel platform can generate hundreds of thousands of high-quality circular consensus sequences in a single run (Hebert et al., 2018; Rhoads & Au, 2015). Good programs exist for aligning these reads for genome assembly (Chaisson & Tesler, 2012; Li, 2018). However, these long reads can also be used for other purposes, such as sequencing PCR amplicons that contain various features of interest. For instance, PacBio circular consensus sequences have been used to identify the mutations in influenza viruses in single cells (Russell et al., 2019), or to link barcodes to gene mutants in deep mutational scanning (Matreyek et al., 2018). For such applications, the alignment of the sequences to the targets may be fairly trivial, but it is not trivial to then parse specific features of interest (such as mutations, unique molecular identifiers, cell barcodes, and flanking sequences) from these alignments. Here we describe alignparse, a Python package for parsing complex sets of features from long sequences that map to known targets. Specifically, it allows the user to provide complex target sequences in GenBank Flat File format that contain an arbitrary number of user-defined sub-sequence features (Sayers et al., 2019). It then aligns the sequencing reads to these targets and filters alignments based on whether the user-specified features are present with the desired identities (which can be set to different thresholds for different features). Finally, it parses out the sequences, mutations, and/or accuracy (sequence quality) of these features as specified by the user. The flexibility of this package therefore fulfills the need for a tool to extract and analyze complex sets of features in large numbers of long sequencing reads.
format	Online Article Text
id	pubmed-6939853
institution	National Center for Biotechnology Information
language	English
publishDate	2019
record_format	MEDLINE/PubMed
spelling	pubmed-69398532020-01-02 alignparse: A Python package for parsing complex features from high-throughput long-read sequencing Crawford, Katharine H.D. Bloom, Jesse D. J Open Source Softw Article Advances in sequencing technology have made it possible to generate large numbers of long, high-accuracy sequencing reads. For instance, the new PacBio Sequel platform can generate hundreds of thousands of high-quality circular consensus sequences in a single run (Hebert et al., 2018; Rhoads & Au, 2015). Good programs exist for aligning these reads for genome assembly (Chaisson & Tesler, 2012; Li, 2018). However, these long reads can also be used for other purposes, such as sequencing PCR amplicons that contain various features of interest. For instance, PacBio circular consensus sequences have been used to identify the mutations in influenza viruses in single cells (Russell et al., 2019), or to link barcodes to gene mutants in deep mutational scanning (Matreyek et al., 2018). For such applications, the alignment of the sequences to the targets may be fairly trivial, but it is not trivial to then parse specific features of interest (such as mutations, unique molecular identifiers, cell barcodes, and flanking sequences) from these alignments. Here we describe alignparse, a Python package for parsing complex sets of features from long sequences that map to known targets. Specifically, it allows the user to provide complex target sequences in GenBank Flat File format that contain an arbitrary number of user-defined sub-sequence features (Sayers et al., 2019). It then aligns the sequencing reads to these targets and filters alignments based on whether the user-specified features are present with the desired identities (which can be set to different thresholds for different features). Finally, it parses out the sequences, mutations, and/or accuracy (sequence quality) of these features as specified by the user. The flexibility of this package therefore fulfills the need for a tool to extract and analyze complex sets of features in large numbers of long sequencing reads. 2019-12-11 2019 /pmc/articles/PMC6939853/ /pubmed/31897449 http://dx.doi.org/10.21105/joss.01915 Text en Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC-BY (https://creativecommons.org/licenses/by/4.0/)).
spellingShingle	Article Crawford, Katharine H.D. Bloom, Jesse D. alignparse: A Python package for parsing complex features from high-throughput long-read sequencing
title	alignparse: A Python package for parsing complex features from high-throughput long-read sequencing
title_full	alignparse: A Python package for parsing complex features from high-throughput long-read sequencing
title_fullStr	alignparse: A Python package for parsing complex features from high-throughput long-read sequencing
title_full_unstemmed	alignparse: A Python package for parsing complex features from high-throughput long-read sequencing
title_short	alignparse: A Python package for parsing complex features from high-throughput long-read sequencing
title_sort	alignparse: a python package for parsing complex features from high-throughput long-read sequencing
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6939853/ https://www.ncbi.nlm.nih.gov/pubmed/31897449 http://dx.doi.org/10.21105/joss.01915
work_keys_str_mv	AT crawfordkatharinehd alignparseapythonpackageforparsingcomplexfeaturesfromhighthroughputlongreadsequencing AT bloomjessed alignparseapythonpackageforparsingcomplexfeaturesfromhighthroughputlongreadsequencing
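
The description in this record outlines a filter-then-parse workflow: named features on a known target, a per-feature identity threshold, and extraction of each feature's sequence and mutations. The following is a minimal sketch of that idea in plain Python. It does not use the alignparse API; the names `Feature` and `parse_features` are hypothetical, and it assumes the read is already aligned to the target with no insertions or deletions, which real alignments will not generally satisfy.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """A named sub-sequence of the target (hypothetical helper type)."""
    name: str
    start: int           # 0-based, inclusive
    end: int             # 0-based, exclusive
    min_identity: float  # per-feature identity threshold in [0, 1]


def parse_features(target, read, features):
    """Return per-feature sequence, mutations, and identity.

    Returns None if any feature falls below its identity threshold.
    Simplifying assumption: `read` is aligned to `target` without indels,
    so both strings share the same coordinates.
    """
    if len(read) != len(target):
        raise ValueError("this sketch only handles indel-free alignments")
    parsed = {}
    for feat in features:
        ref = target[feat.start:feat.end]
        seq = read[feat.start:feat.end]
        # Mutations are numbered from 1 relative to the start of the feature.
        mutations = [
            f"{r}{i + 1}{q}"
            for i, (r, q) in enumerate(zip(ref, seq))
            if r != q
        ]
        identity = 1 - len(mutations) / len(ref)
        if identity < feat.min_identity:
            return None  # filter out the whole read
        parsed[feat.name] = {
            "sequence": seq,
            "mutations": mutations,
            "identity": identity,
        }
    return parsed


# Toy target: an 8-nt "gene", a 4-nt variable "barcode", then flanking sequence.
target = "ACGTACGTACGGTTAA"
features = [
    Feature("gene", 0, 8, min_identity=0.75),
    Feature("barcode", 8, 12, min_identity=0.0),  # barcodes may differ freely
]

good_read = "ACGAACGTTTGCTTAA"
bad_read = "TTTTACGTTTGCTTAA"

result = parse_features(target, good_read, features)
rejected = parse_features(target, bad_read, features)
```

Here `result["gene"]["mutations"]` holds the single substitution in the gene and `result["barcode"]["sequence"]` holds the read's barcode, while `rejected` is `None` because the gene in `bad_read` falls below its threshold. Giving each feature its own threshold mirrors the description's point that identities "can be set to different thresholds for different features".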
