A pipeline for local assembly of minisatellite alleles from single-molecule sequencing data

MOTIVATION: The advent of Next Generation Sequencing (NGS) has led to the generation of enormous volumes of short read sequence data, cheaply and in reasonable time scales. Nevertheless, the quality of genome assemblies generated using NGS technologies has been greatly affected, compared to those ge...

Bibliographic Details
Main authors: Ogeh, Denye, Badge, Richard
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408865/
https://www.ncbi.nlm.nih.gov/pubmed/27998939
http://dx.doi.org/10.1093/bioinformatics/btw687
_version_ 1783232379218821120
author Ogeh, Denye
Badge, Richard
author_facet Ogeh, Denye
Badge, Richard
author_sort Ogeh, Denye
collection PubMed
description MOTIVATION: The advent of Next Generation Sequencing (NGS) has led to the generation of enormous volumes of short read sequence data, cheaply and in reasonable time scales. Nevertheless, the quality of genome assemblies generated using NGS technologies has been greatly affected, compared to those generated using Sanger DNA sequencing. This is largely due to the inability of short read sequence data to scaffold repetitive structures, creating gaps, inversions and rearrangements and resulting in assemblies that are, at best, draft forms. Third generation single-molecule sequencing (SMS) technologies (e.g. Pacific Biosciences Single Molecule Real Time (SMRT) system) address this challenge by generating sequences with increased read lengths, offering the prospect to better recover these complex repetitive structures, concomitantly improving assembly quality. RESULTS: Here, we evaluate the ability of SMS data (specifically human genome Pacific Biosciences SMRT data) to recover poorly represented repetitive sequences (specifically, GC-rich human minisatellites). To do this we designed a pipeline for the collection, processing and local assembly of single-molecule sequence data to form accurate contiguous local reconstructions. Our results show the recovery of an allele of the non-coding minisatellite MS1 (located on chromosome 1 at 1p33-35) at greater than 97% identity to reference (GRCh38) from the unprocessed sequence data of a haploid complete hydatidiform mole (CHM1) cell line. Furthermore, our assembly revealed an allele of over 500 repeat units; much larger than the reference (GRCh38), but consistent in structure with naturally occurring alleles that are segregating in human populations. This local assembly’s reconstruction was validated with the release of the whole genome assemblies GCA_001297185.1 and GCA_000772585.3, where this allele occurs. 
Additionally, application of this pipeline to coding minisatellites in the PRDM9 and ZNF93 genes enabled recovery of high identity allele structures for these sequence regions whose length was confirmed by PCR from cell line genomic DNA. The internal repeat structure of the PRDM9 allele recovered was consistent with common human-specific alleles. AVAILABILITY AND IMPLEMENTATION: Code available at https://github.com/ndliberial/smrt_pipeline
format Online
Article
Text
id pubmed-5408865
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5408865 2017-05-03 A pipeline for local assembly of minisatellite alleles from single-molecule sequencing data Ogeh, Denye Badge, Richard Bioinformatics Original Papers Oxford University Press 2017-03-01 2016-12-05 /pmc/articles/PMC5408865/ /pubmed/27998939 http://dx.doi.org/10.1093/bioinformatics/btw687 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Ogeh, Denye
Badge, Richard
A pipeline for local assembly of minisatellite alleles from single-molecule sequencing data
title A pipeline for local assembly of minisatellite alleles from single-molecule sequencing data
title_sort pipeline for local assembly of minisatellite alleles from single-molecule sequencing data
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408865/
https://www.ncbi.nlm.nih.gov/pubmed/27998939
http://dx.doi.org/10.1093/bioinformatics/btw687