
scanPAV: a pipeline for extracting presence–absence variations in genome pairs


Bibliographic Details
Main authors: Giordano, Francesca; Stammnitz, Maximilian R; Murchison, Elizabeth P; Ning, Zemin
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6129304/
https://www.ncbi.nlm.nih.gov/pubmed/29608694
http://dx.doi.org/10.1093/bioinformatics/bty189
Description
Summary:
MOTIVATION: Recent technological advances in genome sequencing have resulted in an exponential increase in the number of sequenced human and non-human genomes. The ever-increasing number of assemblies generated by novel de novo pipelines and strategies demands the development of new software to evaluate assembly quality and completeness. One way to determine the completeness of an assembly is by detecting its presence–absence variations (PAVs) with respect to a reference, where PAVs between two assemblies are defined as the sequences present in one assembly but entirely missing in the other. Beyond assembly error or technology bias, PAVs can also reveal real genome polymorphism, a consequence of species or individual evolution, or horizontal transfer from viruses and bacteria.
RESULTS: We present scanPAV, a pipeline for pairwise assembly comparison to identify and extract sequences present in one assembly but not the other. In this note, we use the GRCh38 reference assembly to assess the completeness of six human genome assemblies from various assembly strategies and sequencing technologies, including Illumina short reads, 10x Genomics linked reads, PacBio and Oxford Nanopore long reads, and Bionano optical maps. We also discuss the PAV polymorphism of seven Tasmanian devil whole genome assemblies of normal animal tissues and devil facial tumour 1 (DFT1) and 2 (DFT2) samples, and the identification of bacterial sequences as contamination in some of the tumour assemblies.
AVAILABILITY AND IMPLEMENTATION: The pipeline is available under the MIT License at https://github.com/wtsi-hpag/scanPAV.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
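To make the PAV definition above concrete, the following is a minimal sketch of the idea, not the scanPAV implementation. It assumes that one sequence from assembly A has already been aligned to assembly B by some external aligner, and that the aligned parts are given as half-open coordinate intervals on the A sequence. The function names, the interval representation and the `min_len` threshold are all hypothetical choices made for illustration. Stretches of the A sequence covered by no alignment are reported as candidate "presence" regions, that is, sequence present in A with no counterpart in B.

```python
from typing import List, Tuple

# Half-open interval [start, end) on a sequence of assembly A (hypothetical representation).
Interval = Tuple[int, int]


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def uncovered_regions(
    length: int, aligned: List[Interval], min_len: int = 1
) -> List[Interval]:
    """Return regions of a sequence of the given length that no alignment covers.

    Regions shorter than min_len are ignored (illustrative threshold only).
    """
    min_len = max(min_len, 1)  # never report empty regions
    gaps: List[Interval] = []
    cursor = 0  # end of the region covered so far
    for start, end in merge_intervals(aligned):
        # Clamp alignment coordinates to the sequence bounds.
        start = min(max(start, 0), length)
        end = min(max(end, 0), length)
        if start - cursor >= min_len:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if length - cursor >= min_len:
        gaps.append((cursor, length))
    return gaps


# Toy example: a 1000 bp contig from assembly A with three alignments to assembly B.
candidates = uncovered_regions(1000, [(0, 200), (150, 400), (700, 900)], min_len=50)
print(candidates)
```

In this toy example the regions 400 to 700 and 900 to 1000 are covered by no alignment, so they would be the candidate sequences present in A but absent from B. Running the same comparison with the roles of A and B swapped would give the absence side of the pair.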