Population-scale detection of non-reference sequence variants using colored de Bruijn graphs

MOTIVATION: With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of de...

Bibliographic Details
Main Authors: Krannich, Thomas; White, W Timothy J; Niehus, Sebastian; Holley, Guillaume; Halldórsson, Bjarni V; Kehr, Birte
Format: Online Article Text
Language: English
Published: Oxford University Press 2021
Subjects: Original Papers
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8756200/
https://www.ncbi.nlm.nih.gov/pubmed/34726732
http://dx.doi.org/10.1093/bioinformatics/btab749
_version_ 1784632517121277952
author Krannich, Thomas
White, W Timothy J
Niehus, Sebastian
Holley, Guillaume
Halldórsson, Bjarni V
Kehr, Birte
author_sort Krannich, Thomas
collection PubMed
description MOTIVATION: With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. RESULTS: We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. AVAILABILITY AND IMPLEMENTATION: The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-8756200
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8756200 2022-01-13 Bioinformatics Original Papers Oxford University Press 2021-11-02 /pmc/articles/PMC8756200/ /pubmed/34726732 http://dx.doi.org/10.1093/bioinformatics/btab749 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Population-scale detection of non-reference sequence variants using colored de Bruijn graphs
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8756200/
https://www.ncbi.nlm.nih.gov/pubmed/34726732
http://dx.doi.org/10.1093/bioinformatics/btab749