Towards a better understanding of the low recall of insertion variants with short-read based variant callers

Bibliographic Details
Main Authors: Delage, Wesley J.; Thevenon, Julien; Lemaitre, Claire
Format: Online Article Text
Language: English
Published: BioMed Central 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7640490/
https://www.ncbi.nlm.nih.gov/pubmed/33148192
http://dx.doi.org/10.1186/s12864-020-07125-5
author Delage, Wesley J.
Thevenon, Julien
Lemaitre, Claire
collection PubMed
description BACKGROUND: Since 2009, numerous tools have been developed to detect structural variants using short-read technologies. Insertions >50 bp are one of the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read based tools.
RESULTS: In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions and 70% were located in simple repeats. Consequently, the recall of short-read based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the most influential factor was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested structural variant callers, and they highlighted the lack of sequence resolution for most insertion calls.
CONCLUSIONS: Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at (doi:10.1186/s12864-020-07125-5).
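The abstract reports recall broken down by insertion type (for example, tandem repeats versus mobile element insertions). As a minimal sketch of what such a per-type recall figure means, and not the authors' actual evaluation procedure, the Python below counts a truth insertion as discovered when a call lies on the same chromosome within a position tolerance. The function name `recall_by_type`, the 10 bp tolerance, the type labels and all coordinates are hypothetical toy values chosen for illustration only.

```python
from collections import defaultdict


def recall_by_type(truth, calls, tolerance=10):
    """Return {insertion_type: recall} for a toy truth set and call set.

    truth: list of (chrom, pos, insertion_type) tuples (hypothetical format)
    calls: list of (chrom, pos) tuples (hypothetical format)
    tolerance: maximum distance in bp between a call and a truth site
               for the truth site to count as discovered (an assumption)
    """
    # Group call positions by chromosome for simple lookup.
    calls_by_chrom = defaultdict(list)
    for chrom, pos in calls:
        calls_by_chrom[chrom].append(pos)

    found = defaultdict(int)
    total = defaultdict(int)
    for chrom, pos, ins_type in truth:
        total[ins_type] += 1
        # A truth insertion is discovered if any call is close enough.
        if any(abs(pos - c) <= tolerance for c in calls_by_chrom.get(chrom, [])):
            found[ins_type] += 1

    # Recall = discovered truth insertions / all truth insertions, per type.
    return {t: found[t] / total[t] for t in total}


# Toy data, invented for illustration; not taken from the article.
truth = [
    ("chr1", 100, "tandem_repeat"),
    ("chr1", 500, "tandem_repeat"),
    ("chr2", 300, "mobile_element"),
    ("chr2", 900, "mobile_element"),
]
calls = [("chr1", 104), ("chr2", 300), ("chr2", 905)]

result = recall_by_type(truth, calls)
print(result)
```

Real benchmarks typically also compare insertion size and inserted sequence, which this position-only sketch ignores.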
format Online
Article
Text
id pubmed-7640490
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-7640490 2020-11-04 BMC Genomics Research Article BioMed Central 2020-11-04 /pmc/articles/PMC7640490/ /pubmed/33148192 http://dx.doi.org/10.1186/s12864-020-07125-5 Text en
© The Author(s) 2020. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title Towards a better understanding of the low recall of insertion variants with short-read based variant callers
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7640490/
https://www.ncbi.nlm.nih.gov/pubmed/33148192
http://dx.doi.org/10.1186/s12864-020-07125-5