Towards a better understanding of the low recall of insertion variants with short-read based variant callers
BACKGROUND: Since 2009, numerous tools have been developed to detect structural variants using short-read technologies. Insertions >50 bp are one of the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long-read technologies has comple...
Main authors: | Delage, Wesley J.; Thevenon, Julien; Lemaitre, Claire |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7640490/ https://www.ncbi.nlm.nih.gov/pubmed/33148192 http://dx.doi.org/10.1186/s12864-020-07125-5 |
_version_ | 1783605756407316480 |
---|---|
author | Delage, Wesley J. Thevenon, Julien Lemaitre, Claire |
collection | PubMed |
description | BACKGROUND: Since 2009, numerous tools have been developed to detect structural variants using short-read technologies. Insertions >50 bp are one of the hardest types to discover and are drastically underrepresented in gold standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read based tools. RESULTS: In this work, we performed an in-depth analysis of these unprecedented insertion callsets in order to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature and size of the inserted sequence, the genomic context of the insertion site and the breakpoint junction complexity. Because these levels are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions and 70% were located in simple repeats. Consequently, the recall of short-read based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the factor with the greatest impact was the insertion type rather than the genomic context, with various difficulties being handled differently among the tested structural variant callers, and they highlighted the lack of sequence resolution for most insertion calls. CONCLUSIONS: Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at doi:10.1186/s12864-020-07125-5. |
format | Online Article Text |
id | pubmed-7640490 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-7640490 2020-11-04 BMC Genomics Research Article BioMed Central 2020-11-04 /pmc/articles/PMC7640490/ /pubmed/33148192 http://dx.doi.org/10.1186/s12864-020-07125-5 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
title | Towards a better understanding of the low recall of insertion variants with short-read based variant callers |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7640490/ https://www.ncbi.nlm.nih.gov/pubmed/33148192 http://dx.doi.org/10.1186/s12864-020-07125-5 |