Insights from a genome-wide truth set of tandem repeat variation

Tools for genotyping tandem repeats (TRs) from short read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcit...

Bibliographic Details
Main authors: Weisburd, Ben, Tiao, Grace, Rehm, Heidi L.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10197592/
https://www.ncbi.nlm.nih.gov/pubmed/37214979
http://dx.doi.org/10.1101/2023.05.05.539588
author Weisburd, Ben
Tiao, Grace
Rehm, Heidi L.
collection PubMed
description Tools for genotyping tandem repeats (TRs) from short-read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold-standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcity of high-quality orthogonal truth data has limited our ability to measure tool accuracy for the millions of other loci throughout the genome. To address this, we developed a TR truth set based on the Synthetic Diploid Benchmark (SynDip). By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample. Our approach did not require running existing genotyping tools on short-read or long-read sequencing data and provided an alternative, more accurate view of tandem repeat variation. We applied this truth set to compare the strengths and weaknesses of widely used tools for genotyping TRs, evaluated the completeness of existing genome-wide TR catalogs, and explored the properties of tandem repeat variation throughout the genome. We found that, without filtering, ExpansionHunter had higher accuracy than GangSTR and HipSTR over a wide range of motifs and allele sizes. Also, when errors in allele size occurred, ExpansionHunter tended to overestimate expansion sizes, while GangSTR tended to underestimate them. Additionally, we saw that widely used TR catalogs miss between 16% and 41% of variant loci in the truth set. These results suggest that genome-wide analyses would benefit from genotyping a larger set of loci as well as further tool development that builds on the strengths of current algorithms. To that end, we developed a new catalog of 2.8 million loci that captures 95% of variant loci in the truth set, and created a modified version of ExpansionHunter that runs 2 to 3 times faster than the original while producing the same output.
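The abstract's core filtering idea, keeping only insertions and deletions that look like expansions or contractions of a 2 to 50 bp motif, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the left-aligned VCF-style allele convention, and the `min_flank_copies` threshold are all assumptions made for illustration, and the sketch handles only pure repeats, not interrupted ones.

```python
def shortest_motif(seq, min_len=2, max_len=50):
    """Return the shortest motif (min_len..max_len bp) whose exact copies
    tile seq, or None. Homopolymer runs (e.g. 'AAAA') are rejected."""
    n = len(seq)
    for k in range(min_len, min(max_len, n) + 1):
        if n % k == 0 and seq[:k] * (n // k) == seq:
            motif = seq[:k]
            # A motif made of one repeated base is really a 1 bp repeat.
            return motif if len(set(motif)) > 1 else None
    return None


def count_leading_copies(motif, flank):
    """Count consecutive copies of motif at the start of flank."""
    k = len(motif)
    copies = 0
    while flank[copies * k:(copies + 1) * k] == motif:
        copies += 1
    return copies


def classify_indel(ref, alt, right_flank, min_flank_copies=1):
    """Classify a left-aligned indel as a pure TR expansion or contraction.

    ref, alt: VCF-style alleles sharing a leading anchor base (assumption).
    right_flank: reference sequence immediately after the ref allele.
    min_flank_copies: hypothetical threshold for adjacent reference copies.
    Returns (kind, motif, copies_changed, ref_copies) or None.
    """
    if len(alt) > len(ref) and alt.startswith(ref):
        kind, indel_seq = "expansion", alt[len(ref):]
    elif len(ref) > len(alt) and ref.startswith(alt):
        kind, indel_seq = "contraction", ref[len(alt):]
    else:
        return None  # SNV or complex variant, out of scope for this sketch

    motif = shortest_motif(indel_seq)
    if motif is None:
        return None

    flank_copies = count_leading_copies(motif, right_flank)
    if flank_copies < min_flank_copies:
        return None  # no matching repeat in the adjacent reference sequence

    copies_changed = len(indel_seq) // len(motif)
    # Deleted copies were part of the reference repeat tract.
    ref_copies = flank_copies + (copies_changed if kind == "contraction" else 0)
    return kind, motif, copies_changed, ref_copies
```

For example, `classify_indel("T", "TCAGCAG", "CAGCAGCAGTT")` treats the inserted `CAGCAG` as two extra copies of `CAG` next to three reference copies, while a 4 bp homopolymer insertion or an insertion with no matching repeat in the flank is rejected.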
format Online
Article
Text
id pubmed-10197592
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory
record_format MEDLINE/PubMed
spelling pubmed-10197592 2023-05-20 Insights from a genome-wide truth set of tandem repeat variation Weisburd, Ben Tiao, Grace Rehm, Heidi L. bioRxiv Article Cold Spring Harbor Laboratory 2023-05-08 /pmc/articles/PMC10197592/ /pubmed/37214979 http://dx.doi.org/10.1101/2023.05.05.539588 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title Insights from a genome-wide truth set of tandem repeat variation
topic Article