
Insights from a genome-wide truth set of tandem repeat variation

Tools for genotyping tandem repeats (TRs) from short read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcit...

Bibliographic Details
Main Authors:	Weisburd, Ben; Tiao, Grace; Rehm, Heidi L.
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory 2023
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10197592/ https://www.ncbi.nlm.nih.gov/pubmed/37214979 http://dx.doi.org/10.1101/2023.05.05.539588

author	Weisburd, Ben; Tiao, Grace; Rehm, Heidi L.
collection	PubMed
description	Tools for genotyping tandem repeats (TRs) from short read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcity of high-quality orthogonal truth data limited our ability to measure tool accuracy for the millions of other loci throughout the genome. To address this, we developed a TR truth set based on the Synthetic Diploid Benchmark (SynDip). By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample. Our approach did not require running existing genotyping tools on short read or long read sequencing data and provided an alternative, more accurate view of tandem repeat variation. We applied this truth set to compare the strengths and weaknesses of widely-used tools for genotyping TRs, evaluated the completeness of existing genome-wide TR catalogs, and explored the properties of tandem repeat variation throughout the genome. We found that, without filtering, ExpansionHunter had higher accuracy than GangSTR and HipSTR over a wide range of motifs and allele sizes. Also, when errors in allele size occurred, ExpansionHunter tended to overestimate expansion sizes, while GangSTR tended to underestimate them. Additionally, we saw that widely-used TR catalogs miss between 16% and 41% of variant loci in the truth set. These results suggest that genome-wide analyses would benefit from genotyping a larger set of loci as well as further tool development that builds on the strengths of current algorithms. 
To that end, we developed a new catalog of 2.8 million loci that captures 95% of variant loci in the truth set, and created a modified version of ExpansionHunter that runs 2 to 3x faster than the original while producing the same output.
format	Online Article Text
id	pubmed-10197592
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Cold Spring Harbor Laboratory
record_format	MEDLINE/PubMed
spelling	pubmed-10197592 2023-05-20 Insights from a genome-wide truth set of tandem repeat variation Weisburd, Ben; Tiao, Grace; Rehm, Heidi L. bioRxiv Article Cold Spring Harbor Laboratory 2023-05-08 /pmc/articles/PMC10197592/ /pubmed/37214979 http://dx.doi.org/10.1101/2023.05.05.539588 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
title	Insights from a genome-wide truth set of tandem repeat variation
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10197592/ https://www.ncbi.nlm.nih.gov/pubmed/37214979 http://dx.doi.org/10.1101/2023.05.05.539588
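The abstract describes selecting the insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs. The sketch below illustrates that idea in a simplified form and is not the authors' implementation. It assumes VCF-style alleles that share a leading anchor base, it looks only at the inserted or deleted bases (not at the flanking reference sequence), and it requires at least two whole copies of the motif. The function names `find_repeat_motif` and `classify_indel` are hypothetical.

```python
def find_repeat_motif(seq, min_motif=2, max_motif=50):
    """Return the shortest motif (min_motif..max_motif bp) whose whole
    copies spell out seq at least twice, or None if there is none."""
    seq = seq.upper()
    for size in range(min_motif, min(max_motif, len(seq) // 2) + 1):
        if len(seq) % size != 0:
            continue  # seq cannot be a whole number of copies of this size
        motif = seq[:size]
        if len(set(motif)) == 1:
            continue  # single-base runs such as "AA" are homopolymers, not 2+ bp motifs
        if motif * (len(seq) // size) == seq:
            return motif
    return None


def classify_indel(ref, alt):
    """Classify a VCF-style indel as a pure repeat expansion or contraction.

    Assumption: the shorter allele is a prefix of the longer one (shared
    anchor base). Returns (kind, motif) or None if the changed bases are
    not a pure repeat under the simplified rule above."""
    if len(alt) > len(ref) and alt.startswith(ref):
        kind, changed = "expansion", alt[len(ref):]
    elif len(ref) > len(alt) and ref.startswith(alt):
        kind, changed = "contraction", ref[len(alt):]
    else:
        return None  # not a simple insertion or deletion
    motif = find_repeat_motif(changed)
    return (kind, motif) if motif else None


if __name__ == "__main__":
    print(classify_indel("G", "GCAGCAG"))  # inserted bases: CAGCAG
    print(classify_indel("GCACA", "G"))    # deleted bases: CACA
    print(classify_indel("G", "GACGT"))    # inserted bases: ACGT
```

A genome-wide version would also need to compare the changed bases against the adjacent reference sequence and handle interrupted repeats, both of which this sketch leaves out.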

Similar Items