Insights from a genome-wide truth set of tandem repeat variation
Tools for genotyping tandem repeats (TRs) from short read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcit...
Main Authors: | Weisburd, Ben; Tiao, Grace; Rehm, Heidi L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory 2023 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10197592/ https://www.ncbi.nlm.nih.gov/pubmed/37214979 http://dx.doi.org/10.1101/2023.05.05.539588 |
_version_ | 1785044580007149568 |
---|---|
author | Weisburd, Ben Tiao, Grace Rehm, Heidi L. |
author_sort | Weisburd, Ben |
collection | PubMed |
description | Tools for genotyping tandem repeats (TRs) from short read sequencing data have improved significantly over the past decade. Extensive comparisons of these tools to gold standard diagnostic methods like RP-PCR have confirmed their accuracy for tens to hundreds of well-studied loci. However, a scarcity of high-quality orthogonal truth data limited our ability to measure tool accuracy for the millions of other loci throughout the genome. To address this, we developed a TR truth set based on the Synthetic Diploid Benchmark (SynDip). By identifying the subset of insertions and deletions that represent TR expansions or contractions with motifs between 2 and 50 base pairs, we obtained accurate genotypes for 139,795 pure and 6,845 interrupted repeats in a single diploid sample. Our approach did not require running existing genotyping tools on short read or long read sequencing data and provided an alternative, more accurate view of tandem repeat variation. We applied this truth set to compare the strengths and weaknesses of widely-used tools for genotyping TRs, evaluated the completeness of existing genome-wide TR catalogs, and explored the properties of tandem repeat variation throughout the genome. We found that, without filtering, ExpansionHunter had higher accuracy than GangSTR and HipSTR over a wide range of motifs and allele sizes. Also, when errors in allele size occurred, ExpansionHunter tended to overestimate expansion sizes, while GangSTR tended to underestimate them. Additionally, we saw that widely-used TR catalogs miss between 16% and 41% of variant loci in the truth set. These results suggest that genome-wide analyses would benefit from genotyping a larger set of loci as well as further tool development that builds on the strengths of current algorithms. 
To that end, we developed a new catalog of 2.8 million loci that captures 95% of variant loci in the truth set, and created a modified version of ExpansionHunter that runs 2 to 3x faster than the original while producing the same output. |
format | Online Article Text |
id | pubmed-10197592 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory |
record_format | MEDLINE/PubMed |
spelling | pubmed-10197592 2023-05-20 Insights from a genome-wide truth set of tandem repeat variation Weisburd, Ben Tiao, Grace Rehm, Heidi L. bioRxiv Article Cold Spring Harbor Laboratory 2023-05-08 /pmc/articles/PMC10197592/ /pubmed/37214979 http://dx.doi.org/10.1101/2023.05.05.539588 Text en https://creativecommons.org/licenses/by/4.0/ This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. |
title | Insights from a genome-wide truth set of tandem repeat variation |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10197592/ https://www.ncbi.nlm.nih.gov/pubmed/37214979 http://dx.doi.org/10.1101/2023.05.05.539588 |