Genome-wide detection of short tandem repeat expansions by long-read sequencing
Main authors: | , , |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7768641/ https://www.ncbi.nlm.nih.gov/pubmed/33371889 http://dx.doi.org/10.1186/s12859-020-03876-w |
Summary: | BACKGROUND: A short tandem repeat (STR), or “microsatellite”, is a tract of DNA in which a specific motif (typically < 10 base pairs) is repeated multiple times. STRs are abundant throughout the human genome, and specific repeat expansions may be associated with human diseases. Long-read sequencing coupled with bioinformatics tools enables the estimation of repeat counts for STRs. However, with the exception of a few well-known disease-relevant STRs, the normal ranges of repeat counts for most STRs in human populations are not well known, which prevents the prioritization of STRs that may be associated with human diseases. RESULTS: In this study, we extend the computational tool RepeatHMM to infer normal ranges of 432,604 STRs using 21 long-read sequencing datasets on human genomes, and build a genomic-scale database called RepeatHMM-DB with normal repeat ranges for these STRs. Evaluation on 13 well-known repeats shows that the inferred repeat ranges are good estimates of the repeat ranges reported in the literature from population-scale studies. This database, together with a repeat expansion estimation tool such as RepeatHMM, enables genomic-scale scanning of repeat regions in newly sequenced genomes to identify disease-relevant repeat expansions. As a case study of RepeatHMM-DB, we evaluate the CAG repeats of ATXN3 for 20 patients with spinocerebellar ataxia type 3 (SCA3) and 5 unaffected individuals, and correctly classify each individual. CONCLUSIONS: In summary, RepeatHMM-DB can facilitate the prioritization and identification of disease-relevant STRs from whole-genome long-read sequencing data on patients with undiagnosed diseases. RepeatHMM-DB is incorporated into RepeatHMM and is available at https://github.com/WGLab/RepeatHMM. |
---|
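
The summary describes comparing a sample's repeat count against a normal range inferred from reference genomes. The following is a minimal sketch of that idea only, not RepeatHMM's actual method or API: the names `RepeatRange`, `infer_normal_range`, and `is_expanded`, the min/max range rule, the `margin` parameter, and all repeat counts are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RepeatRange:
    """Inclusive range of repeat counts seen in reference samples (hypothetical type)."""
    low: int
    high: int


def infer_normal_range(counts: Iterable[int]) -> RepeatRange:
    """Assumed rule: the normal range is simply the min and max observed count."""
    counts = list(counts)
    if not counts:
        raise ValueError("need at least one repeat count")
    return RepeatRange(min(counts), max(counts))


def is_expanded(count: int, normal: RepeatRange, margin: int = 0) -> bool:
    """Flag a count that exceeds the upper bound by more than `margin` repeats."""
    return count > normal.high + margin


# Made-up allele repeat counts for one STR locus across reference samples.
reference_counts = [14, 21, 23, 27, 28, 23, 14]
normal = infer_normal_range(reference_counts)

# Made-up (allele 1, allele 2) repeat counts for two new samples.
sample_alleles = {"sample_A": (23, 71), "sample_B": (14, 27)}

# A sample is flagged if either allele lies beyond the tolerated range.
flagged = {
    name: any(is_expanded(c, normal, margin=5) for c in alleles)
    for name, alleles in sample_alleles.items()
}
print(normal, flagged)
```

A real analysis would likely need to account for estimation error in long-read repeat counts and for locus-specific pathogenic thresholds, which a plain min/max rule ignores.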