Accurate typing of short tandem repeats from genome-wide sequencing data and its applications

Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem rep...

Bibliographic Details
Main Authors: Fungtammasan, Arkarachai, Ananda, Guruprasad, Hile, Suzanne E., Su, Marcia Shu-Wei, Sun, Chen, Harris, Robert, Medvedev, Paul, Eckert, Kristin, Makova, Kateryna D.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4417121/
https://www.ncbi.nlm.nih.gov/pubmed/25823460
http://dx.doi.org/10.1101/gr.185892.114
_version_ 1782369309486481408
author Fungtammasan, Arkarachai
Ananda, Guruprasad
Hile, Suzanne E.
Su, Marcia Shu-Wei
Sun, Chen
Harris, Robert
Medvedev, Paul
Eckert, Kristin
Makova, Kateryna D.
author_facet Fungtammasan, Arkarachai
Ananda, Guruprasad
Hile, Suzanne E.
Su, Marcia Shu-Wei
Sun, Chen
Harris, Robert
Medvedev, Paul
Eckert, Kristin
Makova, Kateryna D.
author_sort Fungtammasan, Arkarachai
collection PubMed
description Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution.
format Online
Article
Text
id pubmed-4417121
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-4417121 2015-11-01 Accurate typing of short tandem repeats from genome-wide sequencing data and its applications Fungtammasan, Arkarachai Ananda, Guruprasad Hile, Suzanne E. Su, Marcia Shu-Wei Sun, Chen Harris, Robert Medvedev, Paul Eckert, Kristin Makova, Kateryna D. Genome Res Method Short tandem repeats (STRs) are implicated in dozens of human genetic diseases and contribute significantly to genome variation and instability. Yet profiling STRs from short-read sequencing data is challenging because of their high sequencing error rates. Here, we developed STR-FM, short tandem repeat profiling using flank-based mapping, a computational pipeline that can detect the full spectrum of STR alleles from short-read data, can adapt to emerging read-mapping algorithms, and can be applied to heterogeneous genetic samples (e.g., tumors, viruses, and genomes of organelles). We used STR-FM to study STR error rates and patterns in publicly available human and in-house generated ultradeep plasmid sequencing data sets. We discovered that STRs sequenced with a PCR-free protocol have up to ninefold fewer errors than those sequenced with a PCR-containing protocol. We constructed an error correction model for genotyping STRs that can distinguish heterozygous alleles containing STRs with consecutive repeat numbers. Applying our model and pipeline to Illumina sequencing data with 100-bp reads, we could confidently genotype several disease-related long trinucleotide STRs. Utilizing this pipeline, for the first time we determined the genome-wide STR germline mutation rate from a deeply sequenced human pedigree. Additionally, we built a tool that recommends minimal sequencing depth for accurate STR genotyping, depending on repeat length and sequencing read length. The required read depth increases with STR length and is lower for a PCR-free protocol. This suite of tools addresses the pressing challenges surrounding STR genotyping, and thus is of wide interest to researchers investigating disease-related STRs and STR evolution. Cold Spring Harbor Laboratory Press 2015-05 /pmc/articles/PMC4417121/ /pubmed/25823460 http://dx.doi.org/10.1101/gr.185892.114 Text en © 2015 Fungtammasan et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
spellingShingle Method
Fungtammasan, Arkarachai
Ananda, Guruprasad
Hile, Suzanne E.
Su, Marcia Shu-Wei
Sun, Chen
Harris, Robert
Medvedev, Paul
Eckert, Kristin
Makova, Kateryna D.
Accurate typing of short tandem repeats from genome-wide sequencing data and its applications
title Accurate typing of short tandem repeats from genome-wide sequencing data and its applications
title_full Accurate typing of short tandem repeats from genome-wide sequencing data and its applications
title_fullStr Accurate typing of short tandem repeats from genome-wide sequencing data and its applications
title_full_unstemmed Accurate typing of short tandem repeats from genome-wide sequencing data and its applications
title_short Accurate typing of short tandem repeats from genome-wide sequencing data and its applications
title_sort accurate typing of short tandem repeats from genome-wide sequencing data and its applications
topic Method
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4417121/
https://www.ncbi.nlm.nih.gov/pubmed/25823460
http://dx.doi.org/10.1101/gr.185892.114
work_keys_str_mv AT fungtammasanarkarachai accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT anandaguruprasad accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT hilesuzannee accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT sumarciashuwei accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT sunchen accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT harrisrobert accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT medvedevpaul accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT eckertkristin accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications
AT makovakaterynad accuratetypingofshorttandemrepeatsfromgenomewidesequencingdataanditsapplications