Short tandem repeat number estimation from paired-end reads for multiple individuals by considering coalescent tree

BACKGROUND: Two types of approaches are mainly considered for the repeat number estimation in short tandem repeat (STR) regions from high-throughput sequencing data: approaches directly counting repeat patterns included in sequence reads spanning the region and approaches based on detecting the diff...

Bibliographic Details
Main authors: Kojima, Kaname; Kawai, Yosuke; Nariai, Naoki; Mimori, Takahiro; Hasegawa, Takanori; Nagasaki, Masao
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009668/
https://www.ncbi.nlm.nih.gov/pubmed/27586631
http://dx.doi.org/10.1186/s12864-016-2821-0
author Kojima, Kaname
Kawai, Yosuke
Nariai, Naoki
Mimori, Takahiro
Hasegawa, Takanori
Nagasaki, Masao
collection PubMed
description BACKGROUND: Two main types of approaches are considered for estimating repeat numbers in short tandem repeat (STR) regions from high-throughput sequencing data: approaches that directly count the repeat patterns in sequence reads spanning the region, and approaches that detect the difference between the insert size inferred from aligned paired-end reads and the actual insert size. Although repeat numbers estimated with the former approaches are highly accurate, the size of target STR regions is limited to the length of sequence reads. The latter approaches can handle STR regions longer than the sequence reads, but the repeat numbers they estimate are less accurate than those from the former approaches. RESULTS: We proposed a new statistical model named coalescentSTR that estimates repeat numbers from paired-end read distances for multiple individuals simultaneously by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented by treating coalescent trees as hidden variables, and the summation over the hidden variables is taken over coalescent trees sampled with Markov chain Monte Carlo based on phased genotypes located around a target STR region. Repeat number information from insert size data is propagated through the sampled coalescent trees, so more accurate estimation of repeat numbers is expected for STR regions longer than the length of sequence reads. To find the repeat numbers that maximize the likelihood of the model, we proposed a state-of-the-art belief propagation algorithm on sampled coalescent trees. CONCLUSIONS: We verified the effectiveness of the proposed approach by comparing it with existing methods on simulation datasets and on real whole genome and whole exome data for HapMap individuals analyzed in the 1000 Genomes Project.
format Online
Article
Text
id pubmed-5009668
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5009668 2016-09-09 BMC Genomics Research BioMed Central 2016-08-31 /pmc/articles/PMC5009668/ /pubmed/27586631 http://dx.doi.org/10.1186/s12864-016-2821-0 Text en © The Author(s) 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Short tandem repeat number estimation from paired-end reads for multiple individuals by considering coalescent tree
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5009668/
https://www.ncbi.nlm.nih.gov/pubmed/27586631
http://dx.doi.org/10.1186/s12864-016-2821-0
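
The insert-size idea mentioned in the abstract can be illustrated with a minimal sketch. This is not coalescentSTR and not the authors' code: it assumes a single individual, a homozygous STR, and a known library mean insert size. The function name and all parameters are hypothetical. If the sample carries more repeat units than the reference, read pairs spanning the STR map closer together on the reference than their true fragment length, and the mean shift divided by the unit length gives the difference in repeat number.

```python
from statistics import mean


def estimate_repeat_number(observed_inserts, library_mean, ref_repeats, unit_len):
    """Naive single-sample estimate (hypothetical helper, for illustration only).

    observed_inserts: insert sizes inferred from read pairs aligned across the STR
    library_mean:     mean insert size of the sequencing library (assumed known)
    ref_repeats:      repeat number of the STR in the reference genome
    unit_len:         length of one repeat unit in bases
    """
    if not observed_inserts:
        raise ValueError("need at least one spanning read pair")
    if unit_len <= 0:
        raise ValueError("unit_len must be positive")
    # Extra sample sequence is missing from the reference, so pairs appear
    # closer together by that many bases when aligned to the reference.
    shift = library_mean - mean(observed_inserts)
    return ref_repeats + shift / unit_len


# Toy numbers, not taken from the paper.
observed = [380, 392, 388, 376, 384]
estimate = estimate_repeat_number(observed, library_mean=400.0, ref_repeats=10, unit_len=4)
print(round(estimate))
```

With only a handful of spanning pairs the mean shift is noisy, which matches the abstract's remark that insert-size approaches are less accurate than direct counting; the paper's model addresses this by sharing information across individuals through sampled coalescent trees.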